At a glance: what you need before serving Mixtral or DeepSeek-V3
Time: ~60 min to first request · Prereqs: Python 3.10+, CUDA 12.1+, uv installed · Hardware: 2× H100 80 GB minimum for Mixtral 8x7B; 8× H100/H200 80 GB or multi-node for DeepSeek-V3 (671B total, 37B activated per token) · Install path: uv pip install -U vllm · Parallelism: tensor + expert parallelism required for both models; expert parallelism must be explicitly enabled at launch
vLLM describes itself as a "high-throughput and memory-efficient inference and serving engine for LLMs" and natively supports Mixture-of-Experts architectures including Mixtral 8x7B, Mixtral 8x22B, DeepSeek-V3, DeepSeek-V3.1, and DeepSeek-V3.2-Exp. The vLLM docs confirm support for tensor, pipeline, data, expert, and context parallelism across distributed inference — but the configuration surface for MoE models is model-specific and backend-specific. Copying a generic vllm serve command without adjusting parallelism topology and batching limits is the most common path to OOMs and stalled deployments.
The practical readiness checklist before you proceed:
- Model size vs. GPU VRAM: Mixtral 8x7B fits on 2–4× A100/H100 80 GB with BF16 weights; Mixtral 8x22B needs 4–8 GPUs. DeepSeek-V3's 671B parameter count (37B activated per token) demands a carefully planned multi-GPU or multi-node configuration — the activated footprint is manageable, but the full weight load and KV cache add up fast. A quick sizing sketch follows this list.
- Expert parallelism is a runtime flag: It is not inferred from the model checkpoint. You must enable it explicitly.
- Backend alignment: NVIDIA GPU, AMD GPU, and Huawei Ascend each have distinct launch paths. Mixing flags from different backend docs is a silent misconfiguration risk.
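To make the first checklist item concrete, here is a back-of-envelope weight-memory calculation. The parameter counts are the published totals (46.7B for Mixtral 8x7B, roughly 141B for 8x22B, 671B for DeepSeek-V3); treat the output as a floor, since KV cache, activations, and fragmentation headroom come on top:
# Back-of-envelope weight memory. Assumes BF16 (2 bytes/param) unless noted;
# KV cache and activation memory are NOT included.
def weight_gib(params_billions: float, bytes_per_param: float = 2.0) -> float:
    return params_billions * 1e9 * bytes_per_param / 1024**3

print(f"Mixtral 8x7B  (BF16): {weight_gib(46.7):6.1f} GiB")       # ~87 GiB -> 2x 80 GB GPUs
print(f"Mixtral 8x22B (BF16): {weight_gib(141.0):6.1f} GiB")      # ~263 GiB -> 4-8 GPUs
print(f"DeepSeek-V3   (FP8) : {weight_gib(671.0, 1.0):6.1f} GiB") # ~625 GiB, native FP8 checkpoint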
Set up vLLM and confirm MoE model support
vLLM supports MoE models explicitly. The docs homepage states: "Mixture-of-Expert LLMs (e.g., Mixtral, DeepSeek-V3, Qwen-MoE, GPT-OSS)" alongside tensor, pipeline, data, expert, and context parallelism as first-class features. MoE support is not an experimental extension — it is part of the main serving surface.
Production Note: Choose your hardware backend before installing. The standard
vllm wheel targets NVIDIA CUDA. AMD GPUs, Huawei Ascend (via vllm-ascend), and other accelerators require a different package path or plugin. Starting with the wrong backend means a reinstall before you serve a single token.
Install with uv or build from source
The recommended path uses uv for faster dependency resolution:
$ uv venv .venv --python 3.11
$ source .venv/bin/activate
$ uv pip install -U vllm --torch-backend auto
The --torch-backend auto flag, shown in the DeepSeek-V3.1 recipe, lets the installer select the correct PyTorch build for your detected CUDA version. For a straight pip install without uv:
$ pip install vllm
For development workflows where you need to modify engine internals or debug backend routing, build from source instead:
$ git clone https://github.com/vllm-project/vllm.git
$ cd vllm
$ pip install -e .  # editable install so engine edits take effect without reinstalling
Source builds keep your edits and the running engine in sync — mandatory if you plan to patch expert routing or add observability hooks.
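Whichever install path you choose, confirm the package imports and report its version before moving on; this catches CUDA or ABI mismatches early:
$ python -c "import vllm; print(vllm.__version__)"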
Verify the server entrypoint and OpenAI-compatible API
Before loading any MoE checkpoint, confirm the serving surface works on a small model to avoid wasting GPU memory on a misconfigured stack. The vLLM OpenAI-compatible server docs confirm the server starts with vllm serve:
$ vllm serve facebook/opt-125m --port 8000
Once the process is up, validate the API shape with a minimal chat completion:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="placeholder")
response = client.chat.completions.create(
    model="facebook/opt-125m",
    messages=[{"role": "user", "content": "ping"}],
    max_tokens=8,
)
print(response.choices[0].message.content)
If this returns any coherent text, the HTTP layer and OpenAI-compatible routing are functional. Only then should you load a MoE checkpoint into the same serving stack.
Serve Mixtral with the right expert-parallel settings
Mixtral's sparse MoE design — each token routes to two experts per layer — means the router network runs on every forward pass, but only a fraction of total parameters activate. This gives Mixtral faster inference at low batch sizes and higher throughput at large batch sizes compared to dense models of equivalent parameter count, but it concentrates memory pressure on the KV cache and routing paths rather than weight loading alone.
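To make the routing mechanics concrete, here is a minimal, illustrative sketch of top-2 gating in PyTorch. It is a toy reconstruction of the idea described in the Mixtral paper, not vLLM's fused MoE kernel, and the names (top2_route, router_w) are hypothetical:
import torch

def top2_route(hidden, router_w):
    """Toy top-2 MoE gating: pick 2 of n_experts per token, renormalize gates."""
    logits = hidden @ router_w                    # [batch, n_experts] router scores
    top2_vals, top2_idx = logits.topk(2, dim=-1)  # two experts per token (Mixtral)
    gates = torch.softmax(top2_vals, dim=-1)      # weights over the selected pair
    return gates, top2_idx

# 4 tokens, 8 experts (Mixtral's per-layer expert count)
hidden = torch.randn(4, 16)
router_w = torch.randn(16, 8)
gates, experts = top2_route(hidden, router_w)
print(experts)  # which 2 of the 8 experts each token activates
Only the two selected experts' FFNs run for each token, which is why weight memory scales with total parameters while per-token compute scales with activated parameters.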
The critical serving configuration for Mixtral 8x22B on a large multi-GPU node:
model: mistralai/Mixtral-8x22B-v0.1
tensor_parallel_size: 8 # shards attention and FFN weights across 8 GPUs
enable_expert_parallel: true # routes experts to dedicated GPU subsets
max_model_len: 32768 # align to your actual context budget
max_num_batched_tokens: 8192 # start conservative; raise after stability check
gpu_memory_utilization: 0.90 # reserve 10% for fragmentation headroom
Launch with these flags directly on the CLI:
$ vllm serve mistralai/Mixtral-8x22B-v0.1 \
--tensor-parallel-size 8 \
--enable-expert-parallel \
--max-model-len 32768 \
--max-num-batched-tokens 8192 \
--gpu-memory-utilization 0.90 \
--port 8000
Choose tensor parallelism versus expert parallelism for Mixtral
vLLM's expert parallel deployment docs state: "vLLM supports Expert Parallelism (EP), which allows experts in Mixture-of-Experts (MoE) models to be deployed on separate GPUs, increasing locality, efficiency, and throughput overall."
The decision between tensor parallelism (TP) and expert parallelism (EP) is not binary — in most production deployments you use both, but you tune their ratio based on where your bottleneck lives:
- Use EP when routing locality is the bottleneck. If experts for a given token are scattered across GPUs, cross-device all-to-all communication dominates latency. EP pins sets of experts to specific GPUs, reducing that traffic.
- Use more TP when per-GPU weight footprint is the bottleneck. As the vLLM performance docs explain, increasing tensor_parallel_size shards model weights so each GPU has more memory available for the KV cache.
Pro Tip: For Mixtral on a multi-GPU NVLink or NVSwitch system, start with --tensor-parallel-size 4 or --tensor-parallel-size 8 depending on the model shard size, and enable --enable-expert-parallel. Measure end-to-end latency with and without EP before committing to a topology.
Watch Out: Over-splitting experts across more devices than your GPU interconnect can efficiently service inverts the locality benefit. On a PCIe-only node without NVLink, aggressive EP configurations can increase p95 latency compared to pure TP. Measure both before committing.
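A minimal way to act on that measurement advice: time p50/p95 end-to-end latency against the running server, once launched with --enable-expert-parallel and once without. The prompt and sample count here are arbitrary examples:
import statistics
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="placeholder")

latencies = []
for _ in range(50):  # enough samples for a rough p95
    t0 = time.perf_counter()
    client.chat.completions.create(
        model="mistralai/Mixtral-8x22B-v0.1",
        messages=[{"role": "user", "content": "Summarize MoE routing in one line."}],
        max_tokens=32,
    )
    latencies.append(time.perf_counter() - t0)

p50 = statistics.median(latencies)
p95 = statistics.quantiles(latencies, n=20)[-1]  # 95th percentile cut point
print(f"p50={p50:.3f}s  p95={p95:.3f}s")
Run the same script against both topologies and compare the p95 values, not just the medians; EP problems tend to show up in the tail.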
Tune batching and context limits for expert-heavy traffic
Two knobs dominate MoE serving stability: max_num_batched_tokens and gpu_memory_utilization.
Per the vLLM engine args docs, --max-num-batched-tokens sets the maximum number of tokens processed in a single engine iteration. Under expert-heavy traffic where multiple long-context requests run simultaneously, this ceiling determines whether the KV cache overflows.
# Conservative baseline for Mixtral 8x22B under mixed context traffic
max_model_len: 32768 # cap context to what your node can sustain
max_num_batched_tokens: 4096 # reduce if you see cache pressure OOMs
max_num_seqs: 64 # concurrent sequence ceiling
gpu_memory_utilization: 0.88 # conservative; raise to 0.92 once stable
The vLLM optimization docs confirm: "vLLM pre-allocates GPU cache using this percentage of memory. By increasing utilization, you can provide more KV cache space." But raising gpu_memory_utilization without reducing batching limits is the fastest path to fragmentation-induced crashes under burst load.
Check the model with a minimal request
After launch, smoke-test before enabling production traffic:
$ curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "mistralai/Mixtral-8x22B-v0.1",
"messages": [{"role": "user", "content": "Say OK"}],
"max_tokens": 4}'
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="placeholder")
# Single-token round-trip confirms routing and sampling pipeline
resp = client.chat.completions.create(
    model="mistralai/Mixtral-8x22B-v0.1",
    messages=[{"role": "user", "content": "Say OK"}],
    max_tokens=4,
)
assert resp.choices[0].message.content.strip(), "Empty response — check backend logs"
print("Serving confirmed:", resp.choices[0].message.content)
A successful response confirms weight loading, expert routing, and the HTTP layer are all operational. Latency instability at this stage points to batching or memory configuration, not model correctness.
Serve DeepSeek-V3 with expert parallelism and hardware-aware flags
DeepSeek-V3 is "a strong Mixture-of-Experts (MoE) language model with 671B total parameters and 37B activated for each token." That activated-parameter footprint keeps per-token compute manageable, but the full weight matrix and KV cache at scale require careful GPU topology planning before the first vllm serve call. Documentation for DeepSeek-V3 serving is scattered across the vLLM recipes repository, the Ascend-specific tutorial, and the main EP docs — the following sections unify those paths.
Production Note: DeepSeek-V3's weight size means a single-node NVIDIA deployment needs substantial memory headroom. Multi-node configurations require explicit process placement and high-bandwidth interconnect (InfiniBand or NVLink-scale fabric). Ascend deployments use the separate
vllm-ascend package with its own version gate — do not mix install paths.
Pick the deployment shape for your GPU node
The vLLM expert parallel deployment docs state that EP "is typically coupled with Data Parallelism (DP)." For DeepSeek-V3, the DeepSeek-V3.2-Exp recipe shows this pattern concretely using -dp 8 --enable-expert-parallel.
Single-node shape:
# DeepSeek-V3 single-node deployment shape
model: deepseek-ai/DeepSeek-V3
tensor_parallel_size: 8
data_parallel_size: 1 # scale up for multi-replica single-node
enable_expert_parallel: true
max_model_len: 16384 # reduce if KV cache OOMs appear
max_num_batched_tokens: 4096
gpu_memory_utilization: 0.88
Multi-node shape:
# DeepSeek-V3 multi-node deployment shape — node 0 and node 1 must be coordinated
tensor_parallel_size: 8
data_parallel_size: 2 # one DP replica per node
enable_expert_parallel: true
max_model_len: 32768
max_num_batched_tokens: 8192
gpu_memory_utilization: 0.90
Choose TP, DP, and EP together as a deployment shape — not independently. A mismatch in any dimension creates either memory oversubscription or cross-device routing storms.
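A tiny pre-launch sanity check captures that constraint: the world size TP × DP must match the visible GPU count, with EP distributing experts across the resulting ranks. This helper is hypothetical, not part of vLLM:
import torch

def validate_shape(tp: int, dp: int) -> None:
    """Hypothetical pre-flight check: TP x DP must equal the visible GPU count.
    On multi-node deployments, run it per node against the local GPU count."""
    required = tp * dp
    available = torch.cuda.device_count()
    if required != available:
        raise ValueError(
            f"Deployment shape needs TP*DP = {required} GPUs, but {available} are "
            f"visible. Adjust the shape as a whole, not one flag in isolation."
        )

validate_shape(tp=8, dp=1)  # single-node shape from above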
Apply backend-specific flags for NVIDIA or Ascend
For NVIDIA GPUs, the standard vLLM install path applies. The vLLM README confirms support for "NVIDIA GPUs, AMD GPUs, and x86/ARM/PowerPC CPUs" plus a range of hardware plugins.
# NVIDIA H100 / H200 — standard vllm serve invocation for DeepSeek-V3
# Single-node shape from above: TP=8, DP=1. Total GPUs = TP × DP, so adding
# -dp 8 on top of TP=8 would require 64 GPUs, not 8.
$ vllm serve deepseek-ai/DeepSeek-V3 \
--tensor-parallel-size 8 \
--enable-expert-parallel \
--max-model-len 16384 \
--max-num-batched-tokens 4096 \
--gpu-memory-utilization 0.88 \
--port 8000
For Huawei Ascend, the path diverges at install time. The vLLM-Ascend DeepSeek-V3.1 tutorial states: "The DeepSeek-V3.1 model is first supported in vllm-ascend:v0.9.1rc3." Install the Ascend-specific package and use its documented flags — do not apply NVIDIA-centric environment variables to an Ascend runtime.
Production Note: DeepSeek-V3 backend choice is a hard gate: install
vllm-ascend for Ascend NPUs and verify the version pin matches the model revision you are serving. The NVIDIA and Ascend serving paths share the same CLI surface but differ in required environment flags, quantization formats, and parallelism constraints. Copying flags between backend docs is the most common misconfiguration pattern for DeepSeek-V3 deployments.
Validate throughput and routing under load
Single-request smoke tests confirm the loading and routing pipeline, but expert imbalance and cache pressure only manifest under realistic concurrency. The vLLM-Ascend long-sequence tutorial notes that max_num_batched_tokens "represents the maximum number of tokens that the model can process in a single step" — this ceiling, together with max_num_seqs, directly affects whether routing stays stable as concurrency grows.
Enable verbose logging for the load test:
$ VLLM_LOGGING_LEVEL=DEBUG vllm serve deepseek-ai/DeepSeek-V3 \
--tensor-parallel-size 8 \
--enable-expert-parallel \
--max-num-batched-tokens 4096 \
--port 8000 2>&1 | tee vllm_serve.log
Run a concurrency ramp to expose routing instability:
import asyncio
from openai import AsyncOpenAI
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="placeholder")
async def single_request(i: int) -> str:
    resp = await client.chat.completions.create(
        model="deepseek-ai/DeepSeek-V3",
        messages=[{"role": "user", "content": f"Count to {i}"}],
        max_tokens=64,
    )
    return resp.choices[0].message.content

async def load_test(n_concurrent: int = 16) -> None:
    # return_exceptions=True so one failed request cannot mask the rest
    results = await asyncio.gather(
        *[single_request(i) for i in range(n_concurrent)], return_exceptions=True
    )
    failed = sum(1 for r in results if isinstance(r, Exception) or not r)
    print(f"Completed: {n_concurrent - failed}/{n_concurrent} requests")

asyncio.run(load_test(16))
If throughput regresses only under concurrent load — not on single requests — investigate expert imbalance, batching ceilings, and KV cache pressure in that order before changing model weights or topology.
Troubleshoot memory pressure, expert imbalance, and routing failures
MoE models load successfully and then fail under traffic for reasons that have nothing to do with weight correctness. The vLLM performance docs name the levers directly: "Decrease max_num_seqs or max_num_batched_tokens. This can reduce the number of concurrent requests in a batch, thereby requiring less KV cache space." The failure modes cluster into three categories: memory pressure from KV cache oversubscription, expert imbalance from skewed routing, and distributed coordination failures from bad parallelism topology.
Watch Out: The three failure modes produce overlapping symptoms — OOMs, latency spikes, and throughput collapse all appear in each category. Misdiagnosing memory pressure as expert imbalance (or vice versa) leads to the wrong fix. Read GPU memory utilization metrics and per-expert activation counts together before adjusting any knob.
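One concrete way to read those signals together: vLLM exposes a Prometheus-style /metrics endpoint on the serving port. Metric names vary across vLLM versions, so filter loosely:
$ curl -s http://localhost:8000/metrics | grep -E "cache_usage|num_requests"
Pair the cache-usage and queue-depth numbers with per-GPU utilization (see the watch command in the next subsection) before adjusting any knob.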
Read the symptoms of bad parallelism choices
Pro Tip: OOMs at server startup often indicate the per-GPU weight footprint is too large — increase
tensor_parallel_size or reduce max_model_len. OOMs that appear only after sustained traffic often indicate KV cache oversubscription — reduce max_num_batched_tokens or max_num_seqs. Latency spikes that correlate with batch size but not request count suggest cross-device routing overhead from over-split experts.
Concrete symptom mapping:
- OOM at startup: tensor_parallel_size is too small for the weight matrix. Increase TP or switch to a quantized checkpoint.
- OOM under traffic, model loads fine: max_num_batched_tokens or gpu_memory_utilization is set too aggressively for the actual KV cache demand.
- p95 latency spikes without OOM: Expert routing is crossing device boundaries inefficiently. Verify that enable_expert_parallel is set and that the EP/TP ratio matches your GPU interconnect topology (see the monitoring command after this list).
- Throughput plateau at low concurrency: max_num_seqs is the ceiling — raise it if memory headroom exists, lower it if latency is the constraint.
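On NVIDIA nodes, the quickest attribution check is a per-GPU utilization watch. Uneven compute utilization across the expert-parallel group points at routing; uniformly high memory use with rising latency points at cache pressure:
$ watch -n 1 'nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total --format=csv'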
Fix OOMs without destroying throughput
Recovering from OOMs while preserving throughput requires adjusting multiple knobs together rather than one in isolation:
# Safer configuration after OOM — tighten cache and batching, increase sharding
tensor_parallel_size: 8 # was 4 — shards weights, frees KV cache memory
gpu_memory_utilization: 0.85 # was 0.92 — leaves more room for fragmentation
max_num_batched_tokens: 2048 # was 8192 — reduces peak cache demand per iteration
max_num_seqs: 32 # was 128 — limits concurrent request footprint
max_model_len: 16384 # was 32768 — reduce if long contexts drive OOMs
The vLLM performance docs confirm: "Increase tensor_parallel_size. This approach shards model weights, so each GPU has more memory available for KV cache." Combining higher TP with lower batching limits is the most reliable path to OOM recovery without destroying throughput — reducing only one dimension rarely resolves the underlying oversubscription.
Handle expert imbalance and hot-token routing
Mixtral routes each token to exactly two experts per layer; DeepSeek-V3 pioneers an auxiliary-loss-free strategy for load balancing at training time, but neither approach eliminates runtime imbalance under skewed inference traffic. When a small set of prompts disproportionately routes to the same experts, those GPUs saturate while others idle.
Pro Tip: If GPU utilization is uneven across the expert-parallel group — one or two devices at 95%+ while others sit at 40–60% — you have expert imbalance, not a memory problem. Reduce
max_num_batched_tokens to lower the per-iteration pressure on hot experts. If the traffic pattern is domain-specific (e.g., all code or all math), consider batching diversity by mixing request types, or capping concurrent requests from the same client to prevent routing monopolization (a client-side sketch follows below).
Traffic shaping at the load balancer level — rate-limiting per-topic or per-domain request bursts — is sometimes more effective than engine-side tuning for pathological routing distributions.
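When load-balancer changes are not an option, the same cap can be approximated client-side. A minimal sketch using asyncio.Semaphore to bound in-flight requests per client; the limit of 4 is an arbitrary example:
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="placeholder")
MAX_IN_FLIGHT = 4  # arbitrary example cap per client
sem = asyncio.Semaphore(MAX_IN_FLIGHT)

async def capped_request(prompt: str) -> str:
    async with sem:  # blocks while MAX_IN_FLIGHT requests are already active
        resp = await client.chat.completions.create(
            model="deepseek-ai/DeepSeek-V3",
            messages=[{"role": "user", "content": prompt}],
            max_tokens=64,
        )
        return resp.choices[0].message.content

async def main() -> None:
    prompts = [f"Question {i}" for i in range(32)]
    answers = await asyncio.gather(*[capped_request(p) for p in prompts])
    print(f"{len(answers)} responses, never more than {MAX_IN_FLIGHT} concurrent")

asyncio.run(main())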
Production checks before exposing the OpenAI-compatible endpoint
Since vLLM's server is compatible with the OpenAI API and acts as a drop-in replacement, clients migrate without code changes. The risk is that API compatibility masks serving instability — a client that works against OpenAI's infrastructure can still trigger OOMs and latency regressions against a misconfigured MoE deployment.
Production Note: Do not expose the endpoint externally until this checklist passes: (1) GPU budget confirmed against model weight size plus KV cache headroom, (2)
max_model_len validated against actual p99 prompt+response length, (3) max_num_batched_tokens stress-tested at 2× expected peak concurrency, (4) at least one full concurrency ramp completed without OOM or latency collapse, (5) OpenAI-compatible API validated against the client SDK you will use in production.
Pre-promotion checklist:
- [ ] GPU VRAM budget accounts for weights + KV cache at max concurrency
- [ ] max_model_len matches the actual context window your workload needs
- [ ] max_num_batched_tokens is stable under 2× expected peak QPS
- [ ] Expert parallelism flags explicitly set and confirmed active in server logs
- [ ] Backend package matches the hardware (standard vllm for NVIDIA, vllm-ascend for Ascend)
- [ ] Observability (logging, metrics endpoint) enabled on the production backend
- [ ] Health check endpoint returns 200 under load
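For that final item, the vLLM server exposes a /health endpoint; checking it while the concurrency ramp from earlier is running confirms the engine loop is still scheduling under load:
$ curl -i http://localhost:8000/health  # expect HTTP/1.1 200 OK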
Set sane defaults for latency and capacity
The vLLM engine args docs note that the default value for max_num_batched_tokens "is mainly for convenience when testing" — meaning the testing default is not a production-safe value for MoE models under real traffic.
Production Note: For MoE serving, capacity planning should use the worst-case routing and context profile, not average traffic. A workload that is 90% short requests and 10% long-context requests will have its OOM behavior driven entirely by the 10%. Start with conservative batching settings and raise them incrementally once you have real p99 measurements.
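To see why the long-context tail dominates, work the per-sequence KV cache arithmetic. This sketch assumes Mixtral 8x7B's published attention config (32 layers, 8 KV heads via GQA, head dimension 128) and a BF16 cache:
# Per-sequence KV cache size. Assumes Mixtral 8x7B's published config:
# 32 layers, 8 KV heads (GQA), head_dim 128, BF16 cache (2 bytes/element).
layers, kv_heads, head_dim, dtype_bytes = 32, 8, 128, 2
kv_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes  # K and V planes

for ctx in (2048, 16384, 32768):
    gib = kv_per_token * ctx / 1024**3
    print(f"{ctx:>6}-token request -> {gib:5.2f} GiB of KV cache")
# A single 32k request costs ~4 GiB of cache; the 10% long-context tail,
# not the 90% short requests, sets the memory budget.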
Conservative production baseline:
gpu_memory_utilization: 0.88 # raise to 0.92 only after 48h stable
max_num_batched_tokens: 4096 # raise in 2048 increments after measuring p99
max_num_seqs: 64 # raise only once OOM recovery path is tested
max_model_len: 16384 # match to actual workload, not model maximum
Confirm client compatibility and observability
vLLM provides an HTTP server that implements OpenAI's Completions API, Chat API, and more. Client compatibility issues almost always stem from mismatched server flags (wrong model name, missing --served-model-name alias) rather than API shape incompatibility.
$ curl http://localhost:8000/v1/models # confirm the served model name matches what clients expect
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="placeholder")
# Validates both API shape and that routing returns consistent results
for _ in range(3):
    resp = client.chat.completions.create(
        model="deepseek-ai/DeepSeek-V3",  # must match --served-model-name if aliased
        messages=[{"role": "user", "content": "What is 2+2?"}],
        max_tokens=16,
    )
    print(resp.choices[0].message.content)
Three consistent responses confirm that routing, sampling, and the HTTP layer are all functional. If throughput is lower than expected from the smoke test, revisit batching and parallelism settings before investigating the client.
Frequently asked questions about vLLM expert parallelism
What is expert parallelism in vLLM?
Expert parallelism distributes the experts of an MoE layer across separate GPUs, so each GPU hosts a subset of experts. This improves routing locality — tokens routed to a given expert do not require cross-device weight transfers. vLLM implements EP as an explicit runtime flag (--enable-expert-parallel) and typically couples it with data parallelism for full deployment coverage; see the expert parallel deployment docs.
Does vLLM support MoE models?
Yes. The vLLM docs explicitly list Mixtral, DeepSeek-V3, Qwen-MoE, and GPT-OSS as supported MoE architectures, with expert parallelism available as a first-class distributed inference mode.
How do you serve Mixtral with vLLM?
Set --tensor-parallel-size to match your GPU count, add --enable-expert-parallel, tune --max-model-len and --max-num-batched-tokens for your context budget, and adjust --gpu-memory-utilization to leave fragmentation headroom. The exact values depend on whether you are serving 8x7B or 8x22B and your GPU VRAM budget; the Mixtral paper and the vLLM engine args docs cover the architecture and serving knobs.
How do you serve DeepSeek-V3 with vLLM?
DeepSeek-V3 requires explicit EP and DP configuration alongside TP. Use -dp to set data parallelism replicas, --enable-expert-parallel, and --tensor-parallel-size 8 as a baseline for single-node deployments. For Ascend backends, install vllm-ascend at v0.9.1rc3 or later per the DeepSeek-V3.1 Ascend tutorial.
What hardware is needed for DeepSeek-V3 serving?
DeepSeek-V3 has 671B total parameters with 37B activated per token. On NVIDIA hardware, plan for a multi-GPU or multi-node deployment with enough VRAM for the full weight set plus KV cache. Do not attempt single-GPU or 2–4 GPU serving without a quantized checkpoint that fits your VRAM budget; see the DeepSeek-V3 technical report for the model size breakdown and the vLLM documentation for distributed inference support.
What is the difference between EP and TP for MoE models?
TP shards individual weight matrices across GPUs, reducing per-GPU weight footprint. EP keeps entire experts on single GPUs but distributes different experts to different devices, improving routing locality. Most production MoE deployments use both: TP to fit the model in memory, EP to reduce cross-device all-to-all traffic during inference; the expert parallel deployment docs and the performance docs explain the tradeoff.
Sources and references
- vLLM GitHub Repository — Primary source for installation, MoE model support, hardware backends, and OpenAI-compatible server documentation
- vLLM Documentation: Expert Parallel Deployment — Official guide to enabling and configuring EP in vLLM
- vLLM Documentation: Engine Arguments — Reference for max_num_batched_tokens, max_model_len, gpu_memory_utilization, and related serving flags
- vLLM Documentation: Optimization — KV cache and batching tuning guidance for production serving
- vLLM Performance Docs (v0.5.5) — Concrete recommendations for tensor_parallel_size, max_num_seqs, and memory management
- vLLM OpenAI-Compatible Server Docs — Server entrypoint and API compatibility reference
- vLLM Quickstart — Baseline serving walkthrough and client compatibility notes
- vLLM Recipes: DeepSeek-V3.1 — Practical install and launch recipe for DeepSeek-V3.1 on NVIDIA hardware
- vLLM Recipes: DeepSeek-V3.2-Exp — Expert-parallel launch pattern with -dp and --enable-expert-parallel
- vLLM-Ascend: DeepSeek-V3.1 Tutorial — Ascend-specific serving path and version gate for DeepSeek-V3.1
- vLLM-Ascend: Long Sequence Context Parallel Tutorial — ChunkPrefill, SplitFuse, and max_num_batched_tokens behavior under v1 scheduling
- Mixtral of Experts — arXiv:2401.04088 — Original Mixtral paper: architecture, routing (top-2 experts per layer), and benchmark results
- DeepSeek-V3 Technical Report — arXiv:2412.19437 — DeepSeek-V3 architecture, 671B/37B parameter breakdown, and auxiliary-loss-free load balancing strategy
Keywords: vLLM, Mixtral 8x7B, Mixtral 8x22B, DeepSeek-V3, DeepSeek-V3.1, DeepSeek-V3.2-Exp, expert parallelism, tensor parallelism, data parallelism, PagedAttention, continuous batching, chunked prefill, prefix caching, MoE serving, Mixture-of-Experts, OpenAI-compatible API, NVIDIA H100, NVIDIA H200, gpu_memory_utilization, max_num_batched_tokens, vllm serve, vllm-ascend