Production LLM serving fails at the SLO boundary, not the hardware boundary. A system generating 10,000 tokens per second means nothing if 30% of those tokens arrive outside the latency window promised to the user. The metric that captures this distinction is goodput—and tuning ChunkSize (chunked prefill segment size) is one of the highest-leverage mechanisms for maximizing it.
This guide covers the mathematical basis for ChunkSize selection, the scheduler mechanics that make it work, and the concrete vLLM configuration required to implement it on specific GPU tiers.
Beyond Throughput: The Engineering Case for Goodput
Raw token-per-second throughput measures what your hardware can do. Goodput measures what your serving system actually delivers against contractual commitments. Specifically, goodput is the number of completed requests per second that adhere to defined latency-based Service Level Objectives (SLOs). As the Hao AI Lab (the DistServe authors) states directly:
"Throughput is not all you need: we advocate using goodput, the number of completed requests per second adhering to the Service Level Objectives (SLOs), as an improved measure of LLM serving performance."
The distinction matters because long-sequence prefill bursts create latency jitter that throughput metrics systematically ignore. A batch of 10,000 tokens processed in a single prefill step produces high throughput but stalls every concurrent decode stream for the duration—violating TPOT SLOs for those requests.
| Metric | What It Measures | What It Misses | Production Relevance |
|---|---|---|---|
| Throughput | Tokens/second across all requests | Per-request latency SLO compliance | Vanity benchmark; useful for hardware comparison only |
| Goodput | Completed requests/second meeting TTFT + TPOT SLOs | Raw hardware ceiling | Direct mapping to P95/P99 SLA commitments |
| TTFT | Time from request receipt to first generated token | Decode phase health | User-perceived responsiveness; critical for interactive use |
| TPOT | Time between each successive output token | Prefill interference | Streaming fluency; affected by colocated prefill bursts |
DistServe (OSDI '24) demonstrated that disaggregating prefill and decode phases achieves up to 4.48x higher goodput compared to state-of-the-art colocated serving systems, with 10.2x tighter SLO compliance. Those numbers are not incremental improvements—they reflect the structural cost of using throughput as your primary optimization target in colocated architectures. Once a system optimizes for raw throughput, the scheduler naturally packs in longer prefills to maximize GPU utilization, which directly degrades goodput for existing decoding sessions.
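The metric itself is simple to compute from per-request latency samples. A minimal sketch (the SLO thresholds and sample latencies are illustrative):

```python
from dataclasses import dataclass

@dataclass
class RequestLatency:
    ttft_ms: float  # Time To First Token
    tpot_ms: float  # mean Time Per Output Token

def goodput(requests: list[RequestLatency], window_s: float,
            slo_ttft_ms: float = 500.0, slo_tpot_ms: float = 50.0) -> float:
    """Completed requests per second that meet BOTH latency SLOs."""
    compliant = sum(
        1 for r in requests
        if r.ttft_ms <= slo_ttft_ms and r.tpot_ms <= slo_tpot_ms
    )
    return compliant / window_s

# 4 requests completed over a 2-second window; one breaches TPOT
samples = [
    RequestLatency(420.0, 38.0),
    RequestLatency(310.0, 41.0),
    RequestLatency(480.0, 62.0),  # TPOT breach: no goodput credit
    RequestLatency(250.0, 35.0),
]
# Throughput would count all 4 completions; goodput counts 3 -> 1.5 req/s
print(goodput(samples, window_s=2.0))  # -> 1.5
```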
Mechanics of Prefill-Decode Disaggregation
In a colocated LLM serving system, prefill and decode execute sequentially on the same GPU resources within each iteration. This creates a structural interference problem: a single long-context prefill request can monopolize the compute kernel for hundreds of milliseconds, stalling the token generation for every concurrently decoding request in the batch.
The sequence below illustrates the request flow difference between colocated and disaggregated architectures:
```mermaid
sequenceDiagram
    participant Client
    participant Router
    participant PrefillGPU as Prefill GPU (Compute-Optimized)
    participant DecodeGPU as Decode GPU (Memory-Bandwidth-Optimized)
    participant KVStore as KV Cache Transfer Layer

    Client->>Router: POST /generate (prompt tokens)
    Router->>PrefillGPU: Dispatch prefill job (full context)
    Note over PrefillGPU: Parallel matrix multiply<br/>across full sequence length<br/>FLOP-intensive, no decode stall
    PrefillGPU->>KVStore: Push KV cache (NVLink / RDMA)
    KVStore->>DecodeGPU: Stream KV cache
    Router->>DecodeGPU: Dispatch decode job
    Note over DecodeGPU: Autoregressive decode<br/>Memory-bandwidth bound<br/>Isolated from prefill interference
    DecodeGPU-->>Client: Stream tokens (TPOT bounded)
```
The key architectural insight is that prefill is compute-bound (high arithmetic intensity, benefits from tensor parallelism across large matrices) while decode is memory-bandwidth-bound (low arithmetic intensity, one token per step, repeatedly loading model weights). Colocating them forces a hardware compromise on both. Disaggregating them allows each phase to run on hardware matched to its compute profile.
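The asymmetry shows up in back-of-the-envelope arithmetic intensity (FLOPs per byte of weights loaded) for a single d_model x d_model projection; the numbers below are illustrative, ignoring activation traffic:

```python
def arithmetic_intensity(num_tokens: int, d_model: int,
                         dtype_bytes: int = 2) -> float:
    """FLOPs per byte of weight traffic for one d_model x d_model matmul.

    FLOPs = 2 * num_tokens * d_model^2   (one multiply-accumulate per element)
    Bytes = dtype_bytes * d_model^2      (weights loaded once per step)
    """
    flops = 2 * num_tokens * d_model ** 2
    weight_bytes = dtype_bytes * d_model ** 2
    return flops / weight_bytes

# Prefill: thousands of tokens amortize each weight load -> compute-bound
print(arithmetic_intensity(num_tokens=2048, d_model=8192))  # -> 2048.0
# Decode: one token per step cannot amortize the load -> bandwidth-bound
print(arithmetic_intensity(num_tokens=1, d_model=8192))     # -> 1.0
```

Loosely: the ratio scales with the number of tokens processed per weight load, which is why batching prefill tokens saturates compute while single-token decode steps sit on the memory-bandwidth roofline.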
As Srikarpisupati et al. (2026) note: "By disaggregating prefill and decoding computations, systems are allowed to prioritize both TTFT and TPOT instead of one or the other." This is not a theoretical claim—it is a direct consequence of eliminating head-of-line blocking in the iteration-level scheduler. Dedicated prefill kernels can process new requests without preempting active decode streams, enabling independently bounded latency SLOs for both TTFT and TPOT.
The hard constraint: disaggregation requires dedicated GPU pools for each phase, plus high-speed interconnects (NVLink or RDMA over InfiniBand) to transfer KV cache tensors between the prefill and decode nodes with acceptable latency overhead.
Decoding the ChunkSize Tuning Variable
Chunked prefill is the mechanism that makes goodput optimization tractable on colocated infrastructure—or that refines latency control on disaggregated setups. Instead of processing the full input sequence in one prefill step, the scheduler splits the prefill into discrete chunks, interleaving them with decode iterations. Each chunk represents a fixed number of prefill tokens processed per scheduler step.
The core tradeoff: smaller chunks reduce the per-step latency impact on decode streams (improving TPOT consistency) but require more iterations to complete a prefill (increasing TTFT). Larger chunks saturate GPU compute kernels more efficiently but risk decode stalls.
The theoretical optimal ChunkSize formula, given hardware constraints:
```
ChunkSize = floor(
    (VRAM_Available_MB - KV_Cache_Reserve_MB) * 2^20
    / Token_Size_Bytes
)
Num_Prefill_Chunks = ceil(Input_Sequence_Length / ChunkSize)
```
Where:
- VRAM_Available_MB = Total GPU VRAM minus model weights (FP16/BF16) minus framework overhead
- KV_Cache_Reserve_MB = Memory reserved for the KV cache of concurrently decoding sequences
- Token_Size_Bytes = Per-token KV cache size = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
- Input_Sequence_Length = P95 input sequence length from your traffic distribution; it determines the number of sequential chunks, not the chunk size itself
Memory Constraint: On an A100 80GB GPU running Llama-3 70B in BF16 (≈140GB weights with tensor parallelism across 2 GPUs = ≈70GB per GPU), with 8GB KV cache reserved for 32 concurrent decode streams, available VRAM for chunked prefill is approximately 2GB. For a 2K-token input with a KV cache token size of ≈1MB per token (at full precision), this yields a ChunkSize ceiling around 2,000 tokens—meaning a 2K prompt fits in one chunk. At 8K tokens, ChunkSize must be capped at 2,000 with 4 sequential prefill chunks.
This formula provides the upper bound. TTFT SLOs establish the lower bound: if your TTFT SLO is 500ms and each prefill chunk adds approximately 50ms of processing latency, you have a budget of 10 chunks maximum before breaching the SLO, regardless of VRAM headroom.
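The VRAM-derived ceiling can be computed directly; the model geometry below is an assumption (Llama-3-70B-style: 80 layers, 8 KV heads under grouped-query attention, head_dim 128), as are the headroom figures:

```python
import math

def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, dtype_bytes: int = 2) -> int:
    # Keys and values for every layer: 2 * layers * kv_heads * head_dim * bytes
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

def chunk_size_ceiling(vram_free_mb: float, kv_reserve_mb: float,
                       token_bytes: int) -> int:
    """Upper bound on ChunkSize given free VRAM after the decode KV reserve."""
    free_bytes = (vram_free_mb - kv_reserve_mb) * 2**20
    return math.floor(free_bytes / token_bytes)

# Assumed geometry: 80 layers, 8 KV heads (GQA), head_dim 128, BF16
tok = kv_bytes_per_token(80, 8, 128)          # 327,680 bytes ~ 0.31 MB/token
ceiling = chunk_size_ceiling(10_000, 8_000, tok)
print(tok, ceiling)  # -> 327680 6400
```

Note that grouped-query attention shrinks the per-token KV footprint several-fold versus full multi-head attention, which is why measured ceilings often exceed naive full-head estimates.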
Balancing TTFT against GPU Kernel Saturation
Larger ChunkSize values increase GPU kernel occupancy, but with diminishing returns: doubling ChunkSize from 512 to 1024 tokens yields near-linear throughput gains, while doubling from 4096 to 8192 yields little additional benefit because the prefill kernels are already compute-saturated at that point.
The scheduler override must be conditional on prompt length to avoid over-chunking short prompts (which wastes scheduling overhead) or under-chunking long prompts (which wastes kernel occupancy):
```python
from dataclasses import dataclass
import math

@dataclass
class ChunkSchedulerConfig:
    max_chunk_tokens: int         # Hard upper bound (VRAM-derived)
    ttft_slo_ms: float            # Target TTFT SLO in milliseconds
    ms_per_chunk_estimate: float  # Profiled latency per chunk on target GPU
    min_chunk_tokens: int = 256   # Floor to avoid scheduler overhead per iteration

def compute_chunk_size(
    prompt_length: int,
    config: ChunkSchedulerConfig,
) -> int:
    """
    Computes the per-request ChunkSize that respects both VRAM limits
    and the TTFT SLO budget. Returns number of tokens per prefill chunk.
    """
    # Maximum chunks allowed before breaching the TTFT SLO
    max_chunks_by_slo = math.floor(
        config.ttft_slo_ms / config.ms_per_chunk_estimate
    )
    # Minimum chunk size to complete prefill within the SLO chunk budget
    slo_derived_chunk = math.ceil(prompt_length / max_chunks_by_slo)
    # Clamp to the hardware-derived upper bound and scheduler floor
    optimal_chunk = max(
        config.min_chunk_tokens,
        min(slo_derived_chunk, config.max_chunk_tokens),
    )
    return optimal_chunk

# Example: A100 80GB, Llama-3 8B, targeting TTFT < 500ms
cfg = ChunkSchedulerConfig(
    max_chunk_tokens=4096,
    ttft_slo_ms=500.0,
    ms_per_chunk_estimate=45.0,  # Profiled on target hardware
    min_chunk_tokens=256,
)

# For a 6K-token prompt: SLO allows floor(500/45) = 11 chunks
# ceil(6000/11) = 546 tokens/chunk — within VRAM bounds
print(compute_chunk_size(prompt_length=6000, config=cfg))   # -> 546
# For a 32K-token prompt: ceil(32000/11) = 2910 — still within the 4096 VRAM bound
print(compute_chunk_size(prompt_length=32000, config=cfg))  # -> 2910
```
Pro-Tip: Profile `ms_per_chunk_estimate` empirically for each GPU tier and model size combination. This value shifts significantly with tensor parallelism degree and KV cache pressure. Use NVIDIA Nsight Compute to measure actual kernel duration per prefill step, not synthetic benchmarks.
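One way to distill the profiled timings into a single `ms_per_chunk_estimate` is an ordinary least-squares fit of step latency against chunk size; the sample points below are invented for illustration:

```python
def fit_ms_per_chunk(samples: list[tuple[int, float]]) -> tuple[float, float]:
    """Least-squares fit: latency_ms ~ slope * chunk_tokens + intercept.

    slope * ChunkSize gives the marginal prefill cost per chunk; the
    intercept captures fixed per-step scheduler/kernel-launch overhead.
    """
    n = len(samples)
    sx = sum(x for x, _ in samples)
    sy = sum(y for _, y in samples)
    sxx = sum(x * x for x, _ in samples)
    sxy = sum(x * y for x, y in samples)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept

# (chunk_tokens, measured step latency in ms) — hypothetical profile points
profile = [(512, 9.0), (1024, 14.0), (2048, 24.0), (4096, 44.0)]
slope, intercept = fit_ms_per_chunk(profile)
# Estimated step latency for a 1024-token chunk on this (fictional) GPU
print(slope * 1024 + intercept)  # -> 14.0
```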
Implementing ChunkSize Tuning in vLLM
vLLM's iteration-level scheduler controls chunked prefill through `max_num_batched_tokens`, which sets the token budget per scheduler step across all requests. This is the primary knob for ChunkSize in production deployments. The default value of 4096 is a reasonable starting point for 7B-class models on A100 hardware, but requires explicit tuning for larger models or different GPU tiers.
Technical Warning: Chunked prefill in vLLM requires version 0.4.x or later. Earlier versions process prefill monolithically, which eliminates the ability to interleave decode iterations and directly causes the TPOT jitter described above. Verify your version with `python -c "import vllm; print(vllm.__version__)"`.
The following configuration targets an H100 80GB SXM GPU running Llama-3 70B with tensor parallelism across 4 GPUs:
```python
from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs

# H100 80GB SXM: ~80GB VRAM per GPU, TP=4 splits the 140GB model across 4 cards.
# Each GPU holds ~35GB of weights, leaving ~45GB for KV cache + activations
engine_args = AsyncEngineArgs(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    tensor_parallel_size=4,       # 4x H100 80GB SXM via NVLink
    # Context window: set to your P99 input+output length, not the model max.
    # Larger values reserve more KV cache pages, reducing prefill headroom
    max_model_len=16384,          # 16K context; reduced from the model default 128K
    # Primary ChunkSize control: tokens processed per scheduler iteration.
    # Derived from: VRAM headroom / token KV size, capped by the TTFT SLO budget
    max_num_batched_tokens=8192,  # Tuned for H100 memory bandwidth at TP=4
    # Concurrent sequences in flight (prefill + decode combined).
    # Increasing this beyond VRAM headroom causes KV cache eviction (thrashing)
    max_num_seqs=128,             # Conservative for 70B to avoid OOM under burst
    # Enable chunked prefill — required for iteration-level interleaving
    enable_chunked_prefill=True,
    # PagedAttention block size: 16 is the default; larger blocks reduce paging
    # bookkeeping at the cost of coarser-grained KV cache memory management
    block_size=16,
    # Fraction of GPU memory vLLM may use for weights + KV cache + activations
    gpu_memory_utilization=0.90,  # Leave 10% for CUDA context and framework overhead
)

# Sampling config: representative of a production interactive use case
sampling_params = SamplingParams(
    temperature=0.7,
    max_tokens=512,  # P95 output length for your workload
)
```
For smaller GPU tiers (A10G 24GB, single GPU, Llama-3 8B):
```python
engine_args_a10g = AsyncEngineArgs(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    tensor_parallel_size=1,
    max_model_len=8192,           # A10G: 24GB VRAM, ~16GB weights (BF16), ~8GB for KV cache
    max_num_batched_tokens=2048,  # Smaller chunk: A10G bandwidth ~600 GB/s vs H100's ~3.35 TB/s
    max_num_seqs=32,
    enable_chunked_prefill=True,
    block_size=16,
    gpu_memory_utilization=0.88,
)
```
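The same engine arguments map onto vLLM's OpenAI-compatible server CLI. A sketch for the A10G configuration (flag names as of the 0.4.x-era releases; verify against your installed version's `--help`):

```shell
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3-8B-Instruct \
    --tensor-parallel-size 1 \
    --max-model-len 8192 \
    --max-num-batched-tokens 2048 \
    --max-num-seqs 32 \
    --enable-chunked-prefill \
    --block-size 16 \
    --gpu-memory-utilization 0.88
```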
Validating Goodput via SLO-Aware Benchmarking
Standard load testing tools measure average latency. SLO validation requires P95 and P99 tail latency breakdowns for both TTFT and TPOT independently, plus goodput calculation (requests/second that satisfy both SLOs simultaneously). Benchmarking must also account for queueing delays that throughput tests systematically ignore.
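Tail percentiles should be computed per metric, never on a blended latency. A minimal nearest-rank percentile helper (one common SLO convention; interpolating definitions also exist):

```python
import math

def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with >= pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank - 1, 0)]

# TTFT samples in ms — note how one outlier dominates P95 but not P50
ttft_ms = [120, 180, 210, 250, 300, 340, 410, 470, 520, 900]
print(percentile(ttft_ms, 95))  # -> 900
print(percentile(ttft_ms, 50))  # -> 300
```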
The following locust task measures TTFT and TPOT against configurable SLO thresholds:
```python
# locustfile.py — SLO-aware goodput measurement for a vLLM OpenAI-compatible endpoint
import time
from typing import Optional

from locust import HttpUser, task, between, events

SLO_TTFT_MS = 500.0  # Maximum acceptable Time To First Token
SLO_TPOT_MS = 50.0   # Maximum acceptable Time Per Output Token

slo_pass = 0
slo_fail = 0

class LLMGoodputUser(HttpUser):
    wait_time = between(0.1, 0.5)  # Simulates Poisson-ish inter-arrival times

    @task
    def generate_stream(self):
        global slo_pass, slo_fail
        payload = {
            "model": "meta-llama/Meta-Llama-3-8B-Instruct",
            "prompt": "Explain the concept of transformer attention mechanisms in detail.",
            "max_tokens": 256,
            "stream": True,
            "temperature": 0.7,
        }
        request_start = time.perf_counter()
        first_token_time: Optional[float] = None
        token_times = []
        last_token_time = request_start

        with self.client.post(
            "/v1/completions",
            json=payload,
            stream=True,
            catch_response=True,
        ) as response:
            for line in response.iter_lines():
                if line and line.startswith(b"data:"):
                    chunk_str = line[5:].strip()
                    if chunk_str == b"[DONE]":
                        break
                    now = time.perf_counter()
                    if first_token_time is None:
                        first_token_time = now  # Capture TTFT timestamp
                    else:
                        # Each subsequent token measures TPOT
                        token_times.append((now - last_token_time) * 1000)
                    last_token_time = now

            if first_token_time is None:
                response.failure("No tokens received")
                return

            ttft_ms = (first_token_time - request_start) * 1000
            avg_tpot_ms = sum(token_times) / len(token_times) if token_times else 0.0

            # SLO compliance check — both conditions must pass for goodput credit
            if ttft_ms <= SLO_TTFT_MS and avg_tpot_ms <= SLO_TPOT_MS:
                slo_pass += 1
                response.success()
            else:
                slo_fail += 1
                response.failure(
                    f"SLO breach: TTFT={ttft_ms:.1f}ms, TPOT={avg_tpot_ms:.1f}ms"
                )

@events.quitting.add_listener
def print_goodput_summary(environment, **kwargs):
    total = slo_pass + slo_fail
    goodput_pct = (slo_pass / total * 100) if total > 0 else 0
    print("\n=== Goodput Summary ===")
    print(f"Total Requests : {total}")
    print(f"SLO Pass       : {slo_pass} ({goodput_pct:.1f}%)")
    print(f"SLO Fail       : {slo_fail}")
    print(f"Target Goodput : >95% — {'PASS' if goodput_pct >= 95 else 'FAIL'}")
```
Run with:
```shell
locust -f locustfile.py --headless \
    --host http://localhost:8000 \
    --users 50 \
    --spawn-rate 5 \
    --run-time 120s \
    --html goodput_report.html
```
Strategies for Mitigating Memory-Bandwidth Contention
When full disaggregation is operationally viable, prefill and decode phases should run on separate GPU pools with hardware matched to their respective compute profiles. Disaggregation allows scaling each phase independently based on observed traffic patterns—a key operational advantage, since prefill demand spikes with prompt length distributions while decode demand scales with concurrency.
The communication overhead between disaggregated GPU pools is the primary latency risk. NVLink (within a node) moves a gigabyte in roughly 1-2 ms at 450-900 GB/s of effective bandwidth. For a 70B model with a 4K-token KV cache at BF16 precision, inter-node KV transfer overhead is approximately 100-200ms over InfiniBand HDR at 200 Gb/s, a cost that must be factored into the TTFT SLO budget.
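That estimate can be sanity-checked with simple link arithmetic, using the ≈1 MB/token KV figure from earlier (grouped-query attention models will be several-fold smaller):

```python
def kv_transfer_ms(num_tokens: int, mb_per_token: float,
                   link_gbit_per_s: float) -> float:
    """Estimated KV cache transfer time over a serial link, ignoring protocol overhead."""
    payload_bits = num_tokens * mb_per_token * 2**20 * 8
    return payload_bits / (link_gbit_per_s * 1e9) * 1e3

# 4K-token context, ~1 MB/token, InfiniBand HDR at 200 Gb/s -> ~172 ms
print(round(kv_transfer_ms(4096, 1.0, 200.0)))
```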
The following scheduler policy implements a CUDA-device-aware dispatch that routes requests based on phase:
```python
# disaggregated_scheduler.py
# Simplified sketch of phase-aware dispatch across two vLLM engine instances.
# In production, each engine runs in its own process pinned to its GPU via
# CUDA_VISIBLE_DEVICES (e.g., =0 for prefill, =1 for decode), typically
# fronted by a request router/queue (e.g., Redis-backed).
from dataclasses import dataclass

from vllm.engine.async_llm_engine import AsyncLLMEngine
from vllm.engine.arg_utils import AsyncEngineArgs

@dataclass
class DisaggregatedEnginePool:
    prefill_engine: AsyncLLMEngine  # Compute-optimized GPU (e.g., H100 SXM)
    decode_engine: AsyncLLMEngine   # Memory-bandwidth-optimized GPU (e.g., A100 HBM2e)

def build_prefill_engine() -> AsyncLLMEngine:
    # Run under CUDA_VISIBLE_DEVICES=<prefill GPU index>
    args = AsyncEngineArgs(
        model="meta-llama/Meta-Llama-3-8B-Instruct",
        tensor_parallel_size=1,
        max_num_batched_tokens=16384,  # Large chunks: prefill saturates compute kernels
        max_num_seqs=8,                # Low concurrency: prefill is batch-size-insensitive
        enable_chunked_prefill=False,  # Full prefill per step on dedicated hardware
        gpu_memory_utilization=0.95,
    )
    return AsyncLLMEngine.from_engine_args(args)

def build_decode_engine() -> AsyncLLMEngine:
    # Run under CUDA_VISIBLE_DEVICES=<decode GPU index>
    args = AsyncEngineArgs(
        model="meta-llama/Meta-Llama-3-8B-Instruct",
        tensor_parallel_size=1,
        max_num_batched_tokens=2048,   # Smaller budget: decode is one token per request
        max_num_seqs=128,              # High concurrency: decode is memory-bandwidth bound
        enable_chunked_prefill=False,  # No prefill chunking on the decode GPU
        gpu_memory_utilization=0.90,
    )
    return AsyncLLMEngine.from_engine_args(args)

async def route_request(
    pool: DisaggregatedEnginePool,
    prompt: str,
    request_id: str,
    sampling_params,
) -> str:
    """
    Phase 1: Run full prefill on the compute-optimized GPU.
    Phase 2: Transfer KV state and run decode on the bandwidth-optimized GPU.
    NOTE: This is a simplified illustration; production requires KV cache
    serialization (e.g., via safetensors) and shared storage (e.g., NVMe-oF).
    """
    # Prefill phase — generate KV cache on the prefill GPU
    prefill_outputs = []
    async for output in pool.prefill_engine.generate(
        prompt, sampling_params, request_id=f"{request_id}-prefill"
    ):
        prefill_outputs.append(output)
        if output.outputs[0].finish_reason is not None:
            break  # Prefill-only run: stop after context is built

    # In production: serialize the KV cache, transfer to the decode GPU via NVLink/RDMA

    # Decode phase — resume generation on the memory-bandwidth-optimized GPU
    decode_outputs = []
    async for output in pool.decode_engine.generate(
        prompt, sampling_params, request_id=f"{request_id}-decode"
    ):
        decode_outputs.append(output)
        if output.outputs[0].finish_reason is not None:
            break

    return decode_outputs[-1].outputs[0].text if decode_outputs else ""
```
Technical Warning: Full KV cache transfer between GPU instances requires either shared GPU memory (NVLink with peer-to-peer access) or explicit serialization to host memory and back. The latency cost of naive host-mediated transfer (PCIe Gen4 ≈64 GB/s bidirectional) is prohibitive for large KV caches. Ensure NVLink peer access is enabled: `torch.cuda.can_device_access_peer(src, dst)` must return `True` before deploying this pattern.
Predictive Scaling based on Request Inter-arrival Time
Static ChunkSize configurations fail under bursty traffic. Request inter-arrival time spikes directly correlate with goodput degradation: when N requests arrive simultaneously, the prefill queue depth grows faster than the scheduler can drain it, causing TTFT SLO breaches even with correctly tuned chunks. This is a consequence of the Poisson nature of real-world API traffic—variance in arrival rates exceeds what static batch sizing can absorb.
```mermaid
flowchart TD
    A[Request Arrival Monitor] --> B{Inter-arrival time\n< threshold?}
    B -- Yes: Burst Detected --> C[Reduce ChunkSize\nDecrease max_num_batched_tokens\nPrioritize decode continuity]
    B -- No: Normal Load --> D[Maintain or Increase ChunkSize\nSaturate GPU compute kernels\nMinimize TTFT]
    C --> E[Update Scheduler Config\nvia AsyncEngineArgs hot-reload]
    D --> E
    E --> F[Measure P95 TTFT + TPOT\nover rolling 30s window]
    F --> G{Both SLOs met?}
    G -- Yes --> H[Hold current ChunkSize\nLog goodput metric]
    G -- No: TTFT breach --> I[Increase ChunkSize by 25%]
    G -- No: TPOT breach --> J[Reduce ChunkSize by 25%]
    I --> E
    J --> E
    H --> A
```

(Note the directions: per the tradeoff above, a TTFT breach calls for larger chunks so prefill completes in fewer iterations, while a TPOT breach calls for smaller chunks to protect decode continuity.)
Dynamic chunk resizing requires Poisson-rate estimation from observed inter-arrival times. A simple exponential moving average of inter-arrival deltas, computed per-second by the request router, provides sufficient signal for a proportional controller. Pair this with a sliding-window SLO compliance rate (goodput %) as the feedback signal—if goodput drops below 95% over the last 30 seconds, trigger a ChunkSize adjustment in the direction dictated by which SLO is breaching.
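The loop above can be sketched as a simple controller. The burst threshold, window length, and 25% step size are assumptions to tune per deployment:

```python
from collections import deque
from typing import Optional

class ChunkSizeController:
    """Adjusts ChunkSize from (a) an EMA of inter-arrival times and
    (b) a sliding-window SLO compliance rate. Thresholds are hypothetical."""

    def __init__(self, chunk: int, lo: int = 256, hi: int = 8192,
                 alpha: float = 0.2, burst_ms: float = 20.0):
        self.chunk, self.lo, self.hi = chunk, lo, hi
        self.alpha = alpha
        self.burst_ms = burst_ms             # EMA below this => burst regime
        self.ema_ms: Optional[float] = None  # smoothed inter-arrival time
        self.window = deque(maxlen=300)      # rolling (ttft_ok, tpot_ok) pairs

    def on_arrival(self, inter_arrival_ms: float) -> bool:
        """Update the EMA; returns True when a burst is detected."""
        prev = inter_arrival_ms if self.ema_ms is None else self.ema_ms
        self.ema_ms = self.alpha * inter_arrival_ms + (1 - self.alpha) * prev
        return self.ema_ms < self.burst_ms

    def on_completion(self, ttft_ok: bool, tpot_ok: bool) -> None:
        self.window.append((ttft_ok, tpot_ok))

    def adjust(self) -> int:
        """If windowed goodput < 95%, step ChunkSize 25% toward the breaching SLO:
        TTFT breaches want fewer, larger chunks; TPOT breaches want smaller ones."""
        if self.window:
            goodput_pct = 100.0 * sum(t and p for t, p in self.window) / len(self.window)
            if goodput_pct < 95.0:
                ttft_breaches = sum(not t for t, _ in self.window)
                tpot_breaches = sum(not p for _, p in self.window)
                if ttft_breaches >= tpot_breaches:
                    self.chunk = min(self.hi, int(self.chunk * 1.25))
                else:
                    self.chunk = max(self.lo, int(self.chunk * 0.75))
        return self.chunk

# Sustained TTFT breaches push ChunkSize up (fewer chunks per prefill)
ctrl = ChunkSizeController(chunk=4096)
for _ in range(20):
    ctrl.on_completion(ttft_ok=False, tpot_ok=True)
print(ctrl.adjust())  # -> 5120
```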
Synthesizing Production-Grade Serving Infrastructure
All optimizations above must be verifiable through automated regression testing. A ChunkSize tuned for stable traffic becomes a latency hazard when model context window or traffic distributions shift. CI/CD pipelines must run SLO regression benchmarks against each configuration change before promotion to production.
The integration pattern: run the locust goodput benchmark as a pre-deployment gate in your CI pipeline. Fail the deployment if goodput drops below 95% at target concurrency. This converts an operational tuning exercise into a codified, auditable constraint.
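A minimal gate sketch for that pipeline step; the JSON report shape (`slo_pass`/`slo_fail` counts) is an assumption about how your benchmark exports its summary:

```python
import json

def goodput_gate(report_json: str, threshold_pct: float = 95.0) -> int:
    """Return a CI exit code: 0 (promote) or 1 (block) based on benchmark goodput."""
    report = json.loads(report_json)
    total = report["slo_pass"] + report["slo_fail"]
    goodput_pct = 100.0 * report["slo_pass"] / total if total else 0.0
    print(f"goodput={goodput_pct:.1f}% threshold={threshold_pct}%")
    return 0 if goodput_pct >= threshold_pct else 1

# Example: 93.0% goodput blocks promotion
print(goodput_gate('{"slo_pass": 930, "slo_fail": 70}'))  # -> 1
```

In CI, wire the return value into `sys.exit()` so an SLO regression fails the job rather than merely logging.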
Production Deployment Audit Checklist:
- [ ] TTFT SLO ≤ 500ms at P95 — Verified via SLO-aware load test at production concurrency level
- [ ] TPOT SLO ≤ 50ms at P95 — Verified independently from TTFT; a TPOT breach indicates prefill-decode interference
- [ ] Goodput ≥ 95% of requests — Both TTFT and TPOT SLOs met simultaneously per request
- [ ] vLLM ≥ 0.4.x — Chunked prefill requires iteration-level scheduler; verify before deployment
- [ ] `enable_chunked_prefill=True` — Explicitly set; not default in all model configurations
- [ ] `max_num_batched_tokens` tuned per GPU tier — Do not use the default 4096 without profiling on target hardware
- [ ] `max_model_len` scoped to P99 traffic — Over-provisioning reserves unnecessary KV cache pages
- [ ] `gpu_memory_utilization` ≤ 0.92 — Reserve headroom for CUDA context and burst spike buffering
- [ ] NVLink peer access verified (disaggregated only) — `torch.cuda.can_device_access_peer()` returns `True`
- [ ] CUDA 12.x+ installed — Required for kernel-level chunked prefill optimizations
- [ ] P95/P99 tail latency logged separately for TTFT and TPOT — Aggregate latency metrics mask SLO breaches
- [ ] Goodput regression test integrated into CI/CD — Automated gate; blocks promotion on SLO regression
- [ ] Dynamic ChunkSize controller enabled (bursty traffic) — Static configuration insufficient for Poisson-distributed arrivals
- [ ] KV cache eviction rate monitored — Non-zero eviction indicates `max_num_seqs` is too high for available VRAM
The shift from throughput-optimized to goodput-optimized serving infrastructure is not a single configuration change—it is a systematic reorientation of what the system is measured against. Every component of the scheduler, from ChunkSize to batch limits to memory allocation, must be evaluated against SLO compliance rates rather than raw token generation speed. The math is unambiguous: DistServe's 4.48x goodput improvement over colocated systems is achievable only when the infrastructure is designed from the beginning to treat per-request latency SLOs as first-class constraints.
Keywords: Iteration-level scheduling, Prefill-decode disaggregation, Goodput vs Throughput metrics, Memory-bandwidth contention, TTFT (Time To First Token), TPOT (Time Per Output Token), vLLM PagedAttention, CUDA kernel occupancy, SLA-based inference optimization, Request jitter mitigation