Modern LLM serving under production load exposes a fundamental hardware mismatch: the two phases of inference—prefill and decode—have diametrically opposed resource profiles, yet monolithic deployments force them to compete for the same silicon. The consequence is predictable: neither phase runs efficiently, P99 TTFT bloats under concurrency, and GPU utilization figures that look reasonable in dashboards mask deep structural waste. Disaggregated inference solves this by partitioning the compute graph into dedicated pools, each sized and scheduled for its actual workload. The llm-d framework on AWS operationalizes this architecture in Kubernetes-native environments, pairing it with NIXL and EFA to make KV-cache transfer fast enough to justify the coordination overhead.
## The Mechanical Limit: Why Monolithic Inference Fails at Scale
LLM inference is not a single computational workload. Prefill processes the entire input prompt in a single parallel pass—it is compute-bound, saturating tensor core utilization and demanding peak FLOP/s. Decode, by contrast, generates one token per forward pass, gated entirely by how fast the GPU can read KV-cache tensors from HBM—it is memory-bandwidth-bound. These two profiles are irreconcilable on shared hardware, leading to the widely recognized performance degradation cited in recent industry research (Source: Awesome Agents, 2026).
| Characteristic | Prefill Phase | Decode Phase |
|---|---|---|
| Bottleneck | Compute (FLOP/s) | Memory Bandwidth (GB/s) |
| Parallelism | Full sequence, batch-friendly | Autoregressive, serial per token |
| GPU Utilization Profile | High SM occupancy | Low SM occupancy, high HBM read pressure |
| Ideal Hardware | Dense tensor-core GPUs (H100 SXM) | High-memory-bandwidth GPUs or disaggregated DRAM |
| Batch Sensitivity | Improves with larger batches | Diminishing returns above small batch sizes |
| KV-Cache Role | Written during execution | Read continuously at each step |
When prefill and decode run on the same GPU pool, every long-prompt request that arrives during active decoding interrupts token generation and spikes TTFT for in-flight sequences. Conversely, scheduling decode-heavy traffic on hardware provisioned for prefill wastes tensor core capacity on memory-latency-bound operations. Monolithic colocation forces a false compromise: operators either overprovision to absorb burst prefill load (wasting decode capacity) or accept SLO violations when prefill queues back up.
Technical Warning: The interference effect is nonlinear. At 70% GPU utilization under mixed prefill-decode traffic, P99 TTFT can degrade by 3–5x compared to isolated prefill execution, because a single large prefill job can stall dozens of mid-decode sequences waiting for the next KV-cache read.
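The nonlinearity is easy to reproduce with a toy model: hold per-token decode time constant and let a small fraction of decode steps land behind a large prefill job on the shared GPU. The parameters below (10ms decode steps, 400ms prefill bursts, 2% collision rate) are illustrative, not measurements:

```python
import random


def simulate_tpot(n_steps=10_000, decode_ms=10.0, prefill_ms=400.0,
                  prefill_prob=0.02, seed=0):
    """Toy interference model: a shared GPU runs decode steps, but with
    probability `prefill_prob` a prefill job preempts the step. Returns
    (mean, p99) inter-token latency in ms for the decode stream."""
    rng = random.Random(seed)
    latencies = []
    for _ in range(n_steps):
        step = decode_ms
        if rng.random() < prefill_prob:  # a prefill burst lands ahead of us
            step += prefill_ms
        latencies.append(step)
    latencies.sort()
    mean = sum(latencies) / len(latencies)
    p99 = latencies[int(0.99 * len(latencies))]
    return mean, p99
```

With these numbers the mean barely moves (roughly 10ms to 18ms), but the P99 jumps from the decode step time to the full prefill duration, which is exactly the tail-latency signature the warning describes.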
## Core Principles of Disaggregated Serving
Disaggregated serving physically separates the prefill and decode phases into distinct GPU pools connected by a high-speed fabric. A router receives the incoming request, dispatches it to a prefill worker, which processes the prompt and writes the resulting KV-cache, then transfers that cache to a decode worker that owns the autoregressive generation loop. As one framing puts it: disaggregated serving "splits the bill by using separate GPU clusters—one group focuses on reading, passes the data (the infamous KV cache) to the second cluster that focuses exclusively on writing" (Source: Sriranjani Ramasubramanian, LinkedIn Pulse).
This separation creates independent scaling dimensions. Prefill pool size tracks prompt length distribution and request arrival rate. Decode pool size tracks output token count and active session concurrency. Both can be autoscaled independently against their respective SLOs without impacting the other.
The request lifecycle in a disaggregated system follows a strict sequence:
```mermaid
sequenceDiagram
    participant Client
    participant Router as llm-d Router / EPP
    participant PW as Prefill Worker (GPU Pool A)
    participant KV as KV-Cache Sidecar (NIXL/EFA)
    participant DW as Decode Worker (GPU Pool B)
    Client->>Router: POST /v1/completions {prompt}
    Router->>Router: Compute prefix hash, check cache locality
    Router->>PW: Dispatch prefill request
    PW->>PW: Forward pass over full prompt sequence
    PW->>KV: Serialize KV tensors → NIXL buffer
    KV->>DW: RDMA transfer via EFA (zero-copy)
    DW->>DW: Load KV-cache, begin autoregressive decode
    DW-->>Client: Stream tokens (SSE / chunked response)
```
The critical invariant is that the decode worker never executes a prefill step—it receives a fully materialized KV-cache and begins generating immediately. This eliminates the queuing interaction between the two phases and allows the decode worker's scheduler to maintain tight control over time-per-output-token (TPOT) without prefill interference.
## Mechanics of KV-Cache Serialization and Transfer
The architectural elegance of disaggregated serving collapses without a fast KV-cache transfer mechanism. "High-performance disaggregated serving and efficient KV cache offloading critically depend on ultra-fast, low-latency data transfer between GPUs and across diverse memory or storage tiers" (Source: AWS Machine Learning Blog). NIXL (NVIDIA Inference Xfer Library) provides this layer, implementing optimized point-to-point tensor transfer over RDMA with direct GPU memory addressing—bypassing CPU-side staging entirely.
The following sketch shows how a prefill worker serializes its KV-cache tensor buffer and enqueues it for NIXL transmission to a target decode worker. The binding names (`nixl.Agent`, `register_buffer`, `TransferDescriptor`) are illustrative; verify them against the API of the NIXL release you deploy:
```python
import torch
import nixl  # NVIDIA Inference Xfer Library runtime bindings (illustrative API)


def transfer_kv_cache(
    kv_tensors: list[torch.Tensor],
    decode_worker_addr: str,
    request_id: str,
    nixl_agent: nixl.Agent,
) -> nixl.TransferHandle:
    """
    Serializes KV-cache tensors from a prefill worker and initiates
    an RDMA transfer to the target decode worker via NIXL.
    Returns the transfer handle for downstream acknowledgment tracking.
    """
    # Flatten layer-wise KV tensors into one contiguous buffer. The buffer
    # stays in GPU memory: with GPUDirect RDMA the NIC reads it in place,
    # so no host-side staging copy (and no host pinning) is needed.
    kv_flat = torch.cat([t.reshape(-1) for t in kv_tensors], dim=0).contiguous()

    # Register the buffer with the NIXL agent to obtain a memory handle.
    # This maps the GPU virtual address range into the RDMA-capable region.
    mem_handle = nixl_agent.register_buffer(
        ptr=kv_flat.data_ptr(),
        size=kv_flat.nbytes,
        device_id=kv_flat.device.index if kv_flat.is_cuda else -1,
    )

    # Construct transfer descriptor: target worker endpoint + request metadata.
    transfer_desc = nixl.TransferDescriptor(
        dst_addr=decode_worker_addr,
        mem_handle=mem_handle,
        tag=request_id,  # Used by decode worker to correlate incoming cache
        size_bytes=kv_flat.nbytes,
    )

    # Initiate non-blocking RDMA write. NIXL manages completion polling internally.
    handle = nixl_agent.initiate_transfer(transfer_desc)
    return handle  # Caller polls handle.wait() before releasing the prefill slot


def await_transfer_completion(
    handle: nixl.TransferHandle,
    timeout_ms: int = 500,
) -> bool:
    """
    Blocks until the NIXL transfer completes or the timeout is reached.
    A False return signals the router to trigger re-planning logic.
    """
    return handle.wait(timeout_ms=timeout_ms)
```
Pro-Tip: Register memory buffers at worker startup, not per-request. NIXL buffer registration carries a one-time overhead of 2–5ms per allocation. Pre-registering a pool of reusable buffers sized to your 99th-percentile KV-cache footprint eliminates this from the critical path.
The metadata tag (request_id) on the transfer descriptor is what the decode worker uses to bind incoming cache data to the correct generation slot. This correlation mechanism must be globally unique and collision-resistant—use ULIDs or UUIDv7 rather than sequential integers.
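At the time of writing, a stdlib `uuid.uuid7` is only available in very recent Python releases, so a minimal hand-rolled UUIDv7 generator (per RFC 9562) is a workable stand-in for request-id generation:

```python
import os
import time
import uuid


def uuid7() -> uuid.UUID:
    """Minimal UUIDv7 (RFC 9562): 48-bit Unix-millisecond timestamp in the
    high bits, random low bits, with version and variant fields set.
    Time-ordered and collision-resistant, unlike sequential integers."""
    ts_ms = time.time_ns() // 1_000_000
    rand = int.from_bytes(os.urandom(10), "big")      # 80 random bits
    value = (ts_ms & ((1 << 48) - 1)) << 80 | rand
    value &= ~(0xF << 76)
    value |= 0x7 << 76                                # version = 7
    value &= ~(0x3 << 62)
    value |= 0x2 << 62                                # variant = RFC 4122/9562
    return uuid.UUID(int=value)
```

Because the timestamp occupies the high bits, these IDs also sort in creation order, which is convenient for correlating transfers in logs.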
## EFA vs Standard Networking: Performance Trade-offs
KV-cache transfer over standard TCP/IP is architecturally incompatible with tight TTFT targets at scale. A single KV-cache for a 32-layer, 4096-token prompt in BF16 precision occupies roughly 512MB–2GB depending on model size. Transmitting this over a 100Gbps Ethernet fabric with TCP/IP stack overhead—kernel copies, socket buffer traversal, ACK round-trips—introduces 8–40ms of serialization latency per transfer depending on congestion state. Under P99 conditions with concurrent transfers from multiple prefill workers, this becomes the dominant latency source. "NIXL with EFA enables high throughput, low-latency inference serving" (Source: AWS China News) because the critical path from GPU memory on the prefill node to GPU memory on the decode node never touches a kernel buffer.
| Transfer Mechanism | Protocol Stack | Typical Latency (512MB KV) | CPU Overhead | Congestion Sensitivity |
|---|---|---|---|---|
| Standard TCP/IP (100GbE) | Full kernel + socket | 35–80ms P99 | High (memcpy, syscalls) | High |
| RDMA over RoCE | Kernel bypass | 8–15ms P99 | Near-zero | Medium |
| EFA + NIXL (RDMA) | Full kernel bypass + zero-copy | 3–8ms P99 | Minimal | Low (SRD protocol) |
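The payload sizes quoted above can be sanity-checked from model shape alone. A minimal calculator (the 32-layer model dimensions below are illustrative, chosen to bracket the 512MB–2GB range):

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, dtype_bytes: int = 2) -> int:
    """KV-cache footprint: K and V tensors per layer, one vector per token
    per KV head, in the given precision (2 bytes for BF16)."""
    return 2 * n_layers * seq_len * n_kv_heads * head_dim * dtype_bytes


# Hypothetical 32-layer models at a 4096-token prompt, BF16:
full_mha = kv_cache_bytes(32, n_kv_heads=32, head_dim=128, seq_len=4096)
gqa = kv_cache_bytes(32, n_kv_heads=8, head_dim=128, seq_len=4096)
print(full_mha // 2**20, "MiB")  # full multi-head attention: 2048 MiB
print(gqa // 2**20, "MiB")       # grouped-query attention, 8 KV heads: 512 MiB
```

Grouped-query attention shrinks the payload linearly in the number of KV heads, which is why the same prompt length spans a 4x range across model families.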
EFA (Elastic Fabric Adapter) on AWS uses the Scalable Reliable Datagram (SRD) transport protocol, which distributes packets across multiple network paths simultaneously and delivers out-of-order completion semantics that NIXL handles natively. The combination eliminates TCP head-of-line blocking—the single largest contributor to P99 latency spikes in multi-tenant inference clusters.
Technical Warning: EFA is a placement-group constraint. Prefill and decode workers must reside in the same EC2 placement group or within the same Availability Zone to operate at rated EFA bandwidth. Cross-AZ EFA traffic falls back to standard networking, silently destroying your latency budget.
## Implementing the llm-d Framework in Kubernetes
"llm-d is a high-performance distributed inference serving stack optimized for production deployments on Kubernetes" (Source: GitHub/llm-d), sitting as an orchestration layer between the inference engine (vLLM being the primary backend) and the Kubernetes scheduler. It exposes the Gateway API Inference Extension (GAIE) for intelligent request routing and manages the lifecycle of prefill and decode worker pools as distinct Kubernetes workloads.
The following manifest defines a decode worker Pod with the NIXL KV-cache sidecar injected alongside the primary vLLM container. Note that Kubernetes v1.33+ stable sidecar support is required for correct startup/shutdown ordering:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: llm-d-decode-worker-0
  labels:
    llm-d.ai/role: decode
    llm-d.ai/model: llama-3-70b
spec:
  # Enforce colocation with prefill workers via topology label for EFA adjacency
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              llm-d.ai/role: prefill
              llm-d.ai/model: llama-3-70b
          topologyKey: topology.kubernetes.io/zone
  initContainers:
    # Sidecar declared as an initContainer with restartPolicy: Always (K8s v1.33+ native sidecar)
    - name: nixl-kvcache-agent
      image: ghcr.io/llm-d/llm-d-aws:v0.4.1
      restartPolicy: Always  # Marks this as a native sidecar; lives for the pod lifetime
      env:
        - name: NIXL_ROLE
          value: "receiver"  # Decode workers receive; prefill workers send
        - name: NIXL_LISTEN_PORT
          value: "18000"
        - name: NIXL_TRANSPORT
          value: "efa"       # Force EFA transport; fail fast if unavailable
        - name: NIXL_BUFFER_POOL_MB
          value: "4096"      # Pre-allocate a 4GB RDMA buffer pool at startup
        - name: NIXL_REQUEST_TIMEOUT_MS
          value: "500"
      resources:
        limits:
          hugepages-2Mi: 2Gi  # RDMA zero-copy requires hugepage backing
          memory: 6Gi
        requests:
          hugepages-2Mi: 2Gi
          memory: 6Gi
      volumeMounts:
        - name: hugepage-vol
          mountPath: /dev/hugepages
  containers:
    - name: vllm-decode
      image: vllm/vllm-openai:v0.6.3
      args:
        - "--model=/models/llama-3-70b"
        - "--tensor-parallel-size=4"
        - "--kv-transfer-agent=nixl"  # vLLM NIXL integration flag
        - "--kv-transfer-port=18000"  # Must match sidecar NIXL_LISTEN_PORT
        - "--disable-prefill"         # This worker ONLY decodes
        - "--max-model-len=8192"
      resources:
        limits:
          nvidia.com/gpu: 4
      volumeMounts:
        - name: model-vol
          mountPath: /models
  volumes:
    - name: hugepage-vol
      emptyDir:
        medium: HugePages-2Mi
    - name: model-vol
      persistentVolumeClaim:
        claimName: llm-model-pvc
```
Pro-Tip: Set `NIXL_TRANSPORT=efa` explicitly and configure a liveness probe on the NIXL agent that validates EFA device availability. Without this, pods scheduled on non-EFA-capable nodes will silently fall back to TCP, and you will not detect the regression until P99 TTFT breaches SLO in production.
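Such a probe can be sketched in a few lines of Python. The sysfs path and `rdmap*` device naming reflect how EFA devices typically surface on EC2 instances, but verify both against your AMI and driver version:

```python
import pathlib
import sys


def efa_devices(sysfs_root: str = "/sys/class/infiniband") -> list[str]:
    """Lists RDMA devices exposed under sysfs. On EFA-capable EC2 instances
    the EFA provider surfaces devices here (typically named rdmap*); an
    empty list means transfers would silently fall back to TCP."""
    root = pathlib.Path(sysfs_root)
    if not root.is_dir():
        return []
    return sorted(p.name for p in root.iterdir())


if __name__ == "__main__":
    devices = efa_devices()
    print(f"RDMA devices: {devices or 'none'}")
    sys.exit(0 if devices else 1)  # nonzero exit fails the liveness probe
```

Wired into the sidecar as an `exec` liveness probe, this turns the silent TCP fallback into a pod restart loop that surfaces in standard Kubernetes alerting.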
## Managing State Persistence and Mid-execution Re-planning
The llm-d router—implemented as an Endpoint Picker (EPP) within the GAIE layer—is responsible for maintaining routing fidelity when a decode worker becomes unavailable mid-generation. "The EPP computes a prefix hash for each request's prompt and routes it to the Pod that already has that prefix cached" (Source: dev.to/x4nent). This prefix-aware routing is the primary mechanism for KV-cache reuse, and it must remain consistent across failover events.
The following illustrates the router's state-check and re-planning logic when a decode worker fails to acknowledge a keepalive during active generation:
```python
import hashlib
import time
from dataclasses import dataclass, field
from enum import Enum


class WorkerState(Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    FAILED = "failed"


@dataclass
class DecodeSession:
    request_id: str
    prefix_hash: str
    assigned_worker: str
    tokens_generated: int
    kv_cache_handle: str  # NIXL transfer handle reference
    last_heartbeat: float = field(default_factory=time.monotonic)


def compute_prefix_hash(prompt: str) -> str:
    """Stable hash for prefix-aware cache locality routing."""
    return hashlib.sha256(prompt.encode()).hexdigest()[:16]


def replan_failed_session(
    session: DecodeSession,
    worker_registry: dict[str, WorkerState],
    kv_cache_store: dict[str, bytes],  # Shared cache backend (e.g., Redis/NVMe)
) -> DecodeSession | None:
    """
    Called when a decode worker fails mid-generation.
    Attempts to migrate the session to a healthy worker with cache reinjection.
    Returns a new DecodeSession on success, None if unrecoverable.
    """
    # Find healthy decode workers, prefer those with matching prefix in local cache
    candidates = [
        addr for addr, state in worker_registry.items()
        if state == WorkerState.HEALTHY and addr != session.assigned_worker
    ]
    if not candidates:
        # No healthy workers available; surface 503 to the client immediately
        return None

    # Prioritize workers that already hold the prefix in their KV buffer
    preferred = [
        addr for addr in candidates
        if session.prefix_hash in get_worker_cache_index(addr)
    ]
    target_worker = preferred[0] if preferred else candidates[0]

    # Reinject KV-cache from the durable store if the target lacks a local copy
    if session.prefix_hash not in get_worker_cache_index(target_worker):
        cached_kv = kv_cache_store.get(session.kv_cache_handle)
        if cached_kv is None:
            # KV-cache was ephemeral and lost; must restart generation from token 0
            return None
        push_kv_to_worker(target_worker, session.kv_cache_handle, cached_kv)

    return DecodeSession(
        request_id=session.request_id,
        prefix_hash=session.prefix_hash,
        assigned_worker=target_worker,
        tokens_generated=session.tokens_generated,  # Resume from last confirmed token
        kv_cache_handle=session.kv_cache_handle,
    )


def get_worker_cache_index(worker_addr: str) -> set[str]:
    """Stub: returns the set of prefix hashes cached on a given worker."""
    return set()  # Implementation-specific cache index lookup


def push_kv_to_worker(worker_addr: str, handle: str, kv_bytes: bytes) -> None:
    """Stub: pushes serialized KV-cache bytes to a worker's NIXL receiver."""
    pass  # Implementation-specific NIXL transfer call
```
The key design decision here is whether KV-cache is ephemeral (GPU-local only) or durable (spilled to NVMe or a shared cache store). Ephemeral cache cuts infrastructure complexity but makes failover a full-restart. Durable cache enables seamless session migration but adds write-path latency on every prefill completion.
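If you choose the durable path, the write-path latency cost can be kept off the prefill critical path with a write-behind spill. A minimal sketch, where `store` is any dict-like backend (a Redis client wrapper, an NVMe-backed KV store, or a plain dict in tests):

```python
import queue
import threading


class DurableKvSpiller:
    """Write-behind spill of completed prefill KV-caches to a durable store.
    spill() enqueues and returns immediately, so prefill completion pays O(1);
    the trade-off is a short window in which failover still means a full
    restart because the durable copy is not yet written."""

    def __init__(self, store):
        self._store = store
        self._q: queue.Queue = queue.Queue()
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def spill(self, handle: str, kv_bytes: bytes) -> None:
        self._q.put((handle, kv_bytes))  # O(1) on the prefill critical path

    def _drain(self) -> None:
        while True:
            item = self._q.get()
            if item is None:              # shutdown sentinel from close()
                self._q.task_done()
                break
            handle, kv_bytes = item
            self._store[handle] = kv_bytes  # durable write happens off-thread
            self._q.task_done()

    def flush(self) -> None:
        self._q.join()  # block until all pending spills are durable

    def close(self) -> None:
        self._q.put(None)
        self._worker.join()
```

The re-planning logic above only needs `flush()` semantics at checkpoint boundaries; everything else stays asynchronous.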
## Optimizing Goodput via Precise Scheduling
Goodput—the rate of useful token generation delivered within SLO bounds—is the primary metric that disaggregation optimizes. The 4.5x goodput improvement cited for disaggregated architectures is not a marketing figure; it is a function of eliminating the multiplicative interference between prefill bursts and decode latency requirements. "Reduce time to first token (TTFT) and get more predictable time per output token (TPOT) by splitting inference into prefill servers handling prompts and decode servers handling responses" (Source: llm-d GitHub). The scheduling implication is that the prefill-to-decode worker ratio must track the actual traffic load, not a static estimate.
$$ G_{\text{disagg}} = G_{\text{mono}} \cdot \frac{1}{1 - \left(\alpha \cdot \delta_P + (1-\alpha) \cdot \delta_D\right)} $$
Where:

- $G_{\text{mono}}$ = baseline goodput under monolithic serving
- $\alpha$ = fraction of GPU time consumed by prefill under the current traffic mix
- $\delta_P$ = fractional compute waste due to decode interference on prefill hardware
- $\delta_D$ = fractional memory-bandwidth waste due to prefill interference on decode hardware
In typical production workloads with input-to-output ratios of 4:1 (long-prompt, short-answer patterns), $\alpha$ can reach 0.6–0.7, and interference terms $\delta_P$, $\delta_D$ compound to degrade effective goodput by 60–75% versus the theoretical maximum. Disaggregation drives both $\delta$ terms toward zero by construction. Implementing an autoscaling policy that adjusts this ratio based on rolling 5-minute TTFT and TPOT percentiles gives you dynamic alignment with traffic shape.
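One proportional-control step of such a policy can be sketched as follows. All thresholds and step sizes here are illustrative assumptions, not tuned values:

```python
def target_pool_sizes(
    p99_ttft_ms: float, p99_tpot_ms: float,
    ttft_slo_ms: float, tpot_slo_ms: float,
    prefill_workers: int, decode_workers: int,
    max_step: int = 2,
) -> tuple[int, int]:
    """One autoscaling step: grow whichever pool is breaching its SLO,
    proportionally to the breach; shrink a pool with large headroom.
    Inputs are the rolling 5-minute P99 percentiles from metrics."""
    def step(p99: float, slo: float, current: int) -> int:
        ratio = p99 / slo
        if ratio > 1.0:                     # breaching: scale out
            return current + min(max_step, max(1, round(current * (ratio - 1))))
        if ratio < 0.5 and current > 1:     # ample headroom: scale in
            return current - 1
        return current
    return (step(p99_ttft_ms, ttft_slo_ms, prefill_workers),
            step(p99_tpot_ms, tpot_slo_ms, decode_workers))
```

Because TTFT drives the prefill pool and TPOT drives the decode pool, the two decisions are independent, which is exactly the property disaggregation buys.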
## Common Pitfalls in Disaggregated Architectures
Disaggregation shifts the bottleneck from GPU utilization to network fabric and orchestration correctness. "Disaggregated serving stacks overwhelmingly rely on network-based KV transfers" (Source: ArXiv 2512.18194)—which means every architectural assumption about network reliability that held in monolithic deployments must be revisited.
Pitfall 1: Static prefill-decode ratios. Deploying a fixed 1:2 prefill-to-decode ratio handles one traffic pattern. When prompt length distribution shifts—a common occurrence with agent-driven workloads generating long context windows—prefill queues back up and TTFT violates SLO despite decode workers sitting idle.
Pitfall 2: Unregistered NIXL buffer pools. Allocating RDMA buffers on the per-request critical path introduces 2–5ms of avoidable latency per transfer. Pre-register pools at pod startup and manage them as a fixed resource with a semaphore-guarded acquisition pattern.
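A minimal sketch of that acquisition pattern, using a thread-safe free list as the implicit semaphore (the `register` callable stands in for the real, expensive NIXL registration step):

```python
from contextlib import contextmanager
from queue import Empty, Queue


class RdmaBufferPool:
    """Fixed pool of pre-registered transfer buffers checked out per request.
    Registration (the 2-5ms step) happens once at startup; the hot path only
    moves a buffer between the free list and the caller."""

    def __init__(self, n_buffers: int, buf_bytes: int, register=bytearray):
        self._free: Queue = Queue()
        for _ in range(n_buffers):
            self._free.put(register(buf_bytes))  # one-time registration cost

    @contextmanager
    def acquire(self, timeout_s: float = 0.5):
        try:
            buf = self._free.get(timeout=timeout_s)  # blocks like a semaphore
        except Empty:
            raise TimeoutError("buffer pool exhausted; shed this transfer")
        try:
            yield buf
        finally:
            self._free.put(buf)  # return to pool; never deregister on hot path
```

Sizing `n_buffers * buf_bytes` to the 99th-percentile KV-cache footprint (as the Pro-Tip above suggests) makes the `TimeoutError` path a genuine overload signal rather than routine noise.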
Pitfall 3: Ignoring EFA placement constraints. Scheduling prefill and decode workers across availability zones without placement group enforcement results in cross-AZ network traversal, destroying the bandwidth and latency characteristics that make NIXL viable.
Critical observability metrics for detecting EFA saturation before it becomes a P99 event:
- EFA Queue Depth (`efa_sq_full_events/sec`): A non-zero send-queue-full rate indicates the NIC is saturated. Any sustained value above 0 means KV transfers are queuing behind the fabric, not the GPU.
- NIXL Buffer Occupancy (`nixl_buffer_pool_utilization`): When pre-registered buffer pool occupancy exceeds 80%, subsequent transfers will either block waiting for free slots or fail with a timeout. This metric predicts transfer failures 30–60 seconds before they manifest as TTFT spikes.
- Cross-node KV-cache transfer latency P99 (`nixl_transfer_latency_p99_ms`): The baseline for EFA-backed NIXL transfers on `p5.48xlarge` instances should be under 10ms for 512MB payloads. P99 values exceeding 25ms indicate fabric congestion or topology violations.
Technical Warning: Standard Kubernetes network monitoring (CNI-level metrics) does not expose EFA-specific queue depth. Deploy the AWS EFA monitoring daemonset and configure Prometheus scraping against the EFA kernel module's sysfs interface to get these signals.
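Where a full exporter is more than you need, the sysfs read itself is small. The `ports/1/hw_counters` layout below matches common RDMA drivers but is an assumption to verify against your EFA driver version:

```python
import pathlib


def read_efa_counters(device: str,
                      root: str = "/sys/class/infiniband") -> dict[str, int]:
    """Reads per-port hardware counters for an RDMA device from sysfs.
    Each counter is a small text file containing one integer; names and
    paths vary by driver version, so treat missing directories as empty."""
    counters: dict[str, int] = {}
    hw = pathlib.Path(root, device, "ports", "1", "hw_counters")
    if not hw.is_dir():
        return counters
    for f in hw.iterdir():
        try:
            counters[f.name] = int(f.read_text().strip())
        except (ValueError, OSError):
            continue  # skip non-numeric or unreadable entries
    return counters
```

Polling this on a short interval and exposing deltas-per-second gives you the queue-depth and saturation signals the CNI layer cannot see.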
## Operationalizing the Future of Inference
The total cost of ownership argument for disaggregated serving is structural, not incremental. Monolithic inference forces operators to provision all hardware to the peak-demand profile of the most resource-intensive phase. Disaggregation allows GPU resources to be allocated with surgical precision: smaller, cheaper memory-bandwidth-optimized instances for decode pools; denser compute instances for prefill, scaled transiently based on arrival rate. "This integration enhances disaggregated inference serving through three key improvements: increased KV-cache throughput, reduced inter-token latency, and optimized KV-cache memory utilization" (Source: AWS News).
At production scale—thousands of concurrent sessions, mixed prompt length distributions, SLA-bound enterprise traffic—the compounding effect of eliminating phase interference, enabling independent autoscaling, and routing with cache locality awareness yields both performance improvements (4.5x goodput and measurably lower P99 TTFT) and cost reductions that compound as model sizes and context lengths grow. The engineering investment is real: Kubernetes v1.33+, EFA-capable instance types, NIXL integration, and careful observability instrumentation. But the alternative—continuing to colocate fundamentally incompatible workloads on shared hardware—scales worse with every generation of larger models and longer context windows. Disaggregation is not an optimization; it is the correct architectural primitive for inference at scale.