Modern LLM serving under production load exposes a fundamental hardware mismatch: the two phases of inference—prefill and decode—have diametrically opposed resource profiles, yet monolithic deployments force them to compete for the same silicon. The consequence is predictable: neither phase runs efficiently, P99 TTFT bloats under concurrency, and GPU utilization figures that look reasonable in dashboards mask deep structural waste. Disaggregated inference solves this by partitioning the compute graph into dedicated pools, each sized and scheduled for its actual workload. The llm-d framework on AWS operationalizes this architecture in Kubernetes-native environments, pairing it with NIXL and EFA to make KV-cache transfer fast enough to justify the coordination overhead.
## The Mechanical Limit: Why Monolithic Inference Fails at Scale
LLM inference is not a single computational workload. Prefill processes the entire input prompt in a single parallel pass—it is compute-bound, saturating tensor core utilization and demanding peak FLOP/s. Decode, by contrast, generates one token per forward pass, gated entirely by how fast the GPU can read KV-cache tensors from HBM—it is memory-bandwidth-bound. These two profiles are irreconcilable on shared hardware, leading to the widely recognized performance degradation cited in recent industry research (Source: Awesome Agents, 2026).
| Characteristic | Prefill Phase | Decode Phase |
|---|---|---|
| Bottleneck | Compute (FLOP/s) | Memory Bandwidth (GB/s) |
| Parallelism | Full sequence, batch-friendly | Autoregressive, serial per token |
| GPU Utilization Profile | High SM occupancy | Low SM occupancy, high HBM read pressure |
| Ideal Hardware | Dense tensor-core GPUs (H100 SXM) | High-memory-bandwidth GPUs or disaggregated DRAM |
| Batch Sensitivity | Improves with larger batches | Diminishing returns above small batch sizes |
| KV-Cache Role | Written during execution | Read continuously at each step |
When prefill and decode run on the same GPU pool, every long-prompt request that arrives during active decoding interrupts token generation and spikes TTFT for in-flight sequences. Conversely, scheduling decode-heavy traffic on hardware provisioned for prefill wastes tensor core capacity on memory-latency-bound operations. Monolithic colocation forces a false compromise: operators either overprovision to absorb burst prefill load (wasting decode capacity) or accept SLO violations when prefill queues back up.
Technical Warning: The interference effect is nonlinear. At 70% GPU utilization under mixed prefill-decode traffic, P99 TTFT can degrade by 3–5x compared to isolated prefill execution, because a single large prefill job can stall dozens of mid-decode sequences waiting for the next KV-cache read.
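The nonlinearity is easy to reproduce with a toy model: hold per-token decode time constant and let a small fraction of decode steps land behind a large prefill job on the shared GPU. The parameters below (10ms decode steps, 400ms prefill bursts, 2% collision rate) are illustrative, not measurements:

```python
import random


def simulate_tpot(n_steps=10_000, decode_ms=10.0, prefill_ms=400.0,
                  prefill_prob=0.02, seed=0):
    """Toy interference model: a shared GPU runs decode steps, but with
    probability `prefill_prob` a prefill job preempts the step. Returns
    (mean, p99) inter-token latency in ms for the decode stream."""
    rng = random.Random(seed)
    latencies = []
    for _ in range(n_steps):
        step = decode_ms
        if rng.random() < prefill_prob:  # a prefill burst lands ahead of us
            step += prefill_ms
        latencies.append(step)
    latencies.sort()
    mean = sum(latencies) / len(latencies)
    p99 = latencies[int(0.99 * len(latencies))]
    return mean, p99
```

With these numbers the mean barely moves (roughly 10ms to 18ms), but the P99 jumps from the decode step time to the full prefill duration, which is exactly the tail-latency signature the warning describes.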
## Core Principles of Disaggregated Serving
Disaggregated serving physically separates the prefill and decode phases into distinct GPU pools connected by a high-speed fabric. A router receives the incoming request, dispatches it to a prefill worker, which processes the prompt and writes the resulting KV-cache, then transfers that cache to a decode worker that owns the autoregressive generation loop. As one framing puts it: disaggregated serving "splits the bill by using separate GPU clusters—one group focuses on reading, passes the data (the infamous KV cache) to the second cluster that focuses exclusively on writing" (Source: Sriranjani Ramasubramanian, LinkedIn Pulse).
This separation creates independent scaling dimensions. Prefill pool size tracks prompt length distribution and request arrival rate. Decode pool size tracks output token count and active session concurrency. Both can be autoscaled independently against their respective SLOs without impacting the other.
The request lifecycle in a disaggregated system follows a strict sequence:
```mermaid
sequenceDiagram
    participant Client
    participant Router as llm-d Router / EPP
    participant PW as Prefill Worker (GPU Pool A)
    participant KV as KV-Cache Sidecar (NIXL/EFA)
    participant DW as Decode Worker (GPU Pool B)
    Client->>Router: POST /v1/completions {prompt}
    Router->>Router: Compute prefix hash, check cache locality
    Router->>PW: Dispatch prefill request
    PW->>PW: Forward pass over full prompt sequence
    PW->>KV: Serialize KV tensors → NIXL buffer
    KV->>DW: RDMA transfer via EFA (zero-copy)
    DW->>DW: Load KV-cache, begin autoregressive decode
    DW-->>Client: Stream tokens (SSE / chunked response)
```
The critical invariant is that the decode worker never executes a prefill step—it receives a fully materialized KV-cache and begins generating immediately. This eliminates the queuing interaction between the two phases and allows the decode worker's scheduler to maintain tight control over time-per-output-token (TPOT) without prefill interference.
## Mechanics of KV-Cache Serialization and Transfer
The architectural elegance of disaggregated serving collapses without a fast KV-cache transfer mechanism. "High-performance disaggregated serving and efficient KV cache offloading critically depend on ultra-fast, low-latency data transfer between GPUs and across diverse memory or storage tiers" (Source: AWS Machine Learning Blog). NIXL (NVIDIA Inference Xfer Library) provides this layer, implementing optimized point-to-point tensor transfer over RDMA with direct GPU memory addressing—bypassing CPU-side staging entirely.
The following sketch shows how a prefill worker serializes its KV-cache tensor buffer and enqueues it for NIXL transmission to a target decode worker. The binding names (`nixl.Agent`, `register_buffer`, `TransferDescriptor`) are illustrative; verify them against the API of the NIXL release you deploy:
```python
import torch
import nixl  # NVIDIA Inference Xfer Library runtime bindings (illustrative API)


def transfer_kv_cache(
    kv_tensors: list[torch.Tensor],
    decode_worker_addr: str,
    request_id: str,
    nixl_agent: nixl.Agent,
) -> nixl.TransferHandle:
    """
    Serializes KV-cache tensors from a prefill worker and initiates
    an RDMA transfer to the target decode worker via NIXL.
    Returns the transfer handle for downstream acknowledgment tracking.
    """
    # Flatten layer-wise KV tensors into one contiguous buffer. The buffer
    # stays in GPU memory: with GPUDirect RDMA the NIC reads it in place,
    # so no host-side staging copy (and no host pinning) is needed.
    kv_flat = torch.cat([t.reshape(-1) for t in kv_tensors], dim=0).contiguous()

    # Register the buffer with the NIXL agent to obtain a memory handle.
    # This maps the GPU virtual address range into the RDMA-capable region.
    mem_handle = nixl_agent.register_buffer(
        ptr=kv_flat.data_ptr(),
        size=kv_flat.nbytes,
        device_id=kv_flat.device.index if kv_flat.is_cuda else -1,
    )

    # Construct transfer descriptor: target worker endpoint + request metadata.
    transfer_desc = nixl.TransferDescriptor(
        dst_addr=decode_worker_addr,
        mem_handle=mem_handle,
        tag=request_id,  # Used by decode worker to correlate incoming cache
        size_bytes=kv_flat.nbytes,
    )

    # Initiate non-blocking RDMA write. NIXL manages completion polling internally.
    handle = nixl_agent.initiate_transfer(transfer_desc)
    return handle  # Caller polls handle.wait() before releasing the prefill slot


def await_transfer_completion(
    handle: nixl.TransferHandle,
    timeout_ms: int = 500,
) -> bool:
    """
    Blocks until the NIXL transfer completes or the timeout is reached.
    A False return signals the router to trigger re-planning logic.
    """
    return handle.wait(timeout_ms=timeout_ms)
```
Pro-Tip: Register memory buffers at worker startup, not per-request. NIXL buffer registration carries a one-time overhead of 2–5ms per allocation. Pre-registering a pool of reusable buffers sized to your 99th-percentile KV-cache footprint eliminates this from the critical path.
The metadata tag (request_id) on the transfer descriptor is what the decode worker uses to bind incoming cache data to the correct generation slot. This correlation mechanism must be globally unique and collision-resistant—use ULIDs or UUIDv7 rather than sequential integers.
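At the time of writing, a stdlib `uuid.uuid7` is only available in very recent Python releases, so a minimal hand-rolled UUIDv7 generator (per RFC 9562) is a workable stand-in for request-id generation:

```python
import os
import time
import uuid


def uuid7() -> uuid.UUID:
    """Minimal UUIDv7 (RFC 9562): 48-bit Unix-millisecond timestamp in the
    high bits, random low bits, with version and variant fields set.
    Time-ordered and collision-resistant, unlike sequential integers."""
    ts_ms = time.time_ns() // 1_000_000
    rand = int.from_bytes(os.urandom(10), "big")      # 80 random bits
    value = (ts_ms & ((1 << 48) - 1)) << 80 | rand
    value &= ~(0xF << 76)
    value |= 0x7 << 76                                # version = 7
    value &= ~(0x3 << 62)
    value |= 0x2 << 62                                # variant = RFC 4122/9562
    return uuid.UUID(int=value)
```

Because the timestamp occupies the high bits, these IDs also sort in creation order, which is convenient for correlating transfers in logs.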
## EFA vs Standard Networking: Performance Trade-offs
KV-cache transfer over standard TCP/IP is architecturally incompatible with tight TTFT targets at scale. A single KV-cache for a 32-layer, 4096-token prompt in BF16 precision occupies roughly 512MB–2GB depending on model size. Transmitting this over a 100Gbps Ethernet fabric with TCP/IP stack overhead—kernel copies, socket buffer traversal, ACK round-trips—introduces 8–40ms of serialization latency per transfer depending on congestion state. Under P99 conditions with concurrent transfers from multiple prefill workers, this becomes the dominant latency source. "NIXL with EFA enables high throughput, low-latency inference serving" (Source: AWS China News) because the critical path from GPU memory on the prefill node to GPU memory on the decode node never touches a kernel buffer.
| Transfer Mechanism | Protocol Stack | Typical Latency (512MB KV) | CPU Overhead | Congestion Sensitivity |
|---|---|---|---|---|
| Standard TCP/IP (100GbE) | Full kernel + socket | 35–80ms P99 | High (memcpy, syscalls) | High |
| RDMA over RoCE | Kernel bypass | 8–15ms P99 | Near-zero | Medium |
| EFA + NIXL (RDMA) | Full kernel bypass + zero-copy | 3–8ms P99 | Minimal | Low (SRD protocol) |
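The payload sizes quoted above can be sanity-checked from model shape alone. A minimal calculator (the 32-layer model dimensions below are illustrative, chosen to bracket the 512MB–2GB range):

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, dtype_bytes: int = 2) -> int:
    """KV-cache footprint: K and V tensors per layer, one vector per token
    per KV head, in the given precision (2 bytes for BF16)."""
    return 2 * n_layers * seq_len * n_kv_heads * head_dim * dtype_bytes


# Hypothetical 32-layer models at a 4096-token prompt, BF16:
full_mha = kv_cache_bytes(32, n_kv_heads=32, head_dim=128, seq_len=4096)
gqa = kv_cache_bytes(32, n_kv_heads=8, head_dim=128, seq_len=4096)
print(full_mha // 2**20, "MiB")  # full multi-head attention: 2048 MiB
print(gqa // 2**20, "MiB")       # grouped-query attention, 8 KV heads: 512 MiB
```

Grouped-query attention shrinks the payload linearly in the number of KV heads, which is why the same prompt length spans a 4x range across model families.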
EFA (Elastic Fabric Adapter) on AWS uses the Scalable Reliable Datagram (SRD) transport protocol, which distributes packets across multiple network paths simultaneously and delivers out-of-order completion semantics that NIXL handles natively. The combination eliminates TCP head-of-line blocking—the single largest contributor to P99 latency spikes in multi-tenant inference clusters.
Technical Warning: EFA is a placement-group constraint. Prefill and decode workers must reside in the same EC2 placement group or within the same Availability Zone to operate at rated EFA bandwidth. Cross-AZ EFA traffic falls back to standard networking, silently destroying your latency budget.
## Implementing the llm-d Framework in Kubernetes
"llm-d is a high-performance distributed inference serving stack optimized for production deployments on Kubernetes" (Source: GitHub/llm-d), sitting as an orchestration layer between the inference engine (vLLM being the primary backend) and the Kubernetes scheduler. It exposes the Gateway API Inference Extension (GAIE) for intelligent request routing and manages the lifecycle of prefill and decode worker pools as distinct Kubernetes workloads.
The following manifest defines a decode worker Pod with the NIXL KV-cache sidecar injected alongside the primary vLLM container. Note that Kubernetes v1.33+ stable sidecar support is required for correct startup/shutdown ordering:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: llm-d-decode-worker-0
  labels:
    llm-d.ai/role: decode
    llm-d.ai/model: llama-3-70b
spec:
  # Enforce colocation with prefill workers via topology label for EFA adjacency
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              llm-d.ai/role: prefill
              llm-d.ai/model: llama-3-70b
          topologyKey: topology.kubernetes.io/zone
  initContainers:
    # Sidecar declared as an initContainer with restartPolicy: Always (K8s v1.33+ native sidecar)
    - name: nixl-kvcache-agent
      image: ghcr.io/llm-d/llm-d-aws:v0.4.1
      restartPolicy: Always  # Marks this as a native sidecar; lives for the pod lifetime
      env:
        - name: NIXL_ROLE
          value: "receiver"  # Decode workers receive; prefill workers send
        - name: NIXL_LISTEN_PORT
          value: "18000"
        - name: NIXL_TRANSPORT
          value: "efa"       # Force EFA transport; fail fast if unavailable
        - name: NIXL_BUFFER_POOL_MB
          value: "4096"      # Pre-allocate a 4GB RDMA buffer pool at startup
        - name: NIXL_REQUEST_TIMEOUT_MS
          value: "500"
      resources:
        limits:
          hugepages-2Mi: 2Gi  # RDMA zero-copy requires hugepage backing
          memory: 6Gi
        requests:
          hugepages-2Mi: 2Gi
          memory: 6Gi
      volumeMounts:
        - name: hugepage-vol
          mountPath: /dev/hugepages
  containers:
    - name: vllm-decode
      image: vllm/vllm-openai:v0.6.3
      args:
        - "--model=/models/llama-3-70b"
        - "--tensor-parallel-size=4"
        - "--kv-transfer-agent=nixl"  # vLLM NIXL integration flag
        - "--kv-transfer-port=18000"  # Must match sidecar NIXL_LISTEN_PORT
        - "--disable-prefill"         # This worker ONLY decodes
        - "--max-model-len=8192"
      resources:
        limits:
          nvidia.com/gpu: 4
      volumeMounts:
        - name: model-vol
          mountPath: /models
  volumes:
    - name: hugepage-vol
      emptyDir:
        medium: HugePages-2Mi
    - name: model-vol
      persistentVolumeClaim:
        claimName: llm-model-pvc
```
Pro-Tip: Set `NIXL_TRANSPORT=efa` explicitly and configure a liveness probe on the NIXL agent that validates EFA device availability. Without this, pods scheduled on non-EFA-capable nodes will silently fall back to TCP, and you will not detect the regression until P99 TTFT breaches SLO in production.
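Such a probe can be sketched in a few lines of Python. The sysfs path and `rdmap*` device naming reflect how EFA devices typically surface on EC2 instances, but verify both against your AMI and driver version:

```python
import pathlib
import sys


def efa_devices(sysfs_root: str = "/sys/class/infiniband") -> list[str]:
    """Lists RDMA devices exposed under sysfs. On EFA-capable EC2 instances
    the EFA provider surfaces devices here (typically named rdmap*); an
    empty list means transfers would silently fall back to TCP."""
    root = pathlib.Path(sysfs_root)
    if not root.is_dir():
        return []
    return sorted(p.name for p in root.iterdir())


if __name__ == "__main__":
    devices = efa_devices()
    print(f"RDMA devices: {devices or 'none'}")
    sys.exit(0 if devices else 1)  # nonzero exit fails the liveness probe
```

Wired into the sidecar as an `exec` liveness probe, this turns the silent TCP fallback into a pod restart loop that surfaces in standard Kubernetes alerting.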
## Managing State Persistence and Mid-execution Re-planning
The llm-d router—implemented as an Endpoint Picker (EPP) within the GAIE layer—is responsible for maintaining routing fidelity when a decode worker becomes unavailable mid-generation. "The EPP computes a prefix hash for each request's prompt and routes it to the Pod that already has that prefix cached" (Source: dev.to/x4nent). This prefix-aware routing is the primary mechanism for KV-cache reuse, and it must remain consistent across failover events.
The following illustrates the router's state-check and re-planning logic when a decode worker fails to acknowledge a keepalive during active generation:
```python
import hashlib
import time
from dataclasses import dataclass, field
from enum import Enum


class WorkerState(Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    FAILED = "failed"


@dataclass
class DecodeSession:
    request_id: str
    prefix_hash: str
    assigned_worker: str
    tokens_generated: int
    kv_cache_handle: str  # NIXL transfer handle reference
    last_heartbeat: float = field(default_factory=time.monotonic)


def compute_prefix_hash(prompt: str) -> str:
    """Stable hash for prefix-aware cache locality routing."""
    return hashlib.sha256(prompt.encode()).hexdigest()[:16]


def replan_failed_session(
    session: DecodeSession,
    worker_registry: dict[str, WorkerState],
    kv_cache_store: dict[str, bytes],  # Shared cache backend (e.g., Redis/NVMe)
) -> DecodeSession | None:
    """
    Called when a decode worker fails mid-generation.
    Attempts to migrate the session to a healthy worker with cache reinjection.
    Returns a new DecodeSession on success, None if unrecoverable.
    """
    # Find healthy decode workers, prefer those with matching prefix in local cache
    candidates = [
        addr for addr, state in worker_registry.items()
        if state == WorkerState.HEALTHY and addr != session.assigned_worker
    ]
    if not candidates:
        # No healthy workers available; surface 503 to the client immediately
        return None

    # Prioritize workers that already hold the prefix in their KV buffer
    preferred = [
        addr for addr in candidates
        if session.prefix_hash in get_worker_cache_index(addr)
    ]
    target_worker = preferred[0] if preferred else candidates[0]

    # Reinject KV-cache from the durable store if the target lacks a local copy
    if session.prefix_hash not in get_worker_cache_index(target_worker):
        cached_kv = kv_cache_store.get(session.kv_cache_handle)
        if cached_kv is None:
            # KV-cache was ephemeral and lost; must restart generation from token 0
            return None
        push_kv_to_worker(target_worker, session.kv_cache_handle, cached_kv)

    return DecodeSession(
        request_id=session.request_id,
        prefix_hash=session.prefix_hash,
        assigned_worker=target_worker,
        tokens_generated=session.tokens_generated,  # Resume from last confirmed token
        kv_cache_handle=session.kv_cache_handle,
    )


def get_worker_cache_index(worker_addr: str) -> set[str]:
    """Stub: returns the set of prefix hashes cached on a given worker."""
    return set()  # Implementation-specific cache index lookup


def push_kv_to_worker(worker_addr: str, handle: str, kv_bytes: bytes) -> None:
    """Stub: pushes serialized KV-cache bytes to a worker's NIXL receiver."""
    pass  # Implementation-specific NIXL transfer call
```
The key design decision here is whether KV-cache is ephemeral (GPU-local only) or durable (spilled to NVMe or a shared cache store). Ephemeral cache cuts infrastructure complexity but makes failover a full-restart. Durable cache enables seamless session migration but adds write-path latency on every prefill completion.
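If you choose the durable path, the write-path latency cost can be kept off the prefill critical path with a write-behind spill. A minimal sketch, where `store` is any dict-like backend (a Redis client wrapper, an NVMe-backed KV store, or a plain dict in tests):

```python
import queue
import threading


class DurableKvSpiller:
    """Write-behind spill of completed prefill KV-caches to a durable store.
    spill() enqueues and returns immediately, so prefill completion pays O(1);
    the trade-off is a short window in which failover still means a full
    restart because the durable copy is not yet written."""

    def __init__(self, store):
        self._store = store
        self._q: queue.Queue = queue.Queue()
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def spill(self, handle: str, kv_bytes: bytes) -> None:
        self._q.put((handle, kv_bytes))  # O(1) on the prefill critical path

    def _drain(self) -> None:
        while True:
            item = self._q.get()
            if item is None:              # shutdown sentinel from close()
                self._q.task_done()
                break
            handle, kv_bytes = item
            self._store[handle] = kv_bytes  # durable write happens off-thread
            self._q.task_done()

    def flush(self) -> None:
        self._q.join()  # block until all pending spills are durable

    def close(self) -> None:
        self._q.put(None)
        self._worker.join()
```

The re-planning logic above only needs `flush()` semantics at checkpoint boundaries; everything else stays asynchronous.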
## Optimizing Goodput via Precise Scheduling
Goodput—the rate of useful token generation delivered within SLO bounds—is the primary metric that disaggregation optimizes. The 4.5x goodput improvement cited for disaggregated architectures is not a marketing figure; it is a function of eliminating the multiplicative interference between prefill bursts and decode latency requirements. "Reduce time to first token (TTFT) and get more predictable time per output token (TPOT) by splitting inference into prefill servers handling prompts and decode servers handling responses" (Source: llm-d GitHub). The scheduling implication is that the prefill-to-decode worker ratio must track the actual traffic load, not a static estimate.
$$ G_{\text{disagg}} = G_{\text{mono}} \cdot \frac{1}{1 - \left(\alpha \cdot \delta_P + (1-\alpha) \cdot \delta_D\right)} $$
Where:

- $G_{\text{mono}}$ = baseline goodput under monolithic serving
- $\alpha$ = fraction of GPU time consumed by prefill under the current traffic mix
- $\delta_P$ = fractional compute waste due to decode interference on prefill hardware
- $\delta_D$ = fractional memory-bandwidth waste due to prefill interference on decode hardware
In typical production workloads with input-to-output ratios of 4:1 (long-prompt, short-answer patterns), $\alpha$ can reach 0.6–0.7, and interference terms $\delta_P$, $\delta_D$ compound to degrade effective goodput by 60–75% versus the theoretical maximum. Disaggregation drives both $\delta$ terms toward zero by construction. Implementing an autoscaling policy that adjusts this ratio based on rolling 5-minute TTFT and TPOT percentiles gives you dynamic alignment with traffic shape.
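One proportional-control step of such a policy can be sketched as follows. All thresholds and step sizes here are illustrative assumptions, not tuned values:

```python
def target_pool_sizes(
    p99_ttft_ms: float, p99_tpot_ms: float,
    ttft_slo_ms: float, tpot_slo_ms: float,
    prefill_workers: int, decode_workers: int,
    max_step: int = 2,
) -> tuple[int, int]:
    """One autoscaling step: grow whichever pool is breaching its SLO,
    proportionally to the breach; shrink a pool with large headroom.
    Inputs are the rolling 5-minute P99 percentiles from metrics."""
    def step(p99: float, slo: float, current: int) -> int:
        ratio = p99 / slo
        if ratio > 1.0:                     # breaching: scale out
            return current + min(max_step, max(1, round(current * (ratio - 1))))
        if ratio < 0.5 and current > 1:     # ample headroom: scale in
            return current - 1
        return current
    return (step(p99_ttft_ms, ttft_slo_ms, prefill_workers),
            step(p99_tpot_ms, tpot_slo_ms, decode_workers))
```

Because TTFT drives the prefill pool and TPOT drives the decode pool, the two decisions are independent, which is exactly the property disaggregation buys.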
## Common Pitfalls in Disaggregated Architectures
Disaggregation shifts the bottleneck from GPU utilization to network fabric and orchestration correctness. "Disaggregated serving stacks overwhelmingly rely on network-based KV transfers" (Source: ArXiv 2512.18194)—which means every architectural assumption about network reliability that held in monolithic deployments must be revisited.
Pitfall 1: Static prefill-decode ratios. Deploying a fixed 1:2 prefill-to-decode ratio handles one traffic pattern. When prompt length distribution shifts—a common occurrence with agent-driven workloads generating long context windows—prefill queues back up and TTFT violates SLO despite decode workers sitting idle.
Pitfall 2: Unregistered NIXL buffer pools. Allocating RDMA buffers on the per-request critical path introduces 2–5ms of avoidable latency per transfer. Pre-register pools at pod startup and manage them as a fixed resource with a semaphore-guarded acquisition pattern.
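A minimal sketch of that acquisition pattern, using a thread-safe free list as the implicit semaphore (the `register` callable stands in for the real, expensive NIXL registration step):

```python
from contextlib import contextmanager
from queue import Empty, Queue


class RdmaBufferPool:
    """Fixed pool of pre-registered transfer buffers checked out per request.
    Registration (the 2-5ms step) happens once at startup; the hot path only
    moves a buffer between the free list and the caller."""

    def __init__(self, n_buffers: int, buf_bytes: int, register=bytearray):
        self._free: Queue = Queue()
        for _ in range(n_buffers):
            self._free.put(register(buf_bytes))  # one-time registration cost

    @contextmanager
    def acquire(self, timeout_s: float = 0.5):
        try:
            buf = self._free.get(timeout=timeout_s)  # blocks like a semaphore
        except Empty:
            raise TimeoutError("buffer pool exhausted; shed this transfer")
        try:
            yield buf
        finally:
            self._free.put(buf)  # return to pool; never deregister on hot path
```

Sizing `n_buffers * buf_bytes` to the 99th-percentile KV-cache footprint (as the Pro-Tip above suggests) makes the `TimeoutError` path a genuine overload signal rather than routine noise.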
Pitfall 3: Ignoring EFA placement constraints. Scheduling prefill and decode workers across availability zones without placement group enforcement results in cross-AZ network traversal, destroying the bandwidth and latency characteristics that make NIXL viable.
Critical observability metrics for detecting EFA saturation before it becomes a P99 event:
- EFA Queue Depth (`efa_sq_full_events/sec`): A non-zero send-queue-full rate indicates the NIC is saturated. Any sustained value above 0 means KV transfers are queuing behind the fabric, not the GPU.
- NIXL Buffer Occupancy (`nixl_buffer_pool_utilization`): When pre-registered buffer pool occupancy exceeds 80%, subsequent transfers will either block waiting for free slots or fail with a timeout. This metric predicts transfer failures 30–60 seconds before they manifest as TTFT spikes.
- Cross-node KV-cache transfer latency P99 (`nixl_transfer_latency_p99_ms`): The baseline for EFA-backed NIXL transfers on `p5.48xlarge` instances should be under 10ms for 512MB payloads. P99 values exceeding 25ms indicate fabric congestion or topology violations.
Technical Warning: Standard Kubernetes network monitoring (CNI-level metrics) does not expose EFA-specific queue depth. Deploy the AWS EFA monitoring daemonset and configure Prometheus scraping against the EFA kernel module's sysfs interface to get these signals.
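Where a full exporter is more than you need, the sysfs read itself is small. The `ports/1/hw_counters` layout below matches common RDMA drivers but is an assumption to verify against your EFA driver version:

```python
import pathlib


def read_efa_counters(device: str,
                      root: str = "/sys/class/infiniband") -> dict[str, int]:
    """Reads per-port hardware counters for an RDMA device from sysfs.
    Each counter is a small text file containing one integer; names and
    paths vary by driver version, so treat missing directories as empty."""
    counters: dict[str, int] = {}
    hw = pathlib.Path(root, device, "ports", "1", "hw_counters")
    if not hw.is_dir():
        return counters
    for f in hw.iterdir():
        try:
            counters[f.name] = int(f.read_text().strip())
        except (ValueError, OSError):
            continue  # skip non-numeric or unreadable entries
    return counters
```

Polling this on a short interval and exposing deltas-per-second gives you the queue-depth and saturation signals the CNI layer cannot see.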
## Operationalizing the Future of Inference
The total cost of ownership argument for disaggregated serving is structural, not incremental. Monolithic inference forces operators to provision all hardware to the peak-demand profile of the most resource-intensive phase. Disaggregation allows GPU resources to be allocated with surgical precision: smaller, cheaper memory-bandwidth-optimized instances for decode pools; denser compute instances for prefill, scaled transiently based on arrival rate. "This integration enhances disaggregated inference serving through three key improvements: increased KV-cache throughput, reduced inter-token latency, and optimized KV-cache memory utilization" (Source: AWS News).
At production scale—thousands of concurrent sessions, mixed prompt length distributions, SLA-bound enterprise traffic—the compounding effect of eliminating phase interference, enabling independent autoscaling, and routing with cache locality awareness yields both performance improvements (4.5x goodput and measurably lower P99 TTFT) and cost reductions that compound as model sizes and context lengths grow. The engineering investment is real: Kubernetes v1.33+, EFA-capable instance types, NIXL integration, and careful observability instrumentation. But the alternative—continuing to colocate fundamentally incompatible workloads on shared hardware—scales worse with every generation of larger models and longer context windows. Disaggregation is not an optimization; it is the correct architectural primitive for inference at scale.