Bottom Line: KV cache offload to CPU DRAM is a spend-shift, not a free win. When HBM fragmentation or long-context pressure caps concurrency and causes OOMs, offloading relieves that pressure and can deliver meaningful TTFT and throughput improvements — LMCache reports 3–10× delay savings in reuse-heavy workloads like multi-round QA and RAG. The cost side is real: every offloaded cache block crosses a PCIe or NVLink hop, tail latency rises under burst traffic, and operating a tiered memory path adds observability surface, queueing logic, and rollback complexity that nobody budgets for. As NVIDIA frames it plainly: "Holding large KV Caches in GPU memory for long durations is not scalable and forces providers to choose between cost, latency, and capability." The decision belongs to teams whose constraint is HBM capacity, whose workloads have high context-reuse rates, and whose SLOs can absorb variable TTFT. Teams with hard latency SLOs, short-context or low-reuse traffic, or minimal memory pressure should evaluate a larger HBM SKU before adding an offload tier.
What problem KV cache offload actually solves in production
KV cache offload moves the key-value tensors generated during the attention prefill phase from GPU HBM to CPU DRAM or farther-out storage tiers. Its value in production is narrower than marketing often implies: it is a memory-capacity mechanism, not a compute accelerator.
In a standard PagedAttention-backed serving stack, KV blocks occupy HBM for the lifetime of a request. At long context lengths — 32K, 64K, 128K tokens — a single Llama 3.1 70B request can claim tens of gigabytes of HBM before generating a single output token. When concurrent requests compete for those blocks, the scheduler either queues new arrivals or preempts in-flight requests, both of which degrade throughput. Offload breaks that deadlock by evicting cold blocks to a cheaper tier and reloading them when the attention computation needs them again.
Where offload pays is in reuse-heavy workloads: multi-round QA, RAG pipelines with repeated shared prefixes, and agentic loops that re-enter the same document context. LMCache's documentation states that "by offloading KV caches, LMCache can reduce both time-to-first-token (TTFT) and GPU cycles". The gains come not from the offload itself but from the resulting ability to serve more concurrent requests with the available HBM budget — and to reload a previously computed cache block rather than recomputing the prefill from scratch.
Pro Tip: Memory relief and latency improvement are distinct effects. Offload always relieves HBM pressure. It only improves TTFT when a reusable KV block is already resident in the offload tier — if the block must be recomputed anyway, the PCIe transfer overhead is pure cost with no compensating latency benefit.
Why HBM, not compute, is usually the first constraint
On an NVIDIA H100, compute throughput scales with batch size, but HBM capacity is fixed: 80 GB of HBM3 on the SXM5 part and 94 GB on the H100 NVL (the 141 GB figure belongs to the H200). Model weights for Llama 3.1 70B in BF16 consume roughly 140 GB, requiring at least two 80 GB H100s; on that two-GPU setup, the HBM left after weights is about 20 GB at most, before activations and runtime overhead (more with additional GPUs or quantized weights), and it is the total KV budget across all concurrent requests. At 32K context, a single 70B request can consume 8–12 GB of that budget, capping the two-GPU case at roughly two simultaneous long-context requests before the scheduler starts queuing.
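The per-request figure above can be reproduced with a back-of-envelope calculation. The sketch below assumes the published Llama 3.1 70B shape (80 layers, 8 grouped-query KV heads, head dimension 128) and 2-byte BF16 KV entries; those values and the function names are illustrative and not taken from the text above, so substitute your own model config.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes of KV cache per token: one key and one value vector per layer per KV head."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes


def kv_gib_per_request(context_tokens: int, per_token_bytes: int) -> float:
    """KV footprint of a single request at the given context length, in GiB."""
    return context_tokens * per_token_bytes / 2**30


# Assumed Llama 3.1 70B architecture (not stated in the article).
PER_TOKEN = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128)


def max_concurrent(kv_budget_gib: float, context_tokens: int, per_token_bytes: int = PER_TOKEN) -> int:
    """How many requests of this length fit in the KV budget, ignoring fragmentation."""
    return int(kv_budget_gib // kv_gib_per_request(context_tokens, per_token_bytes))


if __name__ == "__main__":
    for ctx in (4_096, 32_768, 131_072):
        print(f"{ctx:>7} tokens -> {kv_gib_per_request(ctx, PER_TOKEN):.2f} GiB")
    print("requests that fit in 20 GiB at 32K:", max_concurrent(20, 32_768))
```

Under these assumptions a 32K-token request occupies about 10 GiB, which is consistent with the 8–12 GB range quoted above, and a 20 GiB budget fits two such requests.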
That ceiling is a memory problem, not a FLOP problem. The GPU could process larger batches if the KV blocks fit. CPU DRAM on the same host — typically 512 GB to 2 TB in a dense GPU server — dwarfs the HBM budget by one to two orders of magnitude, which is why the offload direction exists.
Watch Out: Do not conflate model weight size with KV growth. Weights are static and load once; KV cache grows linearly with sequence length and batch size simultaneously. A workload that runs fine at 4K context can OOM at 32K even though the model is identical. Profile KV memory under your actual context-length distribution before sizing either your HBM budget or your offload tier.
How long context and multi-turn chat inflate cache residency
Multi-turn chat and agent workloads create a compounding residency problem. Each additional turn appends tokens to the existing sequence, and the KV tensors for all prior turns must remain accessible until the conversation ends. At turn 10 of a 2K-token-per-turn chat, the KV residency is equivalent to a 20K-token single-shot request — and the user has not left the session.
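To make the compounding effect concrete, the sketch below counts how many tokens pass through prefill over a conversation with and without reuse of prior-turn KV. It is an idealized model (fixed tokens per turn, every prior block available on reuse), not a reproduction of any LMCache benchmark, and the function name is hypothetical.

```python
def prefill_tokens(turns: int, tokens_per_turn: int, reuse_prior_kv: bool) -> int:
    """Total tokens pushed through prefill across all turns of one conversation."""
    total = 0
    for turn in range(1, turns + 1):
        if reuse_prior_kv:
            # Prior turns are served from cached KV; only the new turn is computed.
            total += tokens_per_turn
        else:
            # The whole history so far is recomputed on every turn.
            total += turn * tokens_per_turn
    return total


if __name__ == "__main__":
    cold = prefill_tokens(10, 2_000, reuse_prior_kv=False)
    warm = prefill_tokens(10, 2_000, reuse_prior_kv=True)
    print(f"no reuse: {cold} tokens, with reuse: {warm} tokens, ratio {cold / warm:.1f}x")
```

For the 10-turn, 2K-per-turn chat described above, the no-reuse path prefills 110,000 tokens against 20,000 with reuse. That ratio is a count of prefill work only; it does not include the reload cost of bringing cached blocks back into HBM.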
KV cache offload delivers its highest value here precisely because prior-turn KV blocks are candidates for eviction to DRAM between user messages and reloaded when the next prefill begins. LMCache explicitly calls out multi-round QA and RAG as the use cases where this reuse pattern yields 3–10× delay savings and GPU cycle reduction. The benchmarking documentation reinforces the conditional: "without offloading, even with KV Cache reuse, there is no improvement in TTFT nor throughput" on the second query round — meaning that cache reuse requires the offload infrastructure to materialize.
Pro Tip: The 3–10× delay savings LMCache cites is tied to repeated context overlap. Measure your prompt reuse rate — the fraction of incoming requests that share a significant prefix with a cached block — before committing to offload infrastructure. A reuse rate below 20% produces marginal gains that rarely justify the operational overhead.
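One way to measure prompt reuse rate offline is to replay a log of tokenized prompts and check how much of each prompt's leading blocks were already seen. The sketch below is an assumption-laden estimate: the block size, the 50% "significant prefix" cutoff, and the never-evicting cache are illustrative choices, so the result is an upper bound on what a real cache would achieve.

```python
import hashlib
from typing import Iterable, List, Sequence, Set


def prefix_block_hashes(tokens: Sequence[int], block_size: int) -> List[str]:
    """Chained hashes of full blocks; hash i covers tokens[0 : (i + 1) * block_size]."""
    hashes, running = [], hashlib.sha256()
    for start in range(0, len(tokens) - block_size + 1, block_size):
        running.update(repr(tuple(tokens[start:start + block_size])).encode())
        hashes.append(running.copy().hexdigest())
    return hashes


def prompt_reuse_rate(
    requests: Iterable[Sequence[int]],
    block_size: int = 16,              # assumed block granularity
    min_shared_fraction: float = 0.5,  # assumed cutoff for a "significant" prefix
) -> float:
    """Fraction of requests whose leading blocks were mostly seen in earlier requests."""
    seen: Set[str] = set()  # unbounded cache: no eviction is modeled
    reused = total = 0
    for tokens in requests:
        hashes = prefix_block_hashes(tokens, block_size)
        matched = 0
        for digest in hashes:
            if digest not in seen:
                break  # prefix match ends at the first unseen block
            matched += 1
        if hashes and matched / len(hashes) >= min_shared_fraction:
            reused += 1
        seen.update(hashes)
        total += 1
    return reused / total if total else 0.0
```

Feeding this a day of production prompts (tokenized with the serving model's tokenizer) gives a first-order reuse number to compare against the thresholds discussed in this article.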
The production economics: what you save on GPU memory versus what you pay elsewhere
Framing offload as "free memory" is the most common misreading of this feature. NVIDIA's own Dynamo documentation describes a tiered hierarchy spanning HBM, CPU DRAM, SSDs, and networked storage, and positions each tier as a cost-capacity substitution, not a cost-free expansion. The table below maps the trade across production variables; all cost ranges are qualitative estimates because no single official source publishes a unified ROI model for this decision.
| Dimension | GPU-only (HBM) | Host-memory offload (CPU DRAM) |
|---|---|---|
| Memory unit cost | High (HBM embedded in GPU) | Low–moderate (DDR5 DRAM) |
| Capacity per host | 80–94 GB/GPU (H100 SXM5 and NVL); 141 GB/GPU (H200) | 512 GB–2 TB/host |
| Read latency | ~1–2 µs (HBM bandwidth) | ~10–50 µs + PCIe Gen5 transfer |
| TTFT impact | Baseline | +latency when block not hot in DRAM; −latency on cache hit vs. recompute |
| OOM risk | High at long context | Reduced; spills to DRAM rather than preempting |
| Operational complexity | Low (single memory tier) | Moderate–high (eviction policy, transfer queue, observability) |
| Engineering setup time | Near zero | Estimated 2–6 weeks for production-grade integration |
The GPU-memory spend avoided is real: teams running 128K-context workloads on H100s can serve materially more concurrent sessions without adding GPU nodes. The cost surfaces elsewhere — PCIe bandwidth consumed by evict/reload cycles, DRAM capacity provisioned per host, and engineering time to operate the eviction and reload path correctly.
What changes in capacity planning when KV cache moves to host memory
When KV cache spans two tiers, capacity planning becomes a two-variable sizing problem. The HBM budget must hold enough hot blocks to satisfy current-batch attention without stalling compute; the CPU DRAM budget must hold enough cold blocks to avoid evicting to SSD, which would introduce milliseconds of latency rather than microseconds.
| Tier | Sizing driver | Rule of thumb |
|---|---|---|
| HBM (hot cache) | Active-batch KV at peak concurrency | Target 60–70% HBM utilization after weights; leave headroom for prefill spikes |
| CPU DRAM (warm cache) | Cold-block retention for reuse-heavy sessions | Size to 4–8× the HBM KV budget as a starting point; tune by measuring eviction hit rate |
| SSD / networked (cold cache) | Long-tail session persistence | Only viable if latency SLO allows 1–5 ms reload; most interactive workloads cannot |
NVIDIA's Dynamo KV Cache Manager explicitly describes this hierarchy as a tiered strategy "like CPU host memory, SSDs, or networked storage", reinforcing that the sizing model must account for all occupied tiers simultaneously. Teams upgrading from single-tier HBM serving need to re-baseline their host server specs: CPU DRAM slots, DRAM speed, and NUMA topology all become first-class capacity variables.
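The rules of thumb in the table can be turned into a first-pass sizing calculation. The sketch below reads "60–70% HBM utilization after weights" as the share of post-weights HBM given to KV, which is one interpretation of the table; the function and field names are hypothetical, and the output is a starting point to tune against measured eviction hit rate.

```python
from dataclasses import dataclass


@dataclass
class TierPlan:
    hbm_kv_gib: float     # hot-tier KV budget
    dram_low_gib: float   # warm tier at the low multiple
    dram_high_gib: float  # warm tier at the high multiple


def plan_tiers(
    total_hbm_gib: float,
    weights_gib: float,
    hbm_utilization: float = 0.65,   # midpoint of the 60-70% guidance
    dram_multiples: tuple = (4, 8),  # the 4-8x starting range
) -> TierPlan:
    """First-pass HBM and DRAM KV budgets from the article's rules of thumb."""
    free_gib = total_hbm_gib - weights_gib
    if free_gib <= 0:
        raise ValueError("weights do not fit in HBM; offload does not address this")
    hbm_kv = free_gib * hbm_utilization
    low, high = dram_multiples
    return TierPlan(hbm_kv, hbm_kv * low, hbm_kv * high)


if __name__ == "__main__":
    # Two 80 GB GPUs holding roughly 140 GB of BF16 weights.
    print(plan_tiers(total_hbm_gib=160, weights_gib=140))
```

For the two-GPU 70B example this yields about 13 GiB of hot KV and a 52–104 GiB warm tier, well inside the DRAM of a typical dense GPU host.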
How to estimate the latency tax from PCIe or NVLink hops
Every KV block reload from CPU DRAM to HBM crosses an interconnect. On PCIe Gen5 x16, peak bandwidth is approximately 64 GB/s per direction; on NVLink 4.0 (GPU-to-GPU), peak aggregate bandwidth is 900 GB/s per GPU. NVLink between GPUs does not help host-memory offload, which on a conventional x86 host traverses the CPU–GPU PCIe link; coherent CPU–GPU designs such as Grace Hopper (NVLink-C2C) are the exception.
NVIDIA acknowledges this explicitly: the cost-latency-capability trade-off is inherent to the decision, not a pathology to be engineered away. The latency tax scales with:
- Block size at eviction time (longer context → larger blocks → higher transfer cost)
- Concurrency at reload time (multiple simultaneous reloads saturate PCIe bandwidth)
- NUMA distance between the GPU and the DRAM DIMM hosting the evicted block
No official source publishes a single P99 penalty number because the tax is topology-dependent. Teams must measure it in their actual hardware configuration under representative traffic — not extrapolate from a single-GPU benchmark.
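A bandwidth-only lower bound is still useful for deciding whether a measurement is worth running. The sketch below divides the bytes to reload by the effective link bandwidth, assuming concurrent reloads share the link evenly; the 0.8 efficiency factor is an assumed placeholder, NUMA distance is not modeled, and real numbers must come from your own hardware.

```python
def reload_seconds(
    kv_bytes: float,
    link_gb_per_s: float = 64.0,  # PCIe Gen5 x16 peak, per direction
    efficiency: float = 0.8,      # assumed achievable fraction of peak
    concurrent_reloads: int = 1,  # reloads sharing the link at the same time
) -> float:
    """Lower-bound transfer time for one reload over a shared host-to-GPU link."""
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    if concurrent_reloads < 1:
        raise ValueError("concurrent_reloads must be at least 1")
    effective_bytes_per_s = link_gb_per_s * 1e9 * efficiency / concurrent_reloads
    return kv_bytes / effective_bytes_per_s


if __name__ == "__main__":
    ten_gib = 10 * 2**30  # roughly one 32K-token 70B request
    for n in (1, 2, 4):
        print(f"{n} concurrent reload(s): {reload_seconds(ten_gib, concurrent_reloads=n) * 1e3:.0f} ms each")
```

Under these assumptions a 10 GiB reload takes on the order of 0.2 s alone and several times that when four reloads contend, which is why burst concurrency dominates the tail.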
Watch Out: Tail latency under burst traffic is the sharpest edge of this trade-off. When request concurrency spikes, multiple sessions may trigger simultaneous cache reloads, and PCIe bandwidth becomes a serialization bottleneck.
The engineering cost nobody budgets for
Operating a tiered KV cache path in production means operating three additional systems that did not exist before: an eviction policy engine, a transfer queue that must not starve the serving path, and observability instrumentation that makes cache-hit rates and reload latencies visible.
The vLLM Production Stack frames remote KV cache sharing via LMCache as a way to "enable more KV cache hits and potentially mak[e] the deployment more fault tolerant." That framing is accurate, but it also implies additional failure domains: if the LMCache layer crashes or becomes slow, serving degrades in ways that are harder to diagnose than a simple GPU OOM.
Pro Tip: Before enabling offload in production, instrument your serving stack to expose KV cache hit rate, reload latency by percentile, and PCIe utilization per GPU. Without these metrics, you cannot distinguish a well-tuned offload path from one that is silently degrading under load. Plan for a platform engineering pilot period before your first production rollout.
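As a minimal sketch of the first two metrics, the class below accumulates cache lookups and reload latencies in process and reports hit rate and percentiles. The class and method names are hypothetical and not part of vLLM or LMCache; in a real deployment these values would be exported to your metrics system, and PCIe utilization would come from a separate hardware-level collector.

```python
import statistics
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class OffloadStats:
    hits: int = 0
    misses: int = 0
    reload_ms: List[float] = field(default_factory=list)

    def record_lookup(self, hit: bool, reload_latency_ms: Optional[float] = None) -> None:
        """Record one cache lookup; hits may carry the time spent reloading the block."""
        if hit:
            self.hits += 1
            if reload_latency_ms is not None:
                self.reload_ms.append(reload_latency_ms)
        else:
            self.misses += 1

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

    def reload_percentiles(self) -> Dict[str, float]:
        """P50/P95/P99 of recorded reload latencies (needs at least two samples)."""
        if len(self.reload_ms) < 2:
            raise ValueError("need at least two reload samples")
        cuts = statistics.quantiles(self.reload_ms, n=100, method="inclusive")
        return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
```

Tracking the percentiles separately from the mean matters here because the failure mode described above shows up in P99 long before it moves the average.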
Where the ecosystem draws the line: vLLM, LMCache, and host-memory offload paths
The major serving frameworks treat host-memory offload as one option among several memory-scaling strategies, not a default capability. The table below maps conceptual positioning rather than a feature-completeness score, since no official source provides a unified benchmark across these systems.
| System | Offload approach | Cache sharing | Disaggregated prefill | Maturity signal |
|---|---|---|---|---|
| vLLM 0.7.3 (native) | kv_offloading_size → native CPU backend | Single-instance only | No | Stable docs, activates on explicit config |
| vLLM Production Stack | LMCache integration for remote KV sharing | Multi-instance, shared store | Supported via LMCache | Production-stack docs published |
| LMCache (standalone) | CPU offload + remote store backends | Cross-instance KV sharing | Full disaggregated prefill | Active development, 3–10× gains cited for reuse workloads |
What vLLM's docs and production stack emphasize
The vLLM Production Stack frames its LMCache integration around two production properties: cache-hit rate improvement and fault tolerance. The remote KV cache sharing docs describe how "remote KV cache sharing moves large KV caches from GPU memory to a remote shared storage, enabling more KV cache hits and potentially making the deployment more fault tolerant." Fault tolerance here means that a replica failure does not invalidate all cached KV state — the remote store persists it.
LMCache as an integration layer extends vLLM's native offload capability with cross-instance sharing, making it relevant for multi-replica deployments where the same document context might be served by any node in the fleet. The vLLM docs describe the LMCache folder as demonstrating "disaggregated prefilling, CPU offloading and KV cache sharing" as distinct use cases — teams should pick the mode that matches their actual bottleneck rather than enabling all three simultaneously.
Pro Tip: The vLLM native offload backend is the lowest-complexity entry point. Use it for a single-instance pilot to measure the latency tax in your topology before introducing the LMCache remote-store path, which adds network hops and a separate service to operate.
Why some stacks prefer disaggregated prefill or cache sharing instead
Host-memory offload is not the only way to manage KV pressure, and for some workloads it is the wrong choice. The vLLM Production Stack documents disaggregated prefill as a separate deployment pattern: prefill computation runs on dedicated nodes with large HBM budgets, and only the resulting KV tensors are transferred to decode nodes. This eliminates the prefill-decode interference problem entirely — no eviction, no reload, no PCIe round-trip on the hot path.
| Strategy | Memory pressure relief | Latency on cache miss | Cross-instance reuse | Operational complexity |
|---|---|---|---|---|
| KV cache offload (CPU DRAM) | High | Adds PCIe reload latency | No (native); Yes (LMCache) | Moderate |
| Disaggregated prefill | High | Low (no eviction cycle) | No (KV stays local to decode node) | High (separate prefill cluster) |
| Prefix/radix cache sharing | Moderate | Low (reuse in HBM) | Yes (shared store) | Low–moderate |
KV cache offload is the right choice when memory capacity is the binding constraint and the workload has high reuse rates. Disaggregated prefill is preferable when prefill latency — not memory capacity — is the SLO bottleneck. Prefix cache sharing is the lowest-complexity option when HBM is not exhausted but recomputing shared prefixes wastes GPU cycles.
Workload patterns where KV cache offload makes sense
The answer to "is offloading KV cache to host memory better than increasing GPU memory?" is workload-conditional. LMCache's evidence base for the 3–10× gains covers multi-round QA and RAG — both workloads defined by high prompt-overlap ratios and repeated context reuse. NVIDIA's framing covers the capacity case: when HBM is exhausted and adding GPU nodes is the only alternative, offload to CPU DRAM defers that hardware spend.
| Workload pattern | Latency tolerance | Typical context length | Reuse rate | Offload verdict |
|---|---|---|---|---|
| Multi-turn chat (conversational AI) | Moderate (P95 < 2s TTFT) | 8K–128K | High | Strong candidate |
| RAG with shared document corpus | Moderate–low | 16K–64K | High (corpus prefix) | Strong candidate |
| Agentic loops (tool-call chains) | Low–moderate | 32K–200K | Moderate | Candidate with monitoring |
| Single-shot code completion | Low (P95 < 500ms) | 2K–8K | Low | Unlikely to justify offload |
| Batch summarization / offline jobs | High (minutes) | 8K–128K | Variable | Good candidate if GPU-bound |
| Real-time low-latency API | Very low | 1K–4K | Low | Avoid; buy more HBM |
Long-context chat and agent workloads
Multi-round QA and agentic loops are the workloads where KV cache offload delivers its documented gains. In a multi-turn chat session, prior-turn KV blocks are computed once and can be reused across all subsequent turns in the same session. Without offload, those blocks compete with new-session prefills for HBM, causing preemptions that destroy the reuse value entirely. With offload, cold blocks from idle sessions evict to DRAM, freeing HBM for active sessions, and reload when a returning user triggers the next turn.
LMCache explicitly maps this pattern to its 3–10× delay savings claim for multi-round QA. The gain materializes only when the reuse rate is high — single-shot or highly diverse prompts with little overlap cannot benefit from cache reuse regardless of where the cache resides.
Burst-heavy, memory-fragmented serving environments
When HBM fragmentation — not raw capacity — is the root cause of concurrency limits, KV cache offload can unblock scheduling by giving the eviction policy a destination for stranded blocks. NVIDIA's Dynamo release positions offload as enabling "the instant transfer of KV Cache from limited GPU memory to larger cost-efficient storage," targeting exactly this fragmentation-driven pressure.
| Traffic shape | Offload benefit | Risk |
|---|---|---|
| Steady moderate concurrency | Reliable HBM relief | Low tail-latency risk |
| Burst-heavy with idle troughs | High HBM relief during bursts | PCIe saturation at burst peaks |
| Highly skewed request sizes | Moderate; large requests dominate eviction | Thrashing between tiers |
As a decision summary:
| Decision | When to choose offload | When to avoid it |
|---|---|---|
| Strong candidate | Fragmentation blocks scheduling and repeated sessions can reuse cached prefixes | Bursts are short, reuse is low, or the SLO cannot absorb reload variance |
| Weak candidate | None | Traffic is mostly single-shot or the host lacks DRAM headroom |
Watch Out: Under load spikes, offload can produce fragile wins. Multiple simultaneous reloads compete for PCIe bandwidth, and if the eviction policy is too aggressive, blocks evicted during a burst may be immediately needed again, creating a thrashing loop that raises both TTFT and GPU utilization without improving throughput.
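The thrashing loop is easy to reproduce in a toy model. The sketch below simulates a hot tier managed by LRU eviction (an illustrative policy choice, not necessarily what your serving stack uses) and shows how a working set just one block larger than the hot tier collapses the hit rate.

```python
from collections import OrderedDict
from typing import Iterable


def lru_hit_rate(accesses: Iterable[int], capacity: int) -> float:
    """Hit rate of an LRU-managed hot tier holding `capacity` blocks over an access trace."""
    hot: "OrderedDict[int, None]" = OrderedDict()
    hits = total = 0
    for block in accesses:
        total += 1
        if block in hot:
            hits += 1
            hot.move_to_end(block)       # mark as most recently used
        else:
            if len(hot) >= capacity:
                hot.popitem(last=False)  # evict the least recently used block
            hot[block] = None            # a miss implies a reload or recompute
    return hits / total if total else 0.0


if __name__ == "__main__":
    trace = list(range(5)) * 20  # five sessions revisited round-robin
    print("capacity 5:", lru_hit_rate(trace, capacity=5))
    print("capacity 4:", lru_hit_rate(trace, capacity=4))
```

With room for all five blocks, only the first pass misses; with room for four, every access evicts the block that will be needed next, so each request pays a reload and none benefits from residency.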
When a bigger GPU is cheaper than an offload path
For workloads with hard sub-500ms TTFT SLOs, short context lengths, or low reuse rates, the engineering cost of operating an offload path almost certainly exceeds the cost of provisioning more HBM. NVIDIA's own framing acknowledges the trade explicitly: the decision is between "cost, latency, and capability" — and when latency is non-negotiable, adding a slower memory tier is the wrong direction.
| Scenario | Prefer more HBM | Prefer CPU DRAM offload |
|---|---|---|
| Hard P99 TTFT SLO < 500ms | ✓ | ✗ |
| Context length < 8K, low reuse | ✓ | ✗ |
| Team bandwidth limited | ✓ | ✗ |
| 128K+ context, high reuse, cost-sensitive | ✗ | ✓ |
| Multi-instance fleet, shared corpus | ✗ | ✓ (with LMCache remote store) |
The H200 at 141 GB HBM3e offers roughly 1.75× the capacity of an 80 GB H100 SXM5, and because weights are a fixed cost, the KV budget grows by more than that ratio. For teams whose memory pressure stems from a handful of long-context sessions rather than fleet-wide concurrency, upgrading the GPU SKU is often a single-ticket change versus a multi-week offload integration project.
Adoption checklist for a production team
Adopting KV cache offload in production requires platform readiness across four dimensions: observability, queueing policy, rollback safety, and failure-domain isolation. The vLLM Production Stack's framing of LMCache as a mechanism for "potentially making the deployment more fault tolerant" is accurate — but fault tolerance of the cache store does not simplify the serving path's own failure modes.
| Readiness dimension | Minimum before pilot | Target before full rollout |
|---|---|---|
| Observability | KV cache hit rate metric; PCIe utilization per GPU | P50/P95/P99 reload latency; eviction frequency; tier-miss rate |
| Queueing policy | Explicit eviction priority rule | Adaptive policy tuned to context-length distribution |
| Rollback path | kv_offloading_size disabled → native HBM fallback | Canary deployment with traffic split and automated SLO comparison |
| Failure isolation | LMCache service restart does not block serving | LMCache failure gracefully degrades to no-reuse, not to hard failure |
Production Note: Instrument the offload path before enabling it in production traffic. Without KV cache hit rate and reload latency visibility, you cannot distinguish a correctly functioning offload path from one that is silently adding latency due to NUMA misalignment, PCIe contention, or an eviction policy that thrashes under your traffic shape. Validate the rollback path in staging: disabling kv_offloading_size must restore baseline latency without a serving restart.
Signals that justify a pilot
A pilot is justified when the bottleneck is clearly KV-driven HBM exhaustion rather than compute or weight-loading pressure:
| Signal | Pilot recommendation |
|---|---|
| OOM events on GPU correlate with long-context or multi-turn sessions, not with batch size alone | Start a pilot; the bottleneck is likely KV capacity |
| The KV cache occupancy metric shows greater than 80% utilization during peak traffic windows | Start a pilot; HBM headroom is already thin |
| Request queuing latency increases proportionally with context length, not with output token count | Start a pilot; the scheduler is memory-bound |
| CPU DRAM headroom on the existing host is 4× or more of current peak HBM KV usage | Start a pilot; the host tier can absorb cold blocks |
| Workload prompt-reuse rate exceeds 30% (indicating offloaded blocks will be reloaded, not discarded) | Start a pilot; reuse is high enough to justify the tier |
| Adding a GPU node to the cluster would cost more than 6 months of engineering time at current team capacity | Start a pilot; offload may defer GPU expansion |
Signals that say no
Offload is the wrong choice when the following conditions hold:
| Signal | Why it disqualifies offload |
|---|---|
| TTFT SLO < 300ms P99 | PCIe reload latency alone can exceed this budget under moderate concurrency |
| Context length < 8K, single-shot | KV blocks are small and short-lived; HBM pressure is minimal; offload adds overhead without memory relief |
| Prompt reuse rate < 15% | Blocks are evicted before reuse; gains vanish; only the overhead remains |
| Team already managing > 3 serving-stack integrations | Operational complexity cost exceeds capacity benefit at current team size |
| Memory pressure is from model weights, not KV cache | Offload does not move weights; the constraint is orthogonal |
Watch Out: KV cache offload provides no throughput improvement on workloads without meaningful context reuse. LMCache's benchmarking documentation states one half of the dependency: "without offloading, even with KV Cache reuse, there is no improvement in TTFT nor throughput". The inverse is equally true: offloading without reuse produces latency cost with no compensating benefit.
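The two signal tables can be encoded as a simple screening function. The sketch below uses the thresholds quoted above (300 ms P99 TTFT, 8K context, 15% and 30% reuse, 80% KV occupancy, 4× DRAM headroom); the rule that all three supporting signals must hold for a "pilot" verdict is an assumption added for illustration, and the type and field names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class WorkloadProfile:
    ttft_slo_p99_ms: float
    typical_context_tokens: int
    single_shot: bool
    prompt_reuse_rate: float       # 0..1
    peak_kv_occupancy: float       # 0..1 of the HBM KV budget
    dram_headroom_multiple: float  # free host DRAM / peak HBM KV usage
    pressure_from_weights: bool


def offload_verdict(p: WorkloadProfile) -> Tuple[str, List[str]]:
    """Return ('avoid' | 'pilot' | 'investigate', reasons) from the article's signals."""
    blockers = []
    if p.ttft_slo_p99_ms < 300:
        blockers.append("P99 TTFT SLO under 300 ms")
    if p.typical_context_tokens < 8_000 and p.single_shot:
        blockers.append("short, single-shot context")
    if p.prompt_reuse_rate < 0.15:
        blockers.append("prompt reuse below 15%")
    if p.pressure_from_weights:
        blockers.append("memory pressure comes from weights, not KV cache")
    if blockers:
        return "avoid", blockers

    supports = []
    if p.peak_kv_occupancy > 0.80:
        supports.append("KV occupancy above 80% at peak")
    if p.dram_headroom_multiple >= 4:
        supports.append("host DRAM headroom of 4x or more")
    if p.prompt_reuse_rate > 0.30:
        supports.append("prompt reuse above 30%")
    # Assumed aggregation rule: all three supporting signals justify a pilot.
    return ("pilot" if len(supports) == 3 else "investigate"), supports
```

A screening function like this is only as good as the measurements behind its inputs, so it works best as a checklist gate after the profiling steps described earlier.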
FAQ
What is KV cache offloading?
KV cache offloading moves the key-value tensors produced during attention prefill from GPU HBM to a lower-cost memory tier — typically CPU DRAM, and optionally SSD or networked storage. As LMCache defines it: "KV cache offloading allows you to move KV caches from GPU memory to CPU memory or other storage devices." The mechanism enables serving stacks to hold more concurrent session state than HBM alone can fit, at the cost of reload latency when a block is accessed from the offload tier.
Does KV cache offloading improve throughput?
Conditionally yes. Offload improves throughput when HBM capacity — not compute — is the concurrency ceiling and when the workload has sufficient prompt reuse to make reloaded blocks worthwhile. In multi-round QA and RAG, LMCache reports 3–10× delay savings. In single-shot, low-reuse workloads, offload provides no throughput improvement and adds TTFT overhead.
What are the disadvantages of KV cache offloading?
Three categories of cost dominate. First, interconnect latency: every reload crosses a PCIe Gen5 hop, adding tens to hundreds of microseconds per block under load. Second, tail-latency amplification: burst traffic triggers simultaneous reloads that saturate PCIe bandwidth, driving P99 TTFT spikes. Third, operational complexity: an eviction policy, transfer queue, and cache-tier observability stack must be designed, deployed, and maintained — costs that do not appear in "enable offload" configuration guides.
Is offloading KV cache to host memory better than increasing GPU memory?
For workloads with hard latency SLOs or low reuse rates, buying more HBM is almost always the right answer. Offload wins on total cost of ownership when context lengths are long (32K+), reuse rates are high (30%+), and the team has platform engineering capacity to operate the tiered path. The vLLM native backend (kv_offloading_size, native) makes single-instance piloting accessible; the LMCache remote-store path enables cross-instance sharing at the cost of an additional service in the deployment graph.
How do I enable KV cache offload in vLLM?
The vLLM stable config exposes two offload backends: native (built-in CPU offloading) and lmcache (LMCache integration). Offloading activates only when kv_offloading_size is explicitly set — there is no silent default. Start with the native backend at a conservative kv_offloading_size and measure reload latency before switching to the LMCache backend for cross-instance sharing.
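A small helper can keep the "explicit opt-in" behavior visible in your own deployment code. The sketch below only builds a settings dictionary; kv_offloading_size and the native and lmcache backend names come from the text above, while the kv_offloading_backend key, the GiB unit, and the helper itself are assumptions to verify against the Cache Config API reference for your installed vLLM version.

```python
from typing import Dict, Optional, Union

SUPPORTED_BACKENDS = ("native", "lmcache")  # the two backends named in the article


def offload_settings(
    size_gib: Optional[float],
    backend: str = "native",
) -> Dict[str, Union[float, str]]:
    """Build cache-config overrides; returns {} when offload is not requested."""
    if size_gib is None:
        return {}  # no size set means offload stays off, with no silent default
    if size_gib <= 0:
        raise ValueError("offload size must be positive")
    if backend not in SUPPORTED_BACKENDS:
        raise ValueError(f"unknown backend {backend!r}; expected one of {SUPPORTED_BACKENDS}")
    return {
        "kv_offloading_size": size_gib,    # unit assumed to be GiB; confirm in the docs
        "kv_offloading_backend": backend,  # key name is an assumption, check your version
    }


if __name__ == "__main__":
    print(offload_settings(None))  # rollback: offload disabled
    print(offload_settings(8.0))   # conservative native-backend pilot
```

Keeping the disabled case as an empty override makes the rollback path a one-line change, which is the property the adoption checklist asks you to validate in staging.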
Sources & References
- vLLM Documentation v0.7.3 — Authoritative serving framework documentation; canonical reference for PagedAttention, KV cache configuration, and LMCache integration surfaces
- vLLM Cache Config API (stable) — Defines kv_offloading_size and the native and lmcache backends; authoritative for configuration semantics
- vLLM Production Stack — KV Cache Sharing — Documents remote KV cache sharing via LMCache as a production deployment pattern
- vLLM Production Stack — Disaggregated Prefill — Frames disaggregated prefill as an alternative memory-scaling strategy distinct from host-memory offload
- vLLM LMCache Examples — Documents disaggregated prefilling, CPU offloading, and KV cache sharing as distinct integration modes
- LMCache Documentation — Primary source for 3–10× delay savings claim; covers multi-round QA and RAG use cases; documents CPU offload quickstart
- LMCache Benchmarking Documentation — Source for the "without offloading, no improvement in TTFT nor throughput" benchmark framing on second-round reuse
- LMCache CPU Offload Quickstart — Documents CPU DRAM as a supported offload destination
- NVIDIA Developer Blog — How to Reduce KV Cache Bottlenecks with NVIDIA Dynamo — NVIDIA's authoritative framing of the cost-latency-capability trade-off; source for "not scalable" quote
- NVIDIA Developer Blog — NVIDIA Dynamo Accelerates LLM-D Community Initiatives — Source for the Dynamo KV Cache Manager tiered-hierarchy description; positions offload as capacity strategy not pure performance play
Keywords: vLLM 0.7.3, vLLM Production Stack, LMCache, PagedAttention, StreamingLLM, NVIDIA H100, HBM, CPU DRAM, PCIe Gen5, NVLink, TTFT, Llama 3.1 70B, SGLang, NVIDIA Dynamo



