
Should you offload KV cache to host memory in production inference stacks?

Offloading KV cache to host memory can raise effective concurrency when HBM is the bottleneck, but it is best treated as a spend-shift decision: lower GPU-memory pressure and fewer OOMs versus higher TTFT and the hidden cost of extra system complexity, CPU–GPU interconnect traffic, and platform engineering time.


Bottom Line: KV cache offload to CPU DRAM is a spend-shift, not a free win. When HBM fragmentation or long-context pressure caps concurrency and causes OOMs, offloading relieves that pressure and can deliver meaningful TTFT and throughput improvements; LMCache reports 3–10× delay savings in reuse-heavy workloads like multi-round QA and RAG. The cost side is real: every reloaded cache block crosses the CPU–GPU interconnect (PCIe on typical H100 hosts), tail latency rises under burst traffic, and operating a tiered memory path adds observability surface, queueing logic, and rollback complexity that nobody budgets for. As NVIDIA frames it plainly: "Holding large KV Caches in GPU memory for long durations is not scalable and forces providers to choose between cost, latency, and capability." The decision belongs to teams whose constraint is HBM capacity, whose workloads have high context-reuse rates, and whose SLOs can absorb variable TTFT. Teams with hard latency SLOs, short-context or low-reuse traffic, or minimal memory pressure should evaluate a larger HBM SKU before adding an offload tier.


What problem KV cache offload actually solves in production

KV cache offload moves the key-value tensors generated during the attention prefill phase from GPU HBM to CPU DRAM or farther-out storage tiers. Its value in production is narrower than marketing often implies: it is a memory-capacity mechanism, not a compute accelerator.

In a standard PagedAttention-backed serving stack, KV blocks occupy HBM for the lifetime of a request. At long context lengths — 32K, 64K, 128K tokens — a single Llama 3.1 70B request can claim tens of gigabytes of HBM before generating a single output token. When concurrent requests compete for those blocks, the scheduler either queues new arrivals or preempts in-flight requests, both of which degrade throughput. Offload breaks that deadlock by evicting cold blocks to a cheaper tier and reloading them when the attention computation needs them again.

Where offload pays is in reuse-heavy workloads: multi-round QA, RAG pipelines with repeated shared prefixes, and agentic loops that re-enter the same document context. LMCache's documentation states that "by offloading KV caches, LMCache can reduce both time-to-first-token (TTFT) and GPU cycles". The gains come not from the offload itself but from the resulting ability to serve more concurrent requests with the available HBM budget — and to reload a previously computed cache block rather than recomputing the prefill from scratch.

Pro Tip: Memory relief and latency improvement are distinct effects. Offload always relieves HBM pressure. It only improves TTFT when a reusable KV block is already resident in the offload tier — if the block must be recomputed anyway, the PCIe transfer overhead is pure cost with no compensating latency benefit.
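The distinction between an HBM hit, a DRAM reload, and a full recompute can be made concrete with a toy model. The sketch below is illustrative only: `TwoTierKVCache` is a hypothetical class, both tiers use plain LRU, and real serving stacks use more elaborate policies and asynchronous transfers.

```python
from collections import OrderedDict


class TwoTierKVCache:
    """Toy model: an LRU 'HBM' tier that spills evicted blocks to an LRU 'DRAM' tier."""

    def __init__(self, hbm_blocks: int, dram_blocks: int):
        self.hbm = OrderedDict()    # block_id -> True, most recently used last
        self.dram = OrderedDict()   # same layout for the offload tier
        self.hbm_blocks = hbm_blocks
        self.dram_blocks = dram_blocks
        self.stats = {"hbm_hit": 0, "dram_reload": 0, "recompute": 0}

    def _admit_to_hbm(self, block_id):
        self.hbm[block_id] = True
        self.hbm.move_to_end(block_id)
        while len(self.hbm) > self.hbm_blocks:
            cold_id, _ = self.hbm.popitem(last=False)  # evict the coldest HBM block
            self.dram[cold_id] = True                  # offload it instead of discarding
            self.dram.move_to_end(cold_id)
            while len(self.dram) > self.dram_blocks:
                self.dram.popitem(last=False)          # DRAM tier full: block is lost

    def access(self, block_id) -> str:
        if block_id in self.hbm:
            self.hbm.move_to_end(block_id)
            outcome = "hbm_hit"
        elif block_id in self.dram:
            del self.dram[block_id]                    # reload pays the interconnect cost
            self._admit_to_hbm(block_id)
            outcome = "dram_reload"
        else:
            self._admit_to_hbm(block_id)               # nothing cached: prefill again
            outcome = "recompute"
        self.stats[outcome] += 1
        return outcome


cache = TwoTierKVCache(hbm_blocks=2, dram_blocks=2)
for block in ["a", "b", "c", "a", "c"]:
    print(block, cache.access(block))
```

With no DRAM tier (`dram_blocks=0`), the fourth access in this trace becomes a recompute rather than a reload, which is the case where offload machinery would add cost without any reuse benefit.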

Why HBM, not compute, is usually the first constraint

On an NVIDIA H100, compute throughput scales with batch size, but HBM capacity is fixed: 80 GB of HBM3 on the SXM5 part and 94 GB on the H100 NVL (141 GB requires stepping up to the H200). Model weights for Llama 3.1 70B in BF16 consume roughly 140 GB, requiring at least two 80 GB H100s. On a two-GPU setup, the HBM remaining after weights, typically 20–40 GB depending on configuration, is the total KV budget across all concurrent requests. At 32K context, a single 70B request can consume 8–12 GB of that budget, capping concurrency at two to four simultaneous long-context requests before the scheduler starts queuing.

That ceiling is a memory problem, not a FLOP problem. The GPU could process larger batches if the KV blocks fit. CPU DRAM on the same host — typically 512 GB to 2 TB in a dense GPU server — dwarfs the HBM budget by one to two orders of magnitude, which is why the offload direction exists.

Watch Out: Do not conflate model weight size with KV growth. Weights are static and load once; KV cache grows linearly with sequence length and batch size simultaneously. A workload that runs fine at 4K context can OOM at 32K even though the model is identical. Profile KV memory under your actual context-length distribution before sizing either your HBM budget or your offload tier.
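The linear growth is easy to check with arithmetic. The sketch below assumes the published Llama 3.1 70B architecture (80 layers, 8 KV heads under grouped-query attention, head dimension 128) and BF16 KV storage; those values come from the model's public config rather than from this article, and the helper names are hypothetical. It ignores allocator overhead and block-granularity padding.

```python
GIB = 1024 ** 3

# Assumed architecture values for Llama 3.1 70B (from the public model config).
LLAMA_70B = dict(num_layers=80, num_kv_heads=8, head_dim=128, dtype_bytes=2)


def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    # K and V each store num_kv_heads * head_dim values per layer per token.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_bytes_for_request(context_tokens, **model):
    return context_tokens * kv_bytes_per_token(**model)


def max_concurrent_requests(kv_budget_bytes, context_tokens, **model):
    # How many full-length requests fit in the KV budget at once.
    return int(kv_budget_bytes // kv_bytes_for_request(context_tokens, **model))


for ctx in (4096, 32768, 131072):
    gib = kv_bytes_for_request(ctx, **LLAMA_70B) / GIB
    print(f"{ctx:>7} tokens -> {gib:.2f} GiB")
# 4096 tokens -> 1.25 GiB, 32768 tokens -> 10.00 GiB, 131072 tokens -> 40.00 GiB

print(max_concurrent_requests(20 * GIB, 32768, **LLAMA_70B))  # 2
print(max_concurrent_requests(40 * GIB, 32768, **LLAMA_70B))  # 4
```

Under these assumptions a 20–40 GiB KV budget holds two to four 32K-token requests, which matches the concurrency ceiling described above.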

How long context and multi-turn chat inflate cache residency

Multi-turn chat and agent workloads create a compounding residency problem. Each additional turn appends tokens to the existing sequence, and the KV tensors for all prior turns must remain accessible until the conversation ends. At turn 10 of a 2K-token-per-turn chat, the KV residency is equivalent to a 20K-token single-shot request — and the user has not left the session.

KV cache offload delivers its highest value here precisely because prior-turn KV blocks are candidates for eviction to DRAM between user messages and reloaded when the next prefill begins. LMCache explicitly calls out multi-round QA and RAG as the use cases where this reuse pattern yields 3–10× delay savings and GPU cycle reduction. The benchmarking documentation reinforces the conditional: "without offloading, even with KV Cache reuse, there is no improvement in TTFT nor throughput" on the second query round — meaning that cache reuse requires the offload infrastructure to materialize.

Pro Tip: The 3–10× delay savings LMCache cites is tied to repeated context overlap. Measure your prompt reuse rate — the fraction of incoming requests that share a significant prefix with a cached block — before committing to offload infrastructure. A reuse rate below 20% produces marginal gains that rarely justify the operational overhead.
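One way to estimate prompt reuse rate from a request log is sketched below. It assumes requests are available as token-ID lists in arrival order, that "significant prefix" means at least `min_shared_blocks` leading blocks of `block_size` tokens, and that nothing is ever evicted; all three are illustrative choices, not definitions from LMCache or vLLM.

```python
def prompt_reuse_rate(requests, block_size=16, min_shared_blocks=4):
    """Fraction of requests whose leading blocks were already seen in earlier requests."""
    seen_prefixes = set()
    reused = 0
    for tokens in requests:
        full_blocks = len(tokens) // block_size
        # Each key is the whole prefix up to a block boundary, so a match
        # implies every earlier block matched as well.
        keys = [tuple(tokens[: i * block_size]) for i in range(1, full_blocks + 1)]
        shared = 0
        for key in keys:
            if key not in seen_prefixes:
                break
            shared += 1
        if shared >= min_shared_blocks:
            reused += 1
        seen_prefixes.update(keys)
    return reused / len(requests) if requests else 0.0


system_prompt = list(range(100))
log = [
    system_prompt,                    # first sighting: nothing to reuse
    system_prompt + [999] * 20,       # shares six full 16-token blocks
    list(range(1000, 1100)),          # unrelated prompt
]
print(prompt_reuse_rate(log))
```

Storing whole prefixes as keys is quadratic in prompt length; a production measurement would more likely chain per-block hashes, but the resulting rate is the same quantity.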


The production economics: what you save on GPU memory versus what you pay elsewhere

Framing offload as "free memory" is the most common misreading of this feature. NVIDIA's own Dynamo documentation describes a tiered hierarchy spanning HBM, CPU DRAM, SSDs, and networked storage, and positions each tier as a cost-capacity substitution, not a cost-free expansion. The table below maps the trade across production variables; all cost ranges are qualitative estimates because no single official source publishes a unified ROI model for this decision.

| Dimension | GPU-only (HBM) | Host-memory offload (CPU DRAM) |
| --- | --- | --- |
| Memory unit cost | High (HBM embedded in GPU) | Low–moderate (DDR5 DRAM) |
| Capacity | 80–94 GB/GPU (H100 variants) | 512 GB–2 TB/host |
| Read latency | ~1–2 µs (HBM bandwidth) | ~10–50 µs + PCIe Gen5 transfer |
| TTFT impact | Baseline | +latency when a block must be reloaded from DRAM; −latency on cache hit vs. recompute |
| OOM risk | High at long context | Reduced; spills to DRAM rather than preempting |
| Operational complexity | Low (single memory tier) | Moderate–high (eviction policy, transfer queue, observability) |
| Engineering setup time | Near zero | Estimated 2–6 weeks for production-grade integration |

The GPU-memory spend avoided is real: teams running 128K-context workloads on H100s can serve materially more concurrent sessions without adding GPU nodes. The cost surfaces elsewhere — PCIe bandwidth consumed by evict/reload cycles, DRAM capacity provisioned per host, and engineering time to operate the eviction and reload path correctly.

What changes in capacity planning when KV cache moves to host memory

When KV cache spans two tiers, capacity planning becomes a two-variable sizing problem. The HBM budget must hold enough hot blocks to satisfy current-batch attention without stalling compute; the CPU DRAM budget must hold enough cold blocks to avoid evicting to SSD, which would introduce milliseconds of latency rather than microseconds.

| Tier | Sizing driver | Rule of thumb |
| --- | --- | --- |
| HBM (hot cache) | Active-batch KV at peak concurrency | Target 60–70% HBM utilization after weights; leave headroom for prefill spikes |
| CPU DRAM (warm cache) | Cold-block retention for reuse-heavy sessions | Size to 4–8× the HBM KV budget as a starting point; tune by measuring eviction hit rate |
| SSD / networked (cold cache) | Long-tail session persistence | Only viable if latency SLO allows 1–5 ms reload; most interactive workloads cannot |

NVIDIA's Dynamo KV Cache Manager explicitly describes this hierarchy as a tiered strategy "like CPU host memory, SSDs, or networked storage", reinforcing that the sizing model must account for all occupied tiers simultaneously. Teams upgrading from single-tier HBM serving need to re-baseline their host server specs: CPU DRAM slots, DRAM speed, and NUMA topology all become first-class capacity variables.
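The rules of thumb in the table can be turned into a first-pass sizing calculation. This is a sketch under one reading of the table: the 60–70% target is taken as the share of post-weights HBM planned for steady-state KV, and the DRAM tier as 4–8× that figure. `size_tiers` and its parameters are hypothetical names.

```python
def size_tiers(total_hbm_gib, weights_gib, hbm_kv_utilization=0.65, dram_multiplier=(4, 8)):
    """First-pass tier sizing from the rule-of-thumb table (all values in GiB)."""
    free_after_weights = total_hbm_gib - weights_gib
    if free_after_weights <= 0:
        raise ValueError("model weights do not fit in the available HBM")
    # Plan only part of the free HBM for KV; the rest is headroom for prefill spikes.
    hbm_kv_gib = free_after_weights * hbm_kv_utilization
    low, high = dram_multiplier
    return {
        "hbm_kv_gib": hbm_kv_gib,
        "dram_min_gib": hbm_kv_gib * low,
        "dram_max_gib": hbm_kv_gib * high,
    }


# Example: four 80 GB GPUs serving a model with ~140 GB of weights.
plan = size_tiers(total_hbm_gib=320, weights_gib=140)
print({name: round(value) for name, value in plan.items()})
```

The output is a starting range only; the table's advice to tune by measured eviction hit rate still applies.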

Every KV block reload from CPU DRAM to HBM crosses an interconnect. On PCIe Gen5 x16, peak bandwidth is approximately 64 GB/s per direction. NVLink 4.0 offers 900 GB/s of peak GPU-to-GPU bandwidth, but that does not help host-memory offload on PCIe-attached GPUs such as the H100, where evicted blocks must traverse the CPU–GPU PCIe link.

NVIDIA acknowledges this explicitly: the cost-latency-capability trade-off is inherent to the decision, not a pathology to be engineered away. The latency tax scales with:

  • Block size at eviction time (longer context → larger blocks → higher transfer cost)
  • Concurrency at reload time (multiple simultaneous reloads saturate PCIe bandwidth)
  • NUMA distance between the GPU and the DRAM DIMM hosting the evicted block

No official source publishes a single P99 penalty number because the tax is topology-dependent. Teams must measure it in their actual hardware configuration under representative traffic — not extrapolate from a single-GPU benchmark.
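A lower bound on the reload cost follows directly from the bandwidth figure. The sketch below assumes peak PCIe Gen5 x16 bandwidth of 64 GB/s, fair sharing of the link among simultaneous reloads, and no overlap with compute; `reload_seconds` and the `efficiency` knob are hypothetical, and real transfers will be slower once NUMA distance and pinned-memory behavior are included.

```python
def reload_seconds(block_bytes, link_gb_per_s=64.0, concurrent_reloads=1, efficiency=1.0):
    """Best-case time to move block_bytes from host DRAM to HBM over one link."""
    if concurrent_reloads < 1:
        raise ValueError("concurrent_reloads must be at least 1")
    # Each reload gets an equal share of the achievable link bandwidth.
    effective_bytes_per_s = link_gb_per_s * 1e9 * efficiency / concurrent_reloads
    return block_bytes / effective_bytes_per_s


GIB = 1024 ** 3
kv_32k = 10 * GIB  # roughly one 32K-token Llama 3.1 70B request in BF16

for n in (1, 4, 8):
    ms = reload_seconds(kv_32k, concurrent_reloads=n) * 1000
    print(f"{n} concurrent reload(s): {ms:.0f} ms")
```

Even at peak bandwidth, reloading a full 32K-token context takes well over 100 ms, and the figure scales linearly with the number of reloads sharing the link, which is why burst traffic dominates the tail.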

Watch Out: Tail latency under burst traffic is the sharpest edge of this trade-off. When request concurrency spikes, multiple sessions may trigger simultaneous cache reloads, and PCIe bandwidth becomes a serialization bottleneck.

The engineering cost nobody budgets for

Operating a tiered KV cache path in production means operating three additional systems that did not exist before: an eviction policy engine, a transfer queue that must not starve the serving path, and observability instrumentation that makes cache-hit rates and reload latencies visible.

The vLLM Production Stack frames remote KV cache sharing via LMCache as a way to "enable more KV cache hits and potentially mak[e] the deployment more fault tolerant," which is accurate, but moving large KV caches from GPU memory to a shared store also introduces additional failure domains. If the LMCache layer crashes or becomes slow, serving degrades in ways that are harder to diagnose than a simple GPU OOM.

Pro Tip: Before enabling offload in production, instrument your serving stack to expose KV cache hit rate, reload latency by percentile, and PCIe utilization per GPU. Without these metrics, you cannot distinguish a well-tuned offload path from one that is silently degrading under load. Plan for a platform engineering pilot period before your first production rollout.
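As a rough illustration of the first two metrics, the sketch below aggregates per-lookup events into a hit rate and reload-latency percentiles. The event format and the outcome labels (`hbm_hit`, `dram_reload`, `miss`) are assumptions made for this example; a real deployment would read whatever counters its serving stack exports, and PCIe utilization would come from hardware telemetry rather than from this code.

```python
import statistics


def summarize_offload_metrics(events):
    """events: iterable of (outcome, latency_ms) pairs, one per cache lookup."""
    events = list(events)
    total = len(events)
    hits = sum(1 for outcome, _ in events if outcome in ("hbm_hit", "dram_reload"))
    reload_ms = sorted(ms for outcome, ms in events if outcome == "dram_reload")

    summary = {"hit_rate": hits / total if total else 0.0,
               "reload_p50_ms": None, "reload_p95_ms": None, "reload_p99_ms": None}
    if len(reload_ms) >= 2:
        # quantiles(n=100) returns 99 cut points; index k-1 is the k-th percentile.
        cuts = statistics.quantiles(reload_ms, n=100, method="inclusive")
        summary.update(reload_p50_ms=cuts[49], reload_p95_ms=cuts[94], reload_p99_ms=cuts[98])
    return summary


sample = ([("dram_reload", float(ms)) for ms in range(1, 101)]
          + [("hbm_hit", 0.0)] * 100
          + [("miss", 0.0)] * 50)
print(summarize_offload_metrics(sample))
```

Tracking P99 separately from P50 matters here because the failure mode described above, contention during bursts, shows up in the tail long before it moves the median.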


Where the ecosystem draws the line: vLLM, LMCache, and host-memory offload paths

The major serving frameworks treat host-memory offload as one option among several memory-scaling strategies, not a default capability. The table below maps conceptual positioning rather than a feature-completeness score, since no official source provides a unified benchmark across these systems.

| System | Offload approach | Cache sharing | Disaggregated prefill | Maturity signal |
| --- | --- | --- | --- | --- |
| vLLM 0.7.3 (native) | kv_offloading_size, native CPU backend | Single-instance only | No | Stable docs, activates on explicit config |
| vLLM Production Stack | LMCache integration for remote KV sharing | Multi-instance, shared store | Supported via LMCache | Production-stack docs published |
| LMCache (standalone) | CPU offload + remote store backends | Cross-instance KV sharing | Full disaggregated prefill | Active development, 3–10× gains cited for reuse workloads |

What vLLM's docs and production stack emphasize

The vLLM Production Stack frames its LMCache integration around two production properties: cache-hit rate improvement and fault tolerance. The remote KV cache sharing docs describe how "remote KV cache sharing moves large KV caches from GPU memory to a remote shared storage, enabling more KV cache hits and potentially making the deployment more fault tolerant." Fault tolerance here means that a replica failure does not invalidate all cached KV state — the remote store persists it.

LMCache as an integration layer extends vLLM's native offload capability with cross-instance sharing, making it relevant for multi-replica deployments where the same document context might be served by any node in the fleet. The vLLM docs describe the LMCache folder as demonstrating "disaggregated prefilling, CPU offloading and KV cache sharing" as distinct use cases — teams should pick the mode that matches their actual bottleneck rather than enabling all three simultaneously.

Pro Tip: The vLLM native offload backend is the lowest-complexity entry point. Use it for a single-instance pilot to measure the latency tax in your topology before introducing the LMCache remote-store path, which adds network hops and a separate service to operate.

Why some stacks prefer disaggregated prefill or cache sharing instead

Host-memory offload is not the only way to manage KV pressure, and for some workloads it is the wrong choice. The vLLM Production Stack documents disaggregated prefill as a separate deployment pattern: prefill computation runs on dedicated nodes with large HBM budgets, and only the resulting KV tensors are transferred to decode nodes. This eliminates the prefill-decode interference problem entirely — no eviction, no reload, no PCIe round-trip on the hot path.

| Strategy | Memory pressure relief | Latency on cache miss | Cross-instance reuse | Operational complexity |
| --- | --- | --- | --- | --- |
| KV cache offload (CPU DRAM) | High | Adds PCIe reload latency | No (native); Yes (LMCache) | Moderate |
| Disaggregated prefill | High | Low (no eviction cycle) | No (KV stays local to decode node) | High (separate prefill cluster) |
| Prefix/radix cache sharing | Moderate | Low (reuse in HBM) | Yes (shared store) | Low–moderate |

KV cache offload is the right choice when memory capacity is the binding constraint and the workload has high reuse rates. Disaggregated prefill is preferable when prefill latency — not memory capacity — is the SLO bottleneck. Prefix cache sharing is the lowest-complexity option when HBM is not exhausted but recomputing shared prefixes wastes GPU cycles.


Workload patterns where KV cache offload makes sense

The answer to "is offloading KV cache to host memory better than increasing GPU memory?" is workload-conditional. LMCache's evidence base for the 3–10× gains covers multi-round QA and RAG — both workloads defined by high prompt-overlap ratios and repeated context reuse. NVIDIA's framing covers the capacity case: when HBM is exhausted and adding GPU nodes is the only alternative, offload to CPU DRAM defers that hardware spend.

| Workload pattern | Latency tolerance | Typical context length | Reuse rate | Offload verdict |
| --- | --- | --- | --- | --- |
| Multi-turn chat (conversational AI) | Moderate (P95 < 2s TTFT) | 8K–128K | High | Strong candidate |
| RAG with shared document corpus | Moderate–low | 16K–64K | High (corpus prefix) | Strong candidate |
| Agentic loops (tool-call chains) | Low–moderate | 32K–200K | Moderate | Candidate with monitoring |
| Single-shot code completion | Low (P95 < 500ms) | 2K–8K | Low | Unlikely to justify offload |
| Batch summarization / offline jobs | High (minutes) | 8K–128K | Variable | Good candidate if GPU-bound |
| Real-time low-latency API | Very low | 1K–4K | Low | Avoid; buy more HBM |

Long-context chat and agent workloads

Multi-round QA and agentic loops are the workloads where KV cache offload delivers its documented gains. In a multi-turn chat session, prior-turn KV blocks are computed once and can be reused across all subsequent turns in the same session. Without offload, those blocks compete with new-session prefills for HBM, causing preemptions that destroy the reuse value entirely. With offload, cold blocks from idle sessions evict to DRAM, freeing HBM for active sessions, and reload when a returning user triggers the next turn.

LMCache explicitly maps this pattern to its 3–10× delay savings claim for multi-round QA. The gain materializes only when the reuse rate is high — single-shot or highly diverse prompts with little overlap cannot benefit from cache reuse regardless of where the cache resides.

Burst-heavy, memory-fragmented serving environments

When HBM fragmentation — not raw capacity — is the root cause of concurrency limits, KV cache offload can unblock scheduling by giving the eviction policy a destination for stranded blocks. NVIDIA's Dynamo release positions offload as enabling "the instant transfer of KV Cache from limited GPU memory to larger cost-efficient storage," targeting exactly this fragmentation-driven pressure.

| Traffic shape | Offload benefit | Risk |
| --- | --- | --- |
| Steady moderate concurrency | Reliable HBM relief | Low tail-latency risk |
| Burst-heavy with idle troughs | High HBM relief during bursts | PCIe saturation at burst peaks |
| Highly skewed request sizes | Moderate; large requests dominate eviction | Thrashing between tiers |

| Decision | When to choose offload | When to avoid it |
| --- | --- | --- |
| Strong candidate | Fragmentation blocks scheduling and repeated sessions can reuse cached prefixes | Bursts are short, reuse is low, or the SLO cannot absorb reload variance |
| Weak candidate | None | Traffic is mostly single-shot or the host lacks DRAM headroom |

Watch Out: Under load spikes, offload can produce fragile wins. Multiple simultaneous reloads compete for PCIe bandwidth, and if the eviction policy is too aggressive, blocks evicted during a burst may be immediately needed again, creating a thrashing loop that raises both TTFT and GPU utilization without improving throughput.

When a bigger GPU is cheaper than an offload path

For workloads with hard sub-500ms TTFT SLOs, short context lengths, or low reuse rates, the engineering cost of operating an offload path almost certainly exceeds the cost of provisioning more HBM. NVIDIA's own framing acknowledges the trade explicitly: the decision is between "cost, latency, and capability" — and when latency is non-negotiable, adding a slower memory tier is the wrong direction.

| Scenario | Prefer more HBM | Prefer CPU DRAM offload |
| --- | --- | --- |
| Hard P99 TTFT SLO < 500ms | ✓ | |
| Context length < 8K, low reuse | ✓ | |
| Team bandwidth limited | ✓ | |
| 128K+ context, high reuse, cost-sensitive | | ✓ |
| Multi-instance fleet, shared corpus | | ✓ (with LMCache remote store) |

The H200 at 141 GB HBM3e offers roughly 1.75× the HBM of an 80 GB H100 SXM5, and because weight size is unchanged, the extra capacity goes almost entirely to the KV budget. For teams whose memory pressure stems from a handful of long-context sessions rather than fleet-wide concurrency, upgrading the GPU SKU is often a single-ticket change versus a multi-week offload integration project.


Adoption checklist for a production team

Adopting KV cache offload in production requires platform readiness across four dimensions: observability, queueing policy, rollback safety, and failure-domain isolation. The vLLM Production Stack's framing of LMCache as a mechanism for "potentially making the deployment more fault tolerant" is accurate — but fault tolerance of the cache store does not simplify the serving path's own failure modes.

| Readiness dimension | Minimum before pilot | Target before full rollout |
| --- | --- | --- |
| Observability | KV cache hit rate metric; PCIe utilization per GPU | P50/P95/P99 reload latency; eviction frequency; tier-miss rate |
| Queueing policy | Explicit eviction priority rule | Adaptive policy tuned to context-length distribution |
| Rollback path | kv_offloading_size disabled → native HBM fallback | Canary deployment with traffic split and automated SLO comparison |
| Failure isolation | LMCache service restart does not block serving | LMCache failure gracefully degrades to no-reuse, not to hard failure |

Production Note: Instrument the offload path before enabling it in production traffic. Without KV cache hit rate and reload latency visibility, you cannot distinguish a correctly functioning offload path from one that is silently adding latency due to NUMA misalignment, PCIe contention, or an eviction policy that thrashes under your traffic shape. Validate the rollback path in staging: disabling kv_offloading_size must restore baseline latency without a serving restart.

Signals that justify a pilot

A pilot is justified when the bottleneck is clearly KV-driven HBM exhaustion rather than compute or weight-loading pressure:

| Signal | Pilot recommendation |
| --- | --- |
| OOM events on GPU correlate with long-context or multi-turn sessions, not with batch size alone | Start a pilot; the bottleneck is likely KV capacity |
| The KV cache occupancy metric shows greater than 80% utilization during peak traffic windows | Start a pilot; HBM headroom is already thin |
| Request queuing latency increases proportionally with context length, not with output token count | Start a pilot; the scheduler is memory-bound |
| CPU DRAM headroom on the existing host is 4× or more of current peak HBM KV usage | Start a pilot; the host tier can absorb cold blocks |
| Workload prompt-reuse rate exceeds 30% (indicating offloaded blocks will be reloaded, not discarded) | Start a pilot; reuse is high enough to justify the tier |
| Adding a GPU node to the cluster would cost more than 6 months of engineering time at current team capacity | Start a pilot; offload may defer GPU expansion |

Signals that say no

Offload is the wrong choice when the following conditions hold:

| Signal | Why it disqualifies offload |
| --- | --- |
| TTFT SLO < 300ms P99 | PCIe reload latency alone can exceed this budget under moderate concurrency |
| Context length < 8K, single-shot | KV blocks are small and short-lived; HBM pressure is minimal; offload adds overhead without memory relief |
| Prompt reuse rate < 15% | Blocks are evicted before reuse; gains vanish; only the overhead remains |
| Team already managing > 3 serving-stack integrations | Operational complexity cost exceeds capacity benefit at current team size |
| Memory pressure is from model weights, not KV cache | Offload does not move weights; the constraint is orthogonal |

Watch Out: KV cache offload provides no throughput improvement on workloads without meaningful context reuse. LMCache's own benchmarking documentation states one side of the dependency: "without offloading, even with KV Cache reuse, there is no improvement in TTFT nor throughput". The inverse is equally true: offloading without reuse produces latency cost with no compensating benefit.
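The pilot and disqualifying signals can be encoded as a simple screening function. The sketch below uses only the thresholds quoted in the two signal tables; `WorkloadProfile`, `offload_verdict`, and the three verdict labels are hypothetical, and the team-capacity and cost signals are left out because they are not easily reduced to a number.

```python
from dataclasses import dataclass


@dataclass
class WorkloadProfile:
    p99_ttft_slo_ms: float
    typical_context_tokens: int
    single_shot: bool
    prompt_reuse_rate: float      # 0.0 to 1.0
    peak_kv_utilization: float    # share of the HBM KV budget used at peak
    dram_headroom_ratio: float    # free host DRAM / peak HBM KV usage
    pressure_from_weights: bool = False


def offload_verdict(p: WorkloadProfile) -> tuple[str, list[str]]:
    """Return ("avoid" | "pilot" | "measure_more", reasons)."""
    blockers = []
    if p.p99_ttft_slo_ms < 300:
        blockers.append("P99 TTFT SLO under 300 ms")
    if p.typical_context_tokens < 8192 and p.single_shot:
        blockers.append("short, single-shot context")
    if p.prompt_reuse_rate < 0.15:
        blockers.append("prompt reuse below 15%")
    if p.pressure_from_weights:
        blockers.append("memory pressure comes from weights, not KV cache")
    if blockers:
        return "avoid", blockers

    supports = []
    if p.peak_kv_utilization > 0.80:
        supports.append("KV occupancy above 80% at peak")
    if p.prompt_reuse_rate > 0.30:
        supports.append("prompt reuse above 30%")
    if p.dram_headroom_ratio >= 4:
        supports.append("host DRAM headroom at least 4x peak HBM KV usage")
    # Require every quantitative pilot signal; otherwise gather more data first.
    return ("pilot" if len(supports) == 3 else "measure_more"), supports


chat = WorkloadProfile(2000, 65536, False, 0.60, 0.90, 6.0)
api = WorkloadProfile(250, 2048, True, 0.05, 0.50, 8.0)
print(offload_verdict(chat)[0], offload_verdict(api)[0])
```

A function like this is a screening aid for deciding whether to run a pilot, not a substitute for measuring reload latency in the actual topology.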


FAQ

What is KV cache offloading?

KV cache offloading moves the key-value tensors produced during attention prefill from GPU HBM to a lower-cost memory tier — typically CPU DRAM, and optionally SSD or networked storage. As LMCache defines it: "KV cache offloading allows you to move KV caches from GPU memory to CPU memory or other storage devices." The mechanism enables serving stacks to hold more concurrent session state than HBM alone can fit, at the cost of reload latency when a block is accessed from the offload tier.

Does KV cache offloading improve throughput?

Conditionally yes. Offload improves throughput when HBM capacity — not compute — is the concurrency ceiling and when the workload has sufficient prompt reuse to make reloaded blocks worthwhile. In multi-round QA and RAG, LMCache reports 3–10× delay savings. In single-shot, low-reuse workloads, offload provides no throughput improvement and adds TTFT overhead.

What are the disadvantages of KV cache offloading?

Three categories of cost dominate. First, interconnect latency: every reload crosses a PCIe Gen5 hop, adding tens to hundreds of microseconds per block under load. Second, tail-latency amplification: burst traffic triggers simultaneous reloads that saturate PCIe bandwidth, driving P99 TTFT spikes. Third, operational complexity: an eviction policy, transfer queue, and cache-tier observability stack must be designed, deployed, and maintained — costs that do not appear in "enable offload" configuration guides.

Is offloading KV cache to host memory better than increasing GPU memory?

For workloads with hard latency SLOs or low reuse rates, buying more HBM is almost always the right answer. Offload wins on total cost of ownership when context lengths are long (32K+), reuse rates are high (30%+), and the team has platform engineering capacity to operate the tiered path. The vLLM native backend (kv_offloading_size, native) makes single-instance piloting accessible; the LMCache remote-store path enables cross-instance sharing at the cost of an additional service in the deployment graph.

How do I enable KV cache offload in vLLM?

The vLLM stable config exposes two offload backends: native (built-in CPU offloading) and lmcache (LMCache integration). Offloading activates only when kv_offloading_size is explicitly set — there is no silent default. Start with the native backend at a conservative kv_offloading_size and measure reload latency before switching to the LMCache backend for cross-instance sharing.

