What problem automatic prefix caching solves in vLLM
Automatic prefix caching is vLLM's built-in mechanism for reusing complete KV blocks when multiple requests share the same prompt prefix.
Redundant prompt computation is one of the most expensive hidden costs in production LLM serving. When hundreds of concurrent requests share a common system prompt — a 2,000-token instruction block for a coding assistant, a 4,000-token RAG context prepended to every query — the GPU recomputes identical key-value tensors for every single request. That wasted work inflates time-to-first-token (TTFT) linearly with prefix length and consumes physical memory for KV state that is mathematically identical across sessions.
Automatic prefix caching in vLLM eliminates that redundancy. The vLLM prefix caching documentation describes prefix caching as "a popular optimization in LLM inference to avoid redundant prompt computations," and positions it explicitly as an extension of PagedAttention's block-based KV management rather than a separate caching layer bolted on top. The mechanism reuses KV-cache blocks from previously processed requests whenever a new request shares the same prefix — no re-execution, no extra allocation.
Bottom Line: Automatic prefix caching in vLLM assigns each complete KV block a content-addressed hash derived from its token content and its position in the sequence, stores those blocks in a global hash table, and reuses them across any request that produces the same hash. Eviction follows a strict ordering — refcount zero first, then LRU, then longer-prefix tie-breaking — ensuring live blocks are never reclaimed. The mechanism does not touch sampling or decoding, so outputs remain deterministic for identical prompts. Its hard constraints are partial blocks (which cannot be hashed and cached) and the need to extend the hash key for multi-LoRA and multimodal workloads.
How PagedAttention sets up block-level KV reuse
Automatic prefix caching is not intelligible without PagedAttention as its foundation. PagedAttention's central contribution is decomposing a request's KV cache from a single contiguous allocation into a sequence of fixed-size KV blocks that can occupy non-contiguous physical memory. As the vLLM design doc states: "The core idea of PagedAttention is to partition the KV cache of each request into KV Blocks." Each block holds the attention keys and values for a fixed number of tokens — the block size is a deployment-time constant. "In particular, vLLM introduces Paged-Attention (PA) … which partitions the KV-cache of each request into fixed-size blocks, each holding a predefined number of key-value tokens."
The non-contiguous layout matters because it lets the allocator hand out physical pages on demand rather than reserving a worst-case contiguous region per request. As the same doc notes, "The PagedAttention algorithm allows these blocks to be stored in non-contiguous physical memory so that we can eliminate memory fragmentation by allocating the memory on demand." An arXiv analysis of scalable LLM serving confirms that "vLLM introduces PagedAttention (PA) … which partitions the KV-cache of each request into fixed-size blocks, each holding a predefined number of key-value tokens."
flowchart LR
subgraph Request["Single Request KV Cache"]
direction LR
B0["Block 0\ntokens 0–15"]
B1["Block 1\ntokens 16–31"]
B2["Block 2\ntokens 32–47"]
end
subgraph Physical["Physical GPU Memory (non-contiguous)"]
P3["Physical page 3"]
P7["Physical page 7"]
P1["Physical page 1"]
end
B0 --> P3
B1 --> P7
B2 --> P1
This paging structure is what makes prefix reuse possible at block granularity: if two requests share the first 32 tokens and the block size is 16, both requests' logical blocks 0 and 1 are candidates for mapping to the same physical pages — provided the engine can recognize that the contents are identical.
Why fixed-size KV blocks matter for cache reuse
Fixed block size is the prerequisite for content-addressed identification. Because each block covers exactly the same number of tokens, a block's position in the sequence and its token content together form a complete, stable identity. "Each block contains the attention keys and values for a fixed number of tokens," which means the block boundary is deterministic: block $i$ always spans token positions ([i \cdot B,\, (i+1) \cdot B - 1]) for block size $B$.
flowchart LR
Seq["Token sequence\n[t0, t1, ... t47]"]
Seq --> |"tokens 0-15"| Blk0["Block 0"]
Seq --> |"tokens 16-31"| Blk1["Block 1"]
Seq --> |"tokens 32-47"| Blk2["Block 2"]
Blk0 --> Hash0["hash([], [t0..t15])"]
Blk1 --> Hash1["hash([t0..t15], [t16..t31])"]
Blk2 --> Hash2["hash([t0..t31], [t32..t47])"]
Variable-length blocks would destroy this property: a block boundary shift from one request to the next would produce a different hash even for identical token content, defeating the whole lookup. Fixed size also means the allocator can treat all physical blocks as interchangeable memory units, further reducing fragmentation, as confirmed by the design doc's on-demand allocation model.
Logical blocks, physical blocks, and the new indirection layer
Without prefix caching, a logical block maps directly to one physical block owned exclusively by one request. Automatic prefix caching inserts a hash as an intermediate indirection layer, creating a three-level mapping: logical block → hash → physical block.
flowchart LR
subgraph ReqA["Request A"]
LA0["Logical block 0"]
LA1["Logical block 1"]
end
subgraph ReqB["Request B (same prefix)"]
LB0["Logical block 0"]
LB1["Logical block 1"]
end
GHT["Global Hash Table\nhash → physical block ptr"]
PA0["Physical block P3\n(shared)"]
PA1["Physical block P7\n(shared)"]
PA2["Physical block P2\n(Request A only)"]
PB2["Physical block P9\n(Request B only)"]
LA0 -->|"hash_0"| GHT
LA1 -->|"hash_1"| GHT
LB0 -->|"hash_0"| GHT
LB1 -->|"hash_1"| GHT
GHT -->|"hash_0 hit"| PA0
GHT -->|"hash_1 hit"| PA1
LA0 -.->|"miss → allocate"| PA2
LB0 -.->|"miss → allocate"| PB2
The vLLM design doc states that "vLLM maintains a global hash table of physical blocks" and that "this removes the need for a tree of shared prefixes." The architectural consequence of skipping the tree is significant: lookup is $O(1)$ per block via hash table rather than $O(L)$ per prefix traversal, and the data structure carries no ordering constraint that would complicate concurrent inserts.
How vLLM hashes complete prefix blocks
The hash function is the mechanism that converts a block's identity into an address in the global table. The vLLM design doc states that each block's hash encodes two things: which tokens appear inside the block, and what token sequence preceded it in the request. As the design doc states: "Each KV block can be uniquely identified by the tokens within the block and the tokens in the prefix before the block."
Formally, for block $i$ in a sequence with block size $B$:
$(h_i = \text{hash}!\left(\text{prefix_tokens}{[0,\, iB-1]},\; \text{block_tokens}\right))$
where (\text{prefix_tokens}) is the ordered tuple of all token IDs preceding this block, and (\text{block_tokens}) is the ordered tuple of token IDs within the block.
The engine stores (h_i) as the key in the global hash table, and the value is a pointer to the physical block that holds the materialized KV tensors for those tokens.
Why the hash uses prefix tokens plus block tokens
Including the prefix in the hash prevents collisions between blocks that contain the same tokens but appear at different depths in the sequence. Consider two separate conversations where the word "Paris" appears as the first token of block 3 — the surrounding context is different, so the KV activations are different, and the hashes must differ. Without the prefix, the hash function would map both to the same physical block and corrupt the KV state.
$(h_i^{(A)} = \text{hash}!\left(T^{(A)}{[0,iB-1]},\; T^{(A)}\right))$ $(h_i^{(B)} = \text{hash}!\left(T^{(B)}{[0,iB-1]},\; T^{(B)}\right))$
If (T^{(A)}{[0,iB-1]} \ne T^{(B)}}), then (h_i^{(A)} \ne h_i^{(B)}) with overwhelming probability, even when (T^{(A){[iB,(i+1)B-1]} = T^{(B)}).
This design achieves content-addressed lookup without a radix tree or prefix trie. A radix tree would require the engine to maintain a shared prefix-tree data structure across all active requests, update internal tree nodes on every new request, and traverse from the root on every lookup. The hash table approach collapses that to a single table insertion and a single key lookup per block — each block is independently addressable.
What a global hash table changes in the request path
When a new request arrives, the scheduler walks the request's token sequence block by block from the first block onward. For each block, it computes the hash and probes the global table.
sequenceDiagram
participant Sched as Scheduler
participant GHT as Global Hash Table
participant Pool as Physical Block Pool
participant Req as Request State
Sched->>GHT: lookup(h_0)
GHT-->>Sched: HIT → physical block P3
Sched->>P3: increment refcount
Sched->>Req: map logical[0] → P3
Sched->>GHT: lookup(h_1)
GHT-->>Sched: HIT → physical block P7
Sched->>P7: increment refcount
Sched->>Req: map logical[1] → P7
Sched->>GHT: lookup(h_2)
GHT-->>Sched: MISS
Sched->>Pool: allocate new physical block P11
Sched->>Req: map logical[2] → P11
Note over Sched,Pool: Compute KV for block 2, then register h_2→P11 in GHT
The prefix walk stops at the first cache miss. All blocks at positions beyond the miss are novel and must be computed. The matched prefix blocks are mapped directly into the request's logical block table with their refcounts incremented — the GPU never recomputes those KV tensors. The vLLM design doc says that matched logical blocks reuse the same physical blocks across requests.
Reference counts and lifetime management in the block table
Every physical block in the global hash table carries two metadata fields: a reference count and a last-access timestamp. The reference count tracks how many active requests currently have this block mapped into their logical block table. The last-access timestamp records when the block was most recently used — either when it was first populated or when a new request claimed it from the hash table.
These two fields interact to determine when a block is safe to reclaim. A block with refcount ≥ 1 is pinned: no eviction policy can touch it. A block with refcount 0 is a candidate, but the specific candidate selected for eviction depends on LRU ordering and prefix-depth tie-breaking as a secondary and tertiary sort. "If there are multiple blocks with reference count equals to 0, we prioritize to evict the least recently used block (LRU)."
Pro Tip: If profiling reveals unexpectedly high cache miss rates despite long shared prefixes, inspect whether your request concurrency pattern causes refcounts to drop to zero mid-batch. A shared system prompt block that cycles between refcount 1 and 0 at high frequency will appear frequently in the LRU eviction candidate set, and a concurrent burst of dissimilar requests can evict it before the next batch of prefix-sharing requests arrives. Pinning high-value prefix blocks via a higher steady-state refcount — achieved by keeping at least one "warm" inflight request that holds them — can stabilize cache hit rates.
When a block reaches reference count zero
A block's refcount drops to zero when the last request that referenced it completes or is preempted and its logical block table is torn down. At that point, the block transitions from "live" to "evictable" — it remains in the global hash table with its hash key intact, but it is now eligible for reclamation.
Being evictable does not mean immediate eviction. The block stays in the table as long as physical memory pressure does not require a new allocation. If a subsequent request generates the same hash before the block is reclaimed, the engine increments the refcount back to 1 and reuses the block at zero cost. This is the core temporal reuse window: the period between a block's refcount reaching zero and the allocator selecting it as the LRU victim.
Watch Out: Do not assume that a block with refcount 0 is immediately safe to repurpose from user space or external tooling. vLLM's allocator retains the mapping in the hash table until it explicitly selects and clears the block for a new allocation. Premature external interference with physical GPU memory at those addresses (e.g., via direct CUDA IPC without synchronization with vLLM's allocator) can corrupt the KV state of the next request that hits that cache entry before it is invalidated.
How shared prefixes stay alive across concurrent requests
When two requests share a prefix, both hold a reference to the same physical block — the refcount is 2. The block cannot be evicted as long as either request is inflight. As each request completes, the refcount decrements; the block becomes evictable only when the last sharing request finishes.
This eliminates the need for a shared prefix tree. In a tree-based design, the engine must track which nodes in the tree are shared, synchronize on tree mutations, and handle the case where a branch is partially alive. The hash table design makes sharing implicit: any two requests that independently compute the same block hash for the same token content simply resolve to the same table entry and increment the same refcount counter. As the vLLM design doc notes, this structure "removes the need for a tree of shared prefixes."
Pro Tip: For workloads with a single long shared system prompt (RAG pipelines, tool-call agents with large context), the prefix blocks for that prompt will maintain a refcount proportional to your active request concurrency. At high concurrency, those blocks are effectively permanently pinned — they will never appear in the eviction candidate pool. Design your KV cache budget assuming the shared prefix consumes a fixed, non-evictable portion of physical memory.
Eviction order: reference count, LRU, then longer-prefix tie-breaking
When the physical block pool is exhausted and a new block must be allocated, vLLM's eviction policy applies a strict three-level ordering. The vLLM design doc specifies this ordering explicitly, and it is more precise than competitors' descriptions of a generic "LRU cache":
| Priority | Criterion | Effect |
|---|---|---|
| 1st | Reference count = 0 | Only unreferenced blocks enter the eviction pool |
| 2nd | Least recently used (LRU) | Among refcount-0 blocks, oldest last-access time evicted first |
| 3rd | Longer prefix (tie-breaker) | When LRU timestamps are equal, deeper-prefix blocks are retained |
Why refcount-zero blocks get evicted first
The refcount gate is a correctness constraint, not a performance heuristic. Evicting a block with refcount ≥ 1 would invalidate the logical-to-physical mapping for a live request, producing wrong attention computations or a hard fault. The design doc makes the dependency explicit: only blocks with refcount 0 are eligible.
| Block state | Refcount | Eviction eligible? |
|---|---|---|
| Active (≥ 1 request referencing) | ≥ 1 | No — pinned |
| Idle (no active requests) | 0 | Yes — enters LRU pool |
| Newly allocated (not yet in GHT) | 1 | No — pinned to allocating request |
Watch Out: Any serving scenario that keeps a non-trivial fraction of physical blocks at refcount ≥ 1 at steady state — such as very long-lived streaming requests — shrinks the effective eviction pool. Under high memory pressure, the allocator may find no refcount-0 candidates, forcing it to preempt an active request and evict its blocks by force. Monitor the ratio of pinned to evictable blocks in your deployment; a cluster consistently above ~80% pinned-block utilization will exhibit request preemption rather than graceful eviction.
How LRU ordering works after refcount filtering
After the refcount gate produces the set of evictable blocks, the allocator selects the victim using last-access time. As the vLLM design doc states: "If there are multiple blocks with reference count equals to 0, we prioritize to evict the least recently used block (LRU)."
| Evictable block | Last access | Selected for eviction? |
|---|---|---|
| Block A (refcount 0) | 10 seconds ago | Yes — oldest |
| Block B (refcount 0) | 3 seconds ago | No |
| Block C (refcount 0) | 1 second ago | No |
| Block D (refcount 1) | 15 seconds ago | No — pinned |
LRU reflects the intuition that blocks not accessed recently are less likely to be reused soon. For workloads with a warm shared prefix that is queried continuously, those prefix blocks will have recent last-access times and will rank toward the bottom of the eviction list — they survive pressure even when their refcount momentarily drops to zero between request batches.
Why longer prefixes win as the final tie-breaker
When two or more evictable blocks share identical LRU timestamps — which can happen in batch-admission scenarios where multiple blocks were populated at the same wall-clock time — the design uses prefix depth as the final ranking criterion. Blocks that represent deeper prefixes (more tokens before them in the sequence) are retained over shallower ones.
| Block | Refcount | Last access | Prefix depth | Evicted first? |
|---|---|---|---|---|
| Block at depth 0 (first block) | 0 | tie | shallow | Yes — lower value |
| Block at depth 3 (4th block) | 0 | tie | deeper | No — higher value |
The rationale is asymmetric reuse cost: a shallow block (e.g., block 0 of a shared prefix) is relatively cheap to regenerate if evicted — it covers fewer tokens and can be recomputed quickly. A deep prefix block represents more accumulated computation. Retaining the deeper block preserves the larger computational investment at the cost of evicting the smaller one.
Where the design stops short: partial blocks and non-text extensions
Two structural limitations constrain automatic prefix caching in vLLM: partial blocks cannot participate in the hash-based lookup, and non-text modalities require explicit hash extensions to maintain cache isolation.
Watch Out: Partial blocks — blocks that are not completely filled with tokens — are excluded from the content-addressed cache. Because the hash is computed over a fixed tuple of $B$ token IDs, a block containing fewer than $B$ tokens has no stable hash and cannot be registered in the global hash table. This means the final block of any prefix (which is almost always partial unless the prefix length is an exact multiple of the block size) receives no cache benefit. For a 16-token block size and a 4,097-token system prompt, blocks 0–255 (4,096 tokens) are cacheable; the 4,097th token sits in a partial block and forces re-computation on every request.
Why partial KV blocks are a deliberate edge case
The design doc's block-identification rule — "each KV block can be uniquely identified by the tokens within the block and the tokens in the prefix before the block" — requires the block to be complete. A partial block at position $i$ does not have a fixed token tuple; it changes as the request generates more tokens. Hashing an in-progress block would require invalidating and re-registering its hash table entry with each new token, creating a write-heavy hot path that undermines the $O(1)$ lookup guarantee.
The practical consequence: workloads where prefix lengths are not multiples of the block size always pay full compute cost for the trailing partial block. Aligning system prompt lengths to block-size boundaries is a concrete optimization available to serving teams. If block size is 16 and the system prompt is 4,090 tokens, padding to 4,096 tokens (256 complete blocks) makes the entire prompt cacheable.
Watch Out: Do not interpret the partial-block exclusion as a minor edge case. For conversational multi-turn scenarios, the "current turn" being assembled token-by-token is always a partial block until it completes. Only prior completed turns benefit from prefix caching — the live generation context does not.
What multi-LoRA and multimodal caching imply for the model
The base hash function covers text token sequences only. When serving multiple LoRA adapters or processing multimodal inputs, the same token sequence can produce different KV tensors depending on the adapter weights or the embedding of the non-text input. Two requests with identical text tokens but different LoRA adapters must not share a physical block.
The vLLM documentation on prefix caching extensions addresses this with supplementary hash components: "Extra hashes: Other values required to make this block unique, such as LoRA IDs, multi-modality input hashes … and cache salts to isolate caches in multi-tenant environments." The block hash becomes a composite key:
$(h_i^{\text{extended}} = \text{hash}!\left(\text{prefix_tokens},\; \text{block_tokens},\; \text{lora_id},\; \text{modality_hash},\; \text{tenant_salt}\right))$
Pro Tip: In multi-LoRA deployments, cache hit rates are adapter-scoped. Two requests sharing a system prompt but targeting different adapters will never share a physical block — their composite hashes diverge at the LoRA ID component. Size your KV cache budget per adapter, not per deployment. For multimodal workloads, the
modality_hashcomponent means that image-prefixed requests only share blocks when both the image content and the text prefix are identical — a much narrower sharing opportunity than text-only scenarios.
Practical implications for serving teams
Automatic prefix caching benefits are real but bounded. The mechanism eliminates redundant KV computation for shared prefixes, which reduces TTFT proportionally to the fraction of the prompt that is cached. For a 4,096-token shared system prompt followed by a 128-token user query, a cache hit means the engine runs attention only over 128 novel tokens instead of 4,224 — the TTFT reduction is proportional to the compute saved on the 4,096-token prefix. A separate analysis of KV-cache management confirms that "when managed inefficiently, this memory can be significantly wasted by fragmentation and redundant duplication, limiting the batch size," which is the failure mode prefix caching directly targets.
Production Note: Memory headroom on 80 GB HBM is not uniformly reclaimed by prefix caching. The shared prefix blocks consume physical memory whether one request or one hundred requests reference them. The savings appear in avoided allocations for new requests: instead of allocating (N_{\text{prefix}}) fresh blocks per request, the engine allocates zero additional blocks for the shared portion. This translates directly to higher achievable batch concurrency — more requests can be inflight simultaneously within the same physical memory budget — which increases throughput under load, not just TTFT for individual requests.
What prefix caching can and cannot fix in production
Prefix caching reduces TTFT for requests whose prefix matches a cached block sequence. It does not reduce decode latency (time-per-output-token), because the decode phase generates novel tokens with no prefix to reuse. It does not reduce memory for the novel portion of each request — new blocks still require full allocation. And it cannot help when the shared prefix is shorter than one block size.
Production Note: The ROI on prefix caching is highest in three specific workload shapes: (1) high-concurrency deployments with a large fixed system prompt, where the shared prefix blocks achieve near-permanent pinning; (2) multi-turn chat where each turn's prior context is an exact prefix of the next turn's input, allowing progressive block reuse; and (3) batch inference over the same document corpus, where per-document KV blocks can be shared across multiple queries about the same document. The ROI is near-zero for workloads with entirely unique per-request prompts, short prompts under one block size, or prompts that vary in the first block (breaking the prefix chain at depth 0).
What to watch in profiling and cache-hit analysis
vLLM exposes per-request cache hit metrics through its scheduler logs and OpenAI-compatible API response metadata. The critical ratio is the number of cached blocks divided by the total blocks in the prompt — this is the effective prefix hit rate. A hit rate below 50% on a workload that should benefit from prefix caching typically indicates one of three root causes: prefix length not aligned to block boundaries (partial-block waste), high request diversity breaking prefix chains at shallow depth, or memory pressure causing LRU eviction of shared prefix blocks before the next batch arrives.
Pro Tip: When chasing TTFT regressions after enabling long context (e.g., expanding from 8K to 128K context window), the first variable to examine is whether the enlarged context fits in the available KV budget without pushing high-value prefix blocks out of the LRU pool. A 128K-token context fills 8,192 blocks at block size 16 — a single inflight long-context request can evict shared prefix blocks that dozens of concurrent short-context requests depend on. Segment your physical block pool budget between long-context and short-context request classes before attributing TTFT regressions to other causes.
Does prefix caching change model outputs?
Prefix caching does not alter model outputs for identical prompts under correct operation. The mechanism reuses pre-computed KV tensors — the same tensors the model would have computed fresh — so attention over the cached prefix produces bit-identical results to a full recomputation. The vLLM design doc positions prefix caching as an optimization that avoids redundant computation, not one that approximates or compresses it.
The only scenario where output could diverge is a hash collision — two distinct token sequences producing the same block hash — which maps to a wrong physical block. Cryptographically sound hash functions make this probability negligible in practice. A second source of apparent divergence is non-deterministic sampling: if temperature > 0, output tokens vary per run regardless of caching, and operators may incorrectly attribute sampling variance to the caching mechanism.
Watch Out: Prefix caching interacts with cache salts in multi-tenant environments. If a deployment uses tenant-specific cache salts (as the vLLM docs describe for isolation), requests from different tenants with identical prompts will not share blocks — by design. Removing or misapplying the salt can cause cross-tenant block sharing, which is a security boundary violation even if outputs appear correct from a model-accuracy standpoint. Treat cache salts as a mandatory isolation mechanism, not an optional optimization.
FAQ and source references
Q: How does prefix caching work in vLLM? vLLM computes a hash for each complete KV block using the block's tokens and all preceding tokens, then stores the physical block in a global hash table keyed by that hash. Subsequent requests with matching prefixes resolve those hash lookups and map directly to the existing physical blocks, skipping re-computation.
Q: What is automatic prefix caching? It is vLLM's built-in mechanism to eliminate redundant KV-cache computation for shared prompt prefixes, operating transparently at the block level without requiring changes to request format or model code.
Q: How does vLLM choose which KV cache blocks to evict? The eviction policy applies three criteria in order: (1) only blocks with refcount 0 are eligible, (2) among those, the least recently used block is evicted first, (3) when LRU timestamps tie, shallower-prefix blocks are evicted before deeper-prefix blocks.
Q: Does prefix caching affect model outputs? Not for identical prompts under correct hash function behavior. Cached KV tensors are mathematically identical to freshly computed tensors. Output variation after enabling prefix caching is almost always attributable to non-deterministic sampling, not the caching mechanism.
Q: What are the limitations of prefix caching in vLLM? The three principal limitations are: (1) partial blocks — the final block of any prefix is uncacheable unless the prefix length aligns to block size; (2) modality scope — multi-LoRA and multimodal workloads require composite hash keys and reduce the sharing opportunity; (3) memory pressure — under high concurrency with large long-context requests, LRU eviction can displace shared prefix blocks, eliminating the cache hit entirely.
| Aspect | vLLM Automatic Prefix Caching | SGLang RadixAttention | TensorRT-LLM Prefix Caching |
|---|---|---|---|
| Lookup structure | Global hash table (flat, $O(1)$) | Radix tree ($O(L)$ traversal) | Implementation-dependent |
| Partial block handling | Not cached | Tree node may be partial | Not publicly documented |
| Eviction ordering | refcount → LRU → prefix depth | LRU-based tree pruning | Not publicly documented |
| Multi-LoRA support | Composite hash key with LoRA ID | Not documented in public spec | Not publicly documented |
| Multimodal support | Composite hash with modality hash | Not documented in public spec | Not publicly documented |
Sources & References
- vLLM Automatic Prefix Caching Design Doc (v0.7.2) — canonical architectural specification for block hashing, reference counting, and eviction ordering
- vLLM Prefix Caching Documentation (latest) — extended design covering composite hash keys for LoRA IDs, multimodal inputs, and tenant cache salts
- Efficient Memory Management for Large Language Model Serving with PagedAttention (arXiv 2309.06180) — original PagedAttention paper documenting KV-cache fragmentation and the non-contiguous block allocation model
- Scalable LLM Serving Survey (arXiv 2601.03067) — independent confirmation of vLLM's fixed-size block partitioning design
Keywords: vLLM 0.7.2, PagedAttention, automatic prefix caching, KV cache, logical block table, physical block table, reference count, LRU eviction, hash(prefix_tokens + block_tokens), SGLang, TensorRT-LLM, longer-prefix tie-breaking, multi-LoRA, multimodal caching, GPT-4o



