How to enable FP8 KV cache quantization in vLLM without breaking prefix caching

vLLM's FP8 KV cache can coexist with prefix caching because automatic prefix cache keys are block-hash based. On ROCm with an AMD Radeon W7900, however, the combination has a documented crash path, so this article shows the exact FP8 calibration path and the validation checks that prove prefix cache reuse still works.

By AxiomLogica Editorial

May 2, 2026 · 18 min read

Reviewed by Editorial

At a Glance: safe FP8 KV cache setup in vLLM

At a Glance: Time: ~30–90 min depending on calibration path · Prereqs: vLLM installed, a supported GPU, familiarity with vllm serve flags · Hardware: any vLLM-supported CUDA GPU for the core workflow; AMD Radeon W7900 / RDNA3 users must read the ROCm crash caveat before proceeding · Cost: no additional license cost; dataset calibration requires a one-time CPU/GPU pass through llm-compressor

vLLM supports FP8 KV cache quantization and prefix caching simultaneously — with one critical exception. On AMD Radeon W7900 / RDNA3 under ROCm 6.3 with vLLM 0.7.2, enabling both --kv-cache-dtype fp8 and --enable-prefix-caching together causes a process crash. On all other supported stacks, the combination works because prefix cache keying is block-hash based, entirely independent of the KV dtype stored in those blocks.

The official vLLM quantized KV cache documentation describes three enablement modes: raw kv_cache_dtype="fp8" with no calibration, on-the-fly random-token calibration via calculate_kv_scales=True, and the recommended highest-quality path using dataset-based calibration through llm-compressor. This article walks each mode to production, shows the validation sequence that proves prefix cache reuse survives the dtype change, and gives you the isolation procedure for the ROCm crash case.

What this setup changes in the serving path

FP8 KV cache quantization changes one thing in the vLLM serving path: the in-memory representation of key and value tensors written to and read from PagedAttention's block allocator. The rest of the serving path — tokenization, attention computation in FP16/BF16, sampling, output formatting — is unchanged.

Production Note: The memory-savings vs. accuracy trade-off is calibration-dependent. Uncalibrated FP8 (kv_cache_dtype="fp8" alone) reduces KV footprint by roughly half compared to BF16 storage, at the cost of uncontrolled quantization error. Dataset-calibrated FP8 via llm-compressor achieves the same memory reduction with minimized accuracy regression because per-layer scale factors are derived from representative activations rather than a fixed range. As vLLM's docs state, "quantizing the KV cache to FP8 reduces its memory footprint" — but quality depends on how you compute the scales. Prefix cache reuse behavior is unaffected by KV dtype because the cache lookup path operates on block hashes, not on the tensor values themselves.

Why KV cache memory becomes the bottleneck

KV cache is the dominant consumer of HBM during long-context inference. For each token in the context, every layer stores a key tensor and a value tensor, each of shape [num_kv_heads, head_dim]. At BF16 (2 bytes per element), a 70B-class model with 80 layers, 8 KV heads, and a head dimension of 128 stores 2 × 80 × 8 × 128 × 2 = 327,680 bytes (320 KiB) of KV data per token, or roughly 0.33 GB per 1,000 tokens per sequence. At 32K context, one sequence consumes 10 GiB of KV cache; at batch size 8, that is 80 GiB on top of the weights. FP8 halves this, directly expanding the number of concurrent requests that fit in a fixed HBM budget.
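The per-token arithmetic can be checked with a short sketch. The model shape used here (80 layers, 8 KV heads, head dimension 128) is an assumed, Llama-3-70B-like configuration chosen for illustration; substitute the values from your own model's config.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_elem: int) -> int:
    """Bytes of KV cache that one token occupies across all layers."""
    # Factor of 2: one key tensor and one value tensor per layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


# Assumed, illustrative model shape (not read from any real config file).
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128

bf16 = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, bytes_per_elem=2)
fp8 = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, bytes_per_elem=1)

GIB = 1024 ** 3
context = 32 * 1024  # tokens in one 32K sequence

print(f"BF16: {bf16 // 1024} KiB/token, {bf16 * context / GIB:.1f} GiB per 32K sequence")
print(f"FP8:  {fp8 // 1024} KiB/token, {fp8 * context / GIB:.1f} GiB per 32K sequence")
```

Only the element size changes between the two calls, which is why the FP8 figure is exactly half the BF16 figure.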

Pro Tip: The capacity benefit is non-linear under prefix caching. When a shared system prompt occupies 8K tokens, FP8 KV cache halves the cost of storing those cached blocks. Every new request that hits a prefix cache block pays nothing extra for those first 8K tokens. Combining FP8 with prefix caching compounds the memory efficiency of both features — which is exactly why confirming they coexist correctly matters.
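To make the compounding effect concrete, here is a rough sketch that compares the total KV footprint of a batch with and without a shared prefix, at both element sizes. The batch size, per-request suffix length, and model shape are assumptions picked for illustration, and the function models storage only, not vLLM's scheduler.

```python
def kv_footprint_bytes(num_requests: int, prefix_tokens: int, unique_tokens: int,
                       bytes_per_token: int, share_prefix: bool) -> int:
    """Total KV bytes for requests that all begin with the same prefix."""
    # With prefix caching the shared prefix is stored once; without it, once per request.
    prefix_copies = 1 if share_prefix else num_requests
    return bytes_per_token * (prefix_copies * prefix_tokens
                              + num_requests * unique_tokens)


# Assumed model shape: 80 layers, 8 KV heads, head_dim 128 (K and V, 2 bytes each for BF16).
BF16_PER_TOKEN = 2 * 80 * 8 * 128 * 2
FP8_PER_TOKEN = BF16_PER_TOKEN // 2
GIB = 1024 ** 3

for label, per_token in (("BF16", BF16_PER_TOKEN), ("FP8", FP8_PER_TOKEN)):
    for share in (False, True):
        total = kv_footprint_bytes(16, 8192, 1024, per_token, share)
        print(f"{label:4s} share_prefix={share!s:5s}: {total / GIB:.2f} GiB")
```

In this sketch the two savings multiply: sharing the prefix removes the duplicated blocks, and FP8 halves whatever remains.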

As vLLM v0.8.0 documentation noted, FP8 KV cache "increases the number of tokens that can be stored in the cache, improving throughput" — the direct lever for higher concurrency under long context.

Where prefix caching still fits in the request lifecycle

Prefix caching in vLLM operates at the block-allocator layer of PagedAttention. When a request arrives, vLLM computes a hash for each KV-cache block based on the token IDs within that block and the token IDs of the prefix preceding it. If a block with the same hash exists in cache, vLLM reuses it directly, skipping the prefill computation for that block.

Watch Out: Prefix caching depends entirely on block-hash reuse, not on the KV dtype stored in the block. Switching to FP8 changes the values inside the blocks but does not change the hash key computation. Cached blocks do not carry over between dtypes, however: changing kv_cache_dtype requires a server restart, and the prefix cache lives in GPU memory, so the restart discards every block cached under the previous dtype. Warm the cache under the target dtype before measuring hit rates.

As vLLM's prefix caching design docs explain, "vLLM chooses a hash-based approach" where "each kv-cache block [is hashed] by the tokens in the block and the tokens in the prefix before the block." That design makes the caching mechanism dtype-agnostic by construction.
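The quoted design can be illustrated with a simplified sketch of chained block hashing. This is not vLLM's implementation: the block size, the use of SHA-256, and the function name are assumptions made for the example. The point it shows is that the only inputs are token IDs and the parent block's hash, so nothing about the KV dtype enters the key.

```python
import hashlib

BLOCK_SIZE = 16  # tokens per KV block; assumed value for illustration


def block_hashes(token_ids, block_size=BLOCK_SIZE):
    """Return one chained hash per full block of token IDs."""
    hashes = []
    parent = b""  # hash of the preceding block; empty for the first block
    full = len(token_ids) - len(token_ids) % block_size  # ignore the partial tail
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        payload = parent + b"|" + ",".join(map(str, block)).encode()
        parent = hashlib.sha256(payload).digest()
        hashes.append(parent.hex())
    return hashes


system_prompt = list(range(100, 148))       # 48 tokens, i.e. 3 full blocks
request_a = system_prompt + [1, 2, 3, 4, 5]
request_b = system_prompt + [9, 8, 7]

shared = [a for a, b in zip(block_hashes(request_a), block_hashes(request_b)) if a == b]
print(f"{len(shared)} leading blocks are reusable")
```

Because each hash folds in its parent, two requests share a block only when everything before that block also matches.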

Prerequisites and calibration inputs

Before starting, decide which calibration path fits your deployment. The three modes differ in setup cost and output quality:

# Calibration mode decision (descriptive summary, not a vLLM config file)
modes:
  no_calibration:
    flags:
      kv_cache_dtype: "fp8"
    quality: lowest (uncontrolled quantization error)
    setup_cost: zero

  random_token_calibration:
    flags:
      kv_cache_dtype: "fp8"
      calculate_kv_scales: true
    quality: moderate (scale derived from random tokens at startup)
    setup_cost: minutes at server launch

  dataset_calibration:
    tool: llm-compressor
    flags:
      kv_cache_dtype: "fp8"  # required at serve time
    quality: highest (scales derived from representative data)
    setup_cost: one-time calibration run, produces a quantized model artifact

As vLLM's quantized KV cache docs state: "For the highest-quality quantization, we recommend calibrating against a dataset using llm-compressor." The calibration artifact produced by llm-compressor is loaded directly by vLLM — no separate scale-injection step.

When raw fp8 is acceptable

Raw FP8 is acceptable for proof-of-concept deployments, internal benchmarking where absolute accuracy is not a constraint, or situations where you need to validate that the serving stack handles FP8 KV blocks at all before investing in calibration.

# Minimal vLLM config for uncalibrated FP8 KV cache
# Use as a serve argument or in a YAML config file
kv_cache_dtype: "fp8"

Per vLLM docs: "No Calibration: Use kv_cache_dtype=\"fp8\" directly." No scale files, no pre-processing — the server starts and stores all KV tensors in FP8 using a fixed or internally-derived range. The docs are explicit that this is not the recommended path for production quality.

When to use random-token scale calibration

Random-token calibration is the middle path: it computes per-layer KV scale factors at server startup using a sample of random tokens, without requiring an offline calibration run. Use it when you need better quantization quality than raw FP8 but cannot run a full dataset calibration pass.

# Random-token calibration: compute KV scales on the fly
kv_cache_dtype: "fp8"
calculate_kv_scales: true  # triggers on-the-fly scale computation at startup

As vLLM v0.9.2 docs describe: "To calculate kv cache scales on the fly enable the calculate_kv_scales parameter." The scale computation happens once at launch and is then frozen for the session. This is a calibration surrogate — scales are derived from random tokens rather than domain-representative data, so they may be suboptimal for specialized domains.
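Why the scale matters can be seen in a small pure-Python simulation. This is a sketch under stated assumptions, not vLLM's kernel: it assumes the E4M3 flavour of FP8 (3 mantissa bits, largest finite value 448), a single per-tensor scale, and made-up activation values with one outlier.

```python
import math

E4M3_MAX = 448.0  # largest finite value in the assumed FP8 E4M3 format


def round_to_e4m3(v: float) -> float:
    """Round v to the nearest value representable in FP8 E4M3 (simulated)."""
    if v == 0.0:
        return 0.0
    v = max(-E4M3_MAX, min(E4M3_MAX, v))   # saturate out-of-range inputs
    _, exp = math.frexp(v)                  # |v| lies in [2**(exp-1), 2**exp)
    step = 2.0 ** max(exp - 4, -9)          # 3 mantissa bits; subnormal step is 2**-9
    return max(-E4M3_MAX, min(E4M3_MAX, round(v / step) * step))


def fake_quantize(values, scale):
    """Quantize to simulated FP8 with a per-tensor scale, then dequantize."""
    return [round_to_e4m3(x / scale) * scale for x in values]


def max_abs_error(values, scale):
    return max(abs(x - y) for x, y in zip(values, fake_quantize(values, scale)))


activations = [0.02, -1.5, 3.0, 900.0]      # made-up values with one outlier
calibrated = max(abs(x) for x in activations) / E4M3_MAX

print("error with scale=1.0:       ", max_abs_error(activations, 1.0))
print("error with calibrated scale:", max_abs_error(activations, calibrated))
```

With a scale of 1.0 the outlier saturates at 448 and the error is large; a scale derived from the observed maximum keeps every value in range. Scales derived from random tokens may miss the outliers that real traffic produces, which is the limitation described above.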

When dataset-based calibration is the preferred path

Dataset calibration via llm-compressor is the recommended production path. Most guides mention that llm-compressor exists; few connect its output to the prefix-caching validation that proves the combination works end-to-end.

The calibration workflow uses your target model (e.g., Llama 3.1 8B Instruct) and a representative dataset to compute per-layer FP8 KV scale factors. The output is a quantized model directory that vLLM loads directly.

# Install llm-compressor
$ pip install llmcompressor

# Run dataset-based KV cache calibration
# quantize_kv_cache.py and the flags below are placeholders for your own
# adaptation of the official example script:
# https://docs.vllm.ai/projects/llm-compressor/en/latest/examples/quantization_kv_cache/
$ python quantize_kv_cache.py \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --dataset <your-calibration-dataset> \
    --output-dir ./Llama-3.1-8B-Instruct-FP8-KV

As the llm-compressor KV cache quantization example documents: "The resulting model Meta-Llama-3-8B-Instruct-FP8-KV is ready to be loaded into vLLM." The calibration script produces a model directory containing both the original weights and the computed KV scale tensors embedded in the model's config artifacts. After calibration, you point vLLM at this directory and set kv_cache_dtype="fp8" — the scales are loaded automatically.

The calibration data should be representative of your production traffic. Random or generic calibration data on a specialized model (medical, legal, code) will produce suboptimal scales. Use a sample of 512–1024 real requests when possible.
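One way to assemble such a sample is sketched below: deduplicate logged prompts, drop empty ones, and draw a fixed-seed random subset. The function name, the default of 512, and the synthetic "logged" prompts are assumptions for illustration; how you export prompts from production is up to your own logging setup.

```python
import random


def sample_calibration_prompts(prompts, k=512, seed=0):
    """Pick up to k distinct, non-empty prompts, reproducibly."""
    # dict.fromkeys removes duplicates while keeping first-seen order.
    unique = list(dict.fromkeys(p.strip() for p in prompts if p.strip()))
    rng = random.Random(seed)  # fixed seed so the calibration set is repeatable
    return rng.sample(unique, min(k, len(unique)))


# Hypothetical stand-in for prompts exported from production logs.
logged = [f"Summarize ticket #{i % 700}" for i in range(2000)] + ["", "   "]

calibration_set = sample_calibration_prompts(logged, k=512)
print(len(calibration_set), "prompts selected")
```

Fixing the seed means a later re-calibration uses the same prompts unless the underlying log changes, which makes quality comparisons between runs easier to interpret.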

Enable FP8 KV cache in vLLM

All three calibration paths converge on the same runtime requirement: kv_cache_dtype="fp8" must be set at serve time. As vLLM's docs state: "When running the model you must specify kv_cache_dtype=\"fp8\" in order to enable the kv cache quantization and use the scales."

Minimal launch for uncalibrated fp8

# Minimal uncalibrated FP8 KV cache launch
# Add --enable-prefix-caching if your stack is not the affected ROCm/W7900 combination
$ vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --kv-cache-dtype fp8

This is sufficient to verify that vLLM accepts the dtype flag and starts serving. No scale files are required. Suitable for development validation; not for production quality-sensitive deployments.

Launch with calibrated scales

After running llm-compressor calibration, point vLLM at the output directory. The calibrated model artifact includes embedded KV scale tensors that vLLM reads automatically when --kv-cache-dtype fp8 is set.

# Launch with dataset-calibrated FP8 KV cache
# The calibrated model dir contains embedded scale artifacts from llm-compressor
$ vllm serve ./Llama-3.1-8B-Instruct-FP8-KV \
    --kv-cache-dtype fp8 \
    --enable-prefix-caching

Production Note: The --enable-prefix-caching flag is safe on CUDA-based hardware. On AMD Radeon W7900 / RDNA3 under ROCm 6.3 with vLLM 0.7.2, omit this flag until you have completed the isolation test in the ROCm crash section below. The calibrated model directory name follows the pattern your llm-compressor script defines — adjust ./Llama-3.1-8B-Instruct-FP8-KV to match your actual output path.
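If launches are scripted, the stack caveat can be encoded once instead of remembered at each launch. The helper below is a hypothetical sketch: the function and its affected_rocm_stack parameter are not part of vLLM, and only the two CLI flags already shown in this guide are emitted.

```python
import shlex


def build_serve_command(model, fp8_kv=True, prefix_caching=True,
                        affected_rocm_stack=False):
    """Assemble a `vllm serve` argument list from the flags used in this guide."""
    cmd = ["vllm", "serve", model]
    if fp8_kv:
        cmd += ["--kv-cache-dtype", "fp8"]
    # On the affected W7900 / ROCm 6.3 / vLLM 0.7.2 stack, leave prefix caching
    # off whenever FP8 KV cache is on, until the isolation test has been run.
    if prefix_caching and not (fp8_kv and affected_rocm_stack):
        cmd.append("--enable-prefix-caching")
    return cmd


MODEL_DIR = "./Llama-3.1-8B-Instruct-FP8-KV"
print(shlex.join(build_serve_command(MODEL_DIR)))
print(shlex.join(build_serve_command(MODEL_DIR, affected_rocm_stack=True)))
```

The returned list can be passed directly to subprocess.run, which avoids shell-quoting issues with model paths.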

What changes in logs and server behavior after enablement

The primary startup confirmation is in vLLM's initialization output. When FP8 KV cache loads correctly, the server echoes back the resolved configuration including the KV dtype. There is no silent fallback to BF16 — if the dtype is unsupported on your hardware, vLLM raises an error at startup rather than silently degrading.

Production Note: Confirm FP8 KV cache is active by checking three signals: (1) the startup log echoes kv_cache_dtype=fp8 in the engine configuration summary; (2) if calibrated scales were loaded, the log references the scale tensors from the model artifact; (3) the KV cache capacity reported at startup (GPU blocks or cacheable tokens) is roughly double a BF16 baseline launched with the same --max-model-len and --gpu-memory-utilization settings. Do not expect total GPU memory use to drop: vLLM preallocates the KV cache pool up to the --gpu-memory-utilization budget, so FP8 shows up as more blocks in the same pool rather than a smaller pool. Because vLLM's prefix caching operates on block hashes rather than KV values, you should see identical prefix-cache hit behavior in logs once the server is warm.

Validate that prefix caching still reuses blocks

This is the section most FP8 KV cache documentation skips. Confirming that prefix caching works under FP8 requires sending a controlled two-request sequence and observing cache-hit evidence in logs or metrics. Since vLLM's prefix cache keys are "block-hash based" — hashing each block from its token content and the prefix before it — reuse should be deterministic for identical prompts regardless of KV dtype.

Run a two-request reuse test

Send the same prompt twice to a warm server. The first request fills the prefix cache; the second should hit it.

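# Two-request prefix cache reuse test against the OpenAI-compatible server
# Requires: pip install openai
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Ask the server for its registered model name (a local path if you served the
# calibrated directory) instead of hardcoding the Hugging Face ID.
MODEL = client.models.list().data[0].id

# Repeat the instructions so the prompt spans many KV blocks; only full blocks
# can be reused, so a very short prompt shows no measurable difference.
SHARED_PREFIX = (
    "You are a precise technical assistant. "
    "Answer the following question concisely and accurately.\n\n"
) * 40 + "Question: What is the boiling point of water at sea level?"

for i in range(2):
    t0 = time.perf_counter()
    stream = client.completions.create(
        model=MODEL,
        prompt=SHARED_PREFIX,
        max_tokens=64,
        temperature=0.0,  # deterministic output simplifies response comparison
        stream=True,      # stream so time-to-first-token can be measured
    )
    ttft, text = None, ""
    for chunk in stream:
        if ttft is None:
            ttft = time.perf_counter() - t0  # first chunk approximates TTFT
        if chunk.choices:
            text += chunk.choices[0].text or ""
    print(f"Request {i+1}: TTFT {ttft:.3f}s | {text.strip()[:80]}")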

# Inspect vLLM server logs for prefix cache reuse signals
# Start the server with debug logging before running the two-request test;
# vLLM's log level is set through the VLLM_LOGGING_LEVEL environment variable
$ VLLM_LOGGING_LEVEL=DEBUG vllm serve ./Llama-3.1-8B-Instruct-FP8-KV \
    --kv-cache-dtype fp8 \
    --enable-prefix-caching 2>&1 | tee vllm_serve.log

# After running the two-request test, search for cache hit indicators
$ grep -i "prefix\|cache_hit\|cached_blocks\|hit" vllm_serve.log

The second request's time-to-first-token should be lower than the first when the shared prefix covers full KV blocks. The exact magnitude depends on prefix length and hardware, but the reduction is observable at prompt lengths beyond one block (typically 16–32 tokens depending on block size configuration).
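The block-boundary effect is simple arithmetic: only tokens that fall inside complete blocks are eligible for reuse. The sketch below assumes a block size of 16 tokens; check your own server's block size setting before relying on the numbers.

```python
def cacheable_tokens(prompt_tokens: int, block_size: int = 16) -> int:
    """Tokens covered by full blocks; the trailing partial block is not reusable."""
    return (prompt_tokens // block_size) * block_size


for n in (12, 40, 1000):
    print(f"{n}-token prompt: {cacheable_tokens(n)} tokens eligible for reuse")
```

A prompt shorter than one block yields zero reusable tokens, which is why a short test prompt can make a healthy prefix cache look broken.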

Confirm the prefix hash path is unchanged

Pro Tip: The KV dtype does not enter the prefix hash computation anywhere in vLLM's documented design. As the prefix caching docs state: "we hash each kv-cache block by the tokens in the block and the tokens in the prefix before the block." The hash inputs are token IDs — integers — not the floating-point KV tensors themselves. This means the block-reuse decision is identical whether the blocks are stored in BF16 or FP8. The only correctness concern is whether the FP8-quantized values, when dequantized for attention computation, produce outputs close enough to BF16 — which is a calibration quality question, not a prefix-cache keying question.

What a successful reuse looks like

Production Note: A successful reuse test shows three things: (1) the second request completes faster than the first for a long shared prefix (observable TTFT reduction); (2) the text outputs for both requests are identical or near-identical under temperature=0.0 — small numerical differences may appear due to FP8 quantization of the cached values, but the responses should be semantically equivalent; (3) no errors or warnings in the server logs related to cache-block dtype mismatch. If outputs diverge substantially between the two requests under deterministic sampling, the calibration quality needs investigation — not the prefix-caching mechanism.
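These criteria can be turned into a small automated check. The sketch below is one possible encoding: the thresholds (a 1.2× speedup and 0.9 text similarity) and the sample timings are arbitrary assumptions, not values from vLLM's documentation, and should be tuned to your prompt length and hardware.

```python
from difflib import SequenceMatcher


def reuse_looks_healthy(ttft_first, ttft_second, text_first, text_second,
                        min_speedup=1.2, min_similarity=0.9):
    """Apply the timing and output-similarity checks to one two-request run."""
    faster = ttft_second * min_speedup <= ttft_first
    similarity = SequenceMatcher(None, text_first, text_second).ratio()
    return {
        "faster": faster,
        "similarity": round(similarity, 3),
        "healthy": faster and similarity >= min_similarity,
    }


# Made-up measurements for illustration only.
result = reuse_looks_healthy(0.412, 0.095,
                             "100 degrees Celsius.", "100 degrees Celsius.")
print(result)
```

Character-level similarity is a crude proxy for "semantically equivalent"; it is enough to flag gross divergence under deterministic sampling, not to grade answer quality.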

Known crash path on ROCm and AMD Radeon W7900

The combination of --kv-cache-dtype fp8 and --enable-prefix-caching has a documented crash path on AMD Radeon W7900 / RDNA3 hardware running ROCm 6.3 under vLLM 0.7.2. This is an interaction bug, not a fundamental incompatibility — both features work individually on the affected stack.

Watch Out: If you run vLLM on AMD Radeon W7900 / RDNA3 with ROCm 6.3, do not enable both --kv-cache-dtype fp8 and --enable-prefix-caching simultaneously without first completing the isolation test below. Per vLLM GitHub issue #13147, the process crashes when both flags are active on this stack. Either flag alone avoids the crash.

The reported failure combination

Watch Out: The confirmed crash combination from vLLM issue #13147 is: AMD Radeon W7900 / RDNA3 + ROCm 6.3 + vLLM 0.7.2 + --kv-cache-dtype fp8 + --enable-prefix-caching. As the issue records: "Either removing --enable-prefix-caching or --kv-cache-dtype fp8 solves the problem." This is an interaction bug specific to the named stack. NVIDIA CUDA hardware and other AMD configurations are not reported as affected by this issue.

How to isolate whether fp8 or prefix caching is the trigger

Run a 2×2 isolation matrix with one flag toggled at a time. This separates a dtype-specific failure from a prefix-cache-specific failure from an interaction-only failure.

# Test 1: baseline — neither flag
$ vllm serve meta-llama/Llama-3.1-8B-Instruct

# Test 2: FP8 only — no prefix caching
$ vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --kv-cache-dtype fp8

# Test 3: prefix caching only — no FP8
$ vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --enable-prefix-caching

# Test 4: both flags — the reported crash combination on W7900/ROCm 6.3
$ vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --kv-cache-dtype fp8 \
    --enable-prefix-caching

If Test 4 crashes and Tests 2 and 3 succeed, the failure is a confirmed interaction bug matching issue #13147. If Test 2 or 3 also crashes, the problem is not the combination — investigate hardware-specific FP8 support or prefix-caching configuration separately. Run each test on the same input request to make the comparison controlled.
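The interpretation rules for the four tests can be written down as a small decision function. This is a hypothetical helper, not a vLLM tool: it takes a dictionary keyed by (fp8_kv, prefix_caching) with True where that launch crashed, and returns the diagnosis described above.

```python
def diagnose(crashed):
    """Map 2x2 isolation results to a diagnosis.

    Keys are (fp8_kv, prefix_caching) tuples; values are True if that launch crashed.
    """
    if crashed[(False, False)]:
        return "baseline fails: fix the base install before testing flags"
    fp8_only = crashed[(True, False)]
    prefix_only = crashed[(False, True)]
    if fp8_only or prefix_only:
        culprits = [name for name, bad in (("FP8 KV cache", fp8_only),
                                           ("prefix caching", prefix_only)) if bad]
        return "single-flag failure: investigate " + " and ".join(culprits)
    if crashed[(True, True)]:
        return "interaction-only failure: matches the pattern in issue #13147"
    return "no crash reproduced"


# Example: only the combined launch (Test 4) crashed.
observed = {(False, False): False, (True, False): False,
            (False, True): False, (True, True): True}
print(diagnose(observed))
```

Checking the baseline first avoids attributing an unrelated install problem to either flag.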

Safe fallback when the environment is affected

Production Note: On the affected ROCm/W7900/RDNA3 stack running vLLM 0.7.2, the operational safe choice is to enable one feature at a time. For memory-constrained deployments where FP8 KV cache provides the higher per-request value, disable --enable-prefix-caching and use --kv-cache-dtype fp8 alone. For workloads with high prefix reuse (chatbots, RAG with a fixed system prompt, multi-turn sessions), disable --kv-cache-dtype fp8 and use --enable-prefix-caching alone in BF16. Before adopting the workaround permanently, check whether a newer vLLM release resolves the interaction — the bug was filed against vLLM 0.7.2 and ROCm 6.3 specifically; later releases may have patched it.

Troubleshooting and verification checklist

After enabling FP8 KV cache with prefix caching, verify the following before declaring the deployment production-ready:

# Verification command sequence

# 1. Confirm FP8 dtype is echoed in startup config
$ grep -i "kv_cache_dtype\|kv cache dtype" vllm_serve.log

# 2. Confirm prefix caching is enabled in startup output
$ grep -i "prefix_caching\|prefix caching\|enable.prefix" vllm_serve.log

# 3. Send two identical requests and observe timing delta
$ python reuse_test.py  # the script from the two-request test section

# 4. Check for any dtype mismatch or scale-loading errors
$ grep -iE "error|warn|scale|dtype" vllm_serve.log | head -40

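# 5. Compare KV cache capacity against a BF16 baseline launched with the same settings
#    vLLM preallocates the KV pool, so nvidia-smi memory.used will look the same;
#    FP8 shows up as roughly twice the GPU blocks / cacheable tokens in the startup log
#    (exact log wording varies by vLLM version)
$ grep -iE "gpu blocks|kv cache size" vllm_serve.log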

If scales look off or quality regresses

Watch Out: Quality regression after enabling FP8 KV cache almost always traces back to uncalibrated or poorly calibrated scales. If you launched with kv_cache_dtype="fp8" and no scale artifacts, switch to dataset-based calibration using llm-compressor as the first remediation step. As vLLM's docs state, dataset calibration is the path to "highest-quality quantization" because scales are derived from domain-representative activations. If you used calculate_kv_scales=True, the scales were computed at startup rather than from your traffic, so move to the offline calibration pass with a sample of actual production prompts. If you already calibrated offline and quality still regresses, re-run the calibration with more representative data. Do not hand-edit individual scale values: re-run the full calibration so that every layer's scale comes from the same data.

If prefix hits never appear

If the two-request reuse test shows no TTFT reduction and the logs show no cache-hit events, work through this sequence:

# Check 1: confirm --enable-prefix-caching was actually passed
$ grep -i "enable.prefix\|prefix_caching" vllm_serve.log

# Check 2: verify the server is not restarted between the two requests
# (prefix cache is in-memory; a server restart cold-starts it)
$ ps aux | grep vllm

# Check 3: confirm both requests use the same model and the same tokenization
# Prefix hash is token-ID based — any prompt encoding difference defeats reuse
$ python -c "
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained('meta-llama/Llama-3.1-8B-Instruct')
print(tok.encode('Your shared prefix here'))
"

# Check 4: on the ROCm/W7900 stack, if the server crashed silently,
# both requests may have hit a cold server — check for restart events
$ grep -i "startup\|initialized\|started" vllm_serve.log | tail -10

If you are on the ROCm/W7900/RDNA3 stack and the server shows no crash but also no cache hits, revisit the isolation test — a silent recovery or restart may be masking the crash described in issue #13147.
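Log grepping can be supplemented by reading counters from the server's Prometheus metrics endpoint (/metrics). The sketch below parses Prometheus text format and computes a hit rate from two counters. The metric names in the sample payload are assumptions for illustration and differ between vLLM versions, so list the names your server actually exposes before using them.

```python
def counter_total(metrics_text: str, name: str) -> float:
    """Sum every sample of a metric in Prometheus text format, across label sets."""
    total = 0.0
    for line in metrics_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        metric, _, value = line.rpartition(" ")
        if metric.split("{", 1)[0] == name:
            total += float(value)
    return total


def prefix_hit_rate(metrics_text: str, hits_name: str, queries_name: str) -> float:
    queries = counter_total(metrics_text, queries_name)
    return counter_total(metrics_text, hits_name) / queries if queries else 0.0


# Illustrative payload; metric names and values are assumed, not captured output.
SAMPLE = """\
# TYPE vllm:prefix_cache_queries_total counter
vllm:prefix_cache_queries_total{model_name="demo"} 2048.0
vllm:prefix_cache_hits_total{model_name="demo"} 1008.0
"""

rate = prefix_hit_rate(SAMPLE, "vllm:prefix_cache_hits_total",
                       "vllm:prefix_cache_queries_total")
print(f"prefix cache hit rate: {rate:.1%}")
```

A rate that stays at zero after the two-request test points to one of the checks above; a counter that resets between requests points to a server restart.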

FAQ

What is FP8 KV cache quantization in vLLM?

FP8 KV cache quantization stores the key and value tensors of the attention mechanism in 8-bit floating-point format rather than 16-bit. vLLM's implementation supports three modes — uncalibrated, random-token calibration, and dataset calibration — controlled via kv_cache_dtype="fp8". The feature reduces KV memory footprint, allowing more tokens or more concurrent sequences to fit in a fixed HBM budget.

How do I enable prefix caching in vLLM?

Pass --enable-prefix-caching to vllm serve. Prefix caching is off by default. When enabled, vLLM hashes each KV-cache block by its token content and the prefix before it; identical prefixes reuse cached blocks rather than recomputing them during prefill. This flag is compatible with --kv-cache-dtype fp8 on CUDA hardware.

How do I calibrate FP8 KV cache scales in vLLM?

Use llm-compressor for the recommended dataset-based path: run the calibration script against your model and a representative dataset, then point vllm serve at the output directory with --kv-cache-dtype fp8. For quick calibration without an offline pass, add calculate_kv_scales=True to compute scales from random tokens at startup.

Does prefix caching work with quantized KV cache?

Yes, on CUDA hardware. vLLM's prefix cache keying is block-hash based — it hashes token IDs, not KV tensor values — so the cache lookup mechanism is independent of the KV dtype. The exception is AMD Radeon W7900 / RDNA3 under ROCm 6.3 with vLLM 0.7.2, where the combination crashes.

Why does vLLM crash when prefix caching is enabled?

On the specific stack of AMD Radeon W7900 / RDNA3, ROCm 6.3, and vLLM 0.7.2, enabling both --enable-prefix-caching and --kv-cache-dtype fp8 triggers a process crash. This is a reported interaction bug. Either flag alone avoids the crash on that stack. Because the crash has only been reported on this stack, it appears to be hardware- or driver-specific rather than a design flaw in prefix caching or FP8 KV cache as features.

Pro Tip: The safest enablement sequence on any stack is: (1) start with --kv-cache-dtype fp8 alone and confirm startup; (2) run the two-request reuse test with --enable-prefix-caching added; (3) only then treat the combination as production-validated. On ROCm/W7900, complete the 2×2 isolation matrix before combining flags.

Sources & References

vLLM Quantized KV Cache documentation — primary reference for the three calibration modes, kv_cache_dtype="fp8", calculate_kv_scales=True, and llm-compressor integration.
vLLM Automatic Prefix Caching design docs — source for block-hash-based prefix cache keying and independence from KV dtype.
llm-compressor KV cache quantization example — step-by-step dataset calibration workflow and model artifact format.
vLLM GitHub issue #13147 — reported crash path: FP8 KV cache + prefix caching on AMD Radeon W7900 / RDNA3 / ROCm 6.3 / vLLM 0.7.2.
vLLM v0.8.0 Quantized KV Cache docs (historical) — source for throughput and token-capacity framing of FP8 KV cache.
vLLM v0.9.2 Quantized KV Cache docs (historical) — source for calculate_kv_scales on-the-fly description.
vLLM v0.10.1 Quantized KV Cache docs (historical) — source for scale-loading requirement at serve time.

Keywords: vLLM, FP8 KV cache, prefix caching, llm-compressor, calculate_kv_scales=True, kv_cache_dtype="fp8", PagedAttention, ROCm 6.3, AMD Radeon W7900, RDNA3, Llama 3.1, OpenAI-compatible server, quantized_kvcache, KV cache calibration
