What QServe's W4A8KV4 benchmarks actually show
QServe is a GPU serving system built around a mixed-precision quantization scheme called QoQ (Quattuor-Octo-Quattuor — 4-8-4 in Latin). As the QServe paper states: "To achieve optimal LLM serving throughput, we introduce QoQ... which quantizes LLMs to W4A8KV4 precision: 4-bit weights, 8-bit activations and 4-bit KV caches." (arXiv:2405.04532). The scheme is not a generic compression technique — it is a serving-system co-design where the quantization format is chosen specifically to exploit fused GPU kernels rather than to minimize model file size.
The benchmark picture is strongest at large model scale. Across seven representative LLMs on NVIDIA A100 and L40S GPUs, the QServe paper reports throughput gains over TensorRT-LLM that reach 2.4×–3.5×. On Llama 3.1 8B the gain is more modest: 1.2×–1.4×. The headroom widens substantially at 72B parameters, where memory bandwidth pressure makes the KV4 savings most consequential.
Bottom Line: QServe's W4A8KV4 delivers material throughput gains over TensorRT-LLM — up to 3.5× on large models like Qwen1.5-72B on A100 and L40S — with quality preserved by SmoothAttention; the gains are real but contingent on fused-kernel support and shrink toward 1.2×–1.4× on smaller models, so the scheme matters most when memory bandwidth, not compute, is the bottleneck.
How the benchmarked serving pipeline is set up
The QServe benchmarks run on NVIDIA datacenter GPUs — specifically A100 and L40S — using a pipeline that combines W4A8KV4 quantization with compute-aware weight reordering and register-level parallelism. As the paper explains: "In the QServe system, we perform compute-aware weight reordering and take advantage of register-level parallelism to reduce dequantization latency." (OpenReview PDF). These are not post-hoc optimizations; they are baked into the kernel design.
The OmniServe repository combines QServe (W4A8KV4 quantization) with LServe (unified sparse attention and hierarchical KV cache management) into a single serving system. The two subsystems target different bottlenecks: QServe attacks weight and activation memory costs, LServe addresses sequence-length scaling.
| Serving Stack | Quantization | KV Cache | Sparse Attention | Primary Target |
|---|---|---|---|---|
| QServe (OmniServe) | W4A8KV4 | 4-bit, hierarchical | Yes (LServe) | Throughput on A100/L40S |
| TensorRT-LLM | FP8/INT8/FP16 (configurable) | FP8/FP16 | Limited | Broad NVIDIA deployment |
| vLLM (baseline) | FP16/BF16 default | FP16 | PagedAttention | General-purpose serving |
For benchmark reproduction, the OmniServe repo requires Python 3.10, the CUDA toolkit installed via conda, pip install -e ., and flash-attn. The repo warns that strict PyTorch wheel version matching is necessary; minor CUDA version mismatches are usually acceptable, but PyTorch wheel mismatches are not. The setup notes target PyTorch 2.5 and the flash-attn package used in the repo's installation path.
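As a quick pre-flight check, the versions the repo cares about can be inspected from Python. A minimal sketch, assuming only a working PyTorch install; the version pins mentioned in comments are illustrative, and the authoritative list lives in the OmniServe setup instructions:

```python
# Sanity-check the serving environment before attempting a benchmark reproduction.
# The version comments are illustrative assumptions, not the repo's authoritative pins.
import importlib.util
import sys

import torch

print(f"Python:        {sys.version.split()[0]}")    # repo targets 3.10
print(f"PyTorch:       {torch.__version__}")         # wheel must match the repo's pin (2.5.x)
print(f"CUDA (wheel):  {torch.version.cuda}")        # minor toolkit mismatches are usually tolerable
print(f"GPU available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU:           {torch.cuda.get_device_name(0)}")  # expect A100 or L40S for the published numbers

# flash-attn is a hard dependency of the installation path
if importlib.util.find_spec("flash_attn") is None:
    print("flash-attn not installed; install it before building the fused kernels")
```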
Production Note: QServe's throughput claims depend on fused kernel execution paths. Deploying on GPUs that lack support for the specialized W4A8 GEMM kernels, or running through a generic CUDA runtime that bypasses the fused path, will not reproduce the reported numbers. Before committing to QServe in production, verify fused-kernel support on your specific GPU SKU and driver version.
Why dequantization overhead dominates low-bit serving
The naive expectation for low-bit quantization is straightforward: fewer bits per weight means lower memory bandwidth consumption, which should translate directly to higher throughput. In practice, the bottleneck often shifts from memory bandwidth to dequantization compute — the CUDA-core work required to convert stored INT4 weights back to a higher-precision format before or during matrix multiplication.
Standard W4A16 serving stacks (quantize weights to 4-bit, run activations in FP16) concentrate this dequantization cost on CUDA cores rather than Tensor Cores, creating a low-throughput scalar path that can erase most of the bandwidth savings. QServe's W4A8KV4 approach sidesteps this by using 8-bit activations, which enables INT8 Tensor Core execution for the GEMM itself. The paper explicitly frames QoQ as a system that "transfers theoretical memory saving brought by KV4 attention into measured speedup." (OpenReview PDF) — the KV4 savings only become real throughput gains when the serving stack handles the dequantization path with fused kernels.
Pro Tip: The distinction between W4A16 and W4A8 is not just numerical precision — it is a kernel execution path decision. W4A16 forces dequantization to FP16 before the GEMM, keeping the work on CUDA cores. W4A8 allows the GEMM to run on INT8 Tensor Cores with in-kernel dequantization, which is why QServe's W4A8KV4 can outperform naive W4A16 stacks even though both use 4-bit stored weights.
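The two execution paths can be sketched at the PyTorch level. This is a conceptual illustration, not QServe's fused kernel: `w4a16_path` and `w4a8_path` are hypothetical helper names, the integer GEMM is emulated with rounded floating-point values, and the per-channel scale layout is an assumption. The point is that both paths compute the same product, but only the second keeps the inner GEMM in integer precision with a single output rescale.

```python
import torch

def w4a16_path(x, w_int4, w_scale):
    """W4A16-style: dequantize the stored 4-bit codes to floating point first,
    then run a floating-point GEMM. The dequantize step is elementwise scalar
    work (CUDA cores in a real kernel)."""
    w_deq = (w_int4.float() - 8.0) * w_scale           # per-output-channel scale
    return x @ w_deq.t()

def w4a8_path(x, w_int4, w_scale, x_scale):
    """W4A8-style: quantize activations to INT8 and keep the GEMM in integer
    precision, rescaling once at the output. A real kernel runs the integer
    product on INT8 Tensor Cores; here it is emulated with rounded values."""
    x_q = torch.clamp((x / x_scale).round(), -128, 127)   # INT8 activation codes
    w_q = w_int4.float() - 8.0                             # 4-bit codes fit in INT8 operands
    acc = x_q @ w_q.t()                                    # integer accumulate (emulated)
    return acc * (x_scale * w_scale.t())                   # single output rescale

# Toy shapes: 4 tokens, hidden size 64, output size 64.
torch.manual_seed(0)
x = torch.randn(4, 64)
w_int4 = torch.randint(0, 16, (64, 64)).float()            # stored 4-bit codes (0..15)
w_scale = torch.full((64, 1), 0.02)                        # per-output-channel weight scale
x_scale = x.abs().max() / 127                              # per-tensor activation scale

gap = (w4a16_path(x, w_int4, w_scale) - w4a8_path(x, w_int4, w_scale, x_scale)).abs().max()
print(f"max output gap: {gap:.4f}")   # small gap from INT8 activation rounding; same weights, different path
```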
Benchmark results for prefill and decode
QServe's published throughput numbers reflect an aggregate across seven LLM configurations on two GPU classes. The accessible public sources report relative throughput multiples rather than a decomposed prefill/decode latency breakdown, so the table below presents the verified deltas without fabricating per-stage absolute values.
| Model | GPU | Benchmark signal | Relative throughput / latency figure | Source |
|---|---|---|---|---|
| Llama 3.1 8B | A100 / L40S | End-to-end throughput | 1.2×–1.4× higher throughput than TensorRT-LLM | OmniServe README |
| Qwen1.5-72B | A100 / L40S | End-to-end throughput | 2.4×–3.5× higher throughput than TensorRT-LLM | OmniServe README |
| 7 representative LLMs (aggregate) | A100 + L40S | End-to-end throughput | up to 2.4×–3.5× higher | arXiv:2405.04532 |
| QServe on L40S vs. TensorRT-LLM on A100 | Cross-GPU | Throughput comparison | QServe L40S exceeds TRT A100 | arXiv abstract |
The cross-GPU finding — "QServe on L40S GPU can achieve even higher throughput than TensorRT-LLM on A100" — is the most practically significant result for cost-sensitive deployments. L40S carries a lower datacenter cost than A100 while delivering competitive or superior serving throughput under QServe, which directly changes the cost-per-token calculation for large-model deployments.
The public sources do not expose a complete prefill-versus-decode latency breakdown. Based on where the W4A8KV4 scheme applies pressure — KV4 reduces KV cache memory, which primarily relieves bandwidth at decode where the cache dominates memory traffic — the decode stage likely captures a disproportionate share of the gain. Prefill throughput improvements are more dependent on the W4A8 GEMM efficiency.
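A back-of-the-envelope calculation makes the decode-side pressure concrete. The transformer shapes below (80 layers, 8 grouped KV heads, head dimension 128) are generic 70B-class assumptions, not QServe's benchmark configuration; at decode, roughly this entire cache is re-read for every generation step across the batch, which is why halving KV precision translates almost directly into bandwidth relief.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_len, batch, bits):
    """Total KV cache size: K and V tensors per layer, per token, per sequence."""
    return 2 * num_layers * num_kv_heads * head_dim * context_len * batch * bits / 8

# Illustrative 70B-class GQA shapes: 80 layers, 8 KV heads, head_dim 128.
for bits in (16, 8, 4):
    gb = kv_cache_bytes(80, 8, 128, context_len=4096, batch=64, bits=bits) / 1e9
    print(f"KV{bits}: ~{gb:,.1f} GB of cache, re-read roughly once per decode step across the batch")
```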
Where QServe wins on throughput
The strongest and most reproducible gains appear on large models running on the two benchmarked GPU classes.
| Model size | Throughput gain vs. TensorRT-LLM | Primary driver |
|---|---|---|
| ~8B parameters (Llama 3.1 8B) | 1.2×–1.4× | W4A8 GEMM kernel efficiency |
| ~72B parameters (Qwen1.5-72B) | 2.4×–3.5× | KV4 bandwidth relief + W4A8 GEMM |
The pattern is consistent with physical intuition: at 72B parameters, the KV cache at moderate batch sizes consumes a larger fraction of HBM bandwidth, so the KV4 reduction has a proportionally larger impact. At 8B parameters, the model fits more readily in GPU memory and the serving bottleneck shifts toward compute, where the W4A8 Tensor Core path still helps but does not dominate as dramatically.
As the OmniServe README states: "Compared with leading industry solution TensorRT-LLM, QServe achieves 1.2x-1.4x higher throughput when serving Llama-3-8B, and 2.4x-3.5x higher throughput when serving Qwen1.5-72B, on L40S and A100 GPUs." (OmniServe README)
Where the gains narrow or disappear
The benchmarked improvements are not universal. Several conditions compress or eliminate the advantage.
Kernel fusion is the most critical dependency. QServe's gains come from compute-aware weight reordering and register-level parallelism — both require that the runtime executes the fused kernel path. If any operator in the forward pass falls back to an unfused implementation (for example, a nonstandard attention variant, a custom positional encoding, or an unsupported activation function), the dequantization cost returns to the scalar CUDA-core path and the throughput delta collapses.
At smaller batch sizes and shorter sequence lengths, memory bandwidth pressure decreases, reducing the relative benefit of KV4. When serving a chatbot with single-turn, short-context requests, the KV cache is small and the bandwidth savings are marginal — the workload profile shifts the bottleneck toward prefill compute, where the W4A8 gain is smaller.
Watch Out: If your production workload uses nonstandard attention masks, sliding-window attention, custom RoPE variants, or model architectures not covered by QServe's operator library, the fused kernel path may not activate. Always benchmark the actual operator coverage for your model before treating the published throughput multipliers as guaranteed. W4A8KV4 quantization alone — applied through a generic runtime — does not reproduce QServe's results.
What W4A8KV4 changes in quality and accuracy
W4A8KV4's accuracy story is more nuanced than typical INT4 quantization. The 4-bit weight component carries well-understood quality risks at this point in the literature. The more sensitive decision is quantizing the KV cache to 4 bits, which compresses the attention context that determines output quality at inference time.
The QServe paper addresses KV4 quality risk directly by introducing SmoothAttention. As the abstract states: "Additionally, we develop SmoothAttention to effectively mitigate the accuracy degradation incurred by 4-bit KV quantization." (arXiv:2405.04532) The technique migrates quantization-sensitive outliers in the attention computation to reduce the effective quantization error in the KV cache without abandoning the 4-bit storage format.
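The mechanism can be sketched in a few lines, in the spirit of SmoothQuant-style outlier migration. The scaling rule, the alpha exponent, and the fake-quantization helper below are illustrative assumptions, not QServe's exact recipe; the point is that scaling Key channels down and folding the inverse into Query leaves the attention logits unchanged while shrinking the outliers that hurt 4-bit KV quantization.

```python
import torch

def smooth_qk(q, k, alpha=0.5):
    """Outlier-migration sketch in the SmoothAttention style: scale Key channels down
    by a per-channel factor and fold the inverse into Query, so Q @ K^T is unchanged
    while K becomes easier to quantize to 4 bits. The rule and alpha are illustrative."""
    lam = k.abs().amax(dim=(0, 1)).clamp(min=1e-5) ** alpha   # per-channel key magnitude
    return q * lam, k / lam

def fake_quant(t, bits=4):
    """Symmetric per-tensor fake quantization (round-trip only, for error measurement)."""
    qmax = 2 ** (bits - 1) - 1
    scale = t.abs().max() / qmax
    return (t / scale).round().clamp(-qmax, qmax) * scale

torch.manual_seed(0)
q = torch.randn(1, 128, 64)            # (batch, tokens, head_dim)
k = torch.randn(1, 128, 64)
k[..., 7] *= 20.0                      # inject a key-channel outlier, the failure mode KV4 hits

ref = q @ k.transpose(-1, -2)
plain_err = (q @ fake_quant(k).transpose(-1, -2) - ref).abs().mean()
q_s, k_s = smooth_qk(q, k)
smooth_err = (q_s @ fake_quant(k_s).transpose(-1, -2) - ref).abs().mean()
print(f"attention-logit error, plain KV4:    {plain_err:.4f}")
print(f"attention-logit error, smoothed KV4: {smooth_err:.4f}")
```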
| Quantization scheme | Weight bits | Activation bits | KV cache bits | KV quality mitigation |
|---|---|---|---|---|
| W4A16KV16 (standard INT4) | 4 | 16 | 16 | None needed |
| W8A8KV8 | 8 | 8 | 8 | None typically needed |
| W4A8KV4 (QServe/QoQ) | 4 | 8 | 4 | SmoothAttention required |
| W4A4KV4 | 4 | 4 | 4 | Significant degradation risk |
The public snippets do not expose the complete accuracy benchmark tables (perplexity, task-specific accuracy scores, or exact-match results), so the quality claim should be treated as a paper assertion supported by SmoothAttention's design rationale, not a verified independent reproduction. The paper's framing — positioning W4A8KV4 as maintaining response quality — implies that SmoothAttention keeps quality on par with higher-precision baselines, but engineers deploying on sensitive workloads should run their own accuracy evaluations on the target model family.
Why the 4-bit weights, 8-bit activations, and 4-bit KV cache combination matters
Each component of W4A8KV4 targets a different resource bottleneck, and the combination is specifically designed for serving rather than offline inference.
4-bit weights reduce the model's memory footprint and lower the bandwidth cost of loading weight tensors during the GEMM. At 72B parameters this is substantial: a BF16 model requires ~144 GB, INT4 roughly ~36 GB, enabling single-node deployment on hardware that would otherwise require tensor parallelism across more GPUs.
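The arithmetic behind those numbers is simple enough to sanity-check directly; the sketch below ignores quantization scales and other metadata overhead:

```python
def weight_gb(params_billion, bits):
    """Approximate weight storage only; quantization scales and metadata are ignored."""
    return params_billion * 1e9 * bits / 8 / 1e9

for label, bits in (("BF16", 16), ("INT8", 8), ("INT4", 4)):
    print(f"72B weights at {label}: ~{weight_gb(72, bits):.0f} GB")
# BF16 ~144 GB (multi-GPU territory); INT4 ~36 GB (fits one 80 GB A100 with room left for KV cache)
```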
8-bit activations are the key to keeping the GEMM on INT8 Tensor Cores. FP16 activations with INT4 weights require dequantizing to FP16 before multiplication, which moves the compute to CUDA cores. INT8 activations retain Tensor Core execution, and the quantization error introduced by INT8 activations is more controllable than INT4 activations because the dynamic range is sufficient for most transformer activation distributions.
4-bit KV caches directly reduce the HBM bandwidth consumed by attention at decode time. In long-context or high-concurrency serving, the KV cache dominates memory traffic. Halving the KV cache size from 8-bit to 4-bit nearly doubles the effective KV cache capacity for a given HBM budget, enabling larger batch sizes or longer contexts.
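The capacity side of the same arithmetic, under the same illustrative 70B-class shapes used earlier; the 40 GB KV budget assumes INT4 weights on an 80 GB A100 and is an assumption, not a benchmark setting:

```python
def max_cached_tokens(hbm_budget_gb, num_layers, num_kv_heads, head_dim, kv_bits):
    """How many tokens of KV cache fit in a given HBM budget (batch x context combined)."""
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * kv_bits / 8
    return int(hbm_budget_gb * 1e9 / bytes_per_token)

# Illustrative 70B-class GQA shapes; ~40 GB assumed left for KV cache after INT4 weights.
for kv_bits in (16, 8, 4):
    toks = max_cached_tokens(40, 80, 8, 128, kv_bits)
    print(f"KV{kv_bits}: ~{toks:,} cacheable tokens (e.g. batch 64 x context {toks // 64:,})")
```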
Pro Tip: The QoQ scheme is ordered by serving impact, not model quality sensitivity: KV4 has the largest throughput effect at high concurrency (reduces HBM bandwidth for attention), W4 has the largest footprint effect (enables single-node large-model deployment), and A8 is the enabling condition that keeps the GEMM on Tensor Cores rather than forcing a CUDA-core dequantize-then-multiply path. Remove any one component and the combined gain degrades non-linearly.
How QServe compares with TensorRT-LLM in practice
TensorRT-LLM is NVIDIA's supported production inference library for LLMs. QServe outperforms it on the benchmarked configurations — sometimes substantially — but the comparison is conditional on several factors that matter for real deployments.
The throughput advantage is model-size-dependent and GPU-dependent, ranging from 1.2×–1.4× on Llama 3.1 8B to 2.4×–3.5× on Qwen1.5-72B across L40S and A100. The most striking cross-hardware result — QServe on L40S exceeding TensorRT-LLM on A100 — has direct implications for infrastructure cost. L40S is positioned as a less expensive datacenter GPU than A100; if it delivers higher model throughput under QServe, the cost-per-token advantage compounds both the hardware cost difference and the throughput gain.
Choose QServe when...
- Serving large models (≥70B parameters) where KV cache bandwidth is the primary bottleneck
- Deploying on A100 or L40S with fused-kernel support confirmed for your model architecture
- Cost-per-token is the primary optimization target and you can validate operator coverage
- Your model family is among the seven tested in the paper (Llama, Qwen, and similar transformer architectures)
Choose TensorRT-LLM when...
- You need broad NVIDIA ecosystem compatibility and first-party benchmark tooling (trtllm-bench)
- Your model uses nonstandard operators, custom attention variants, or architectures outside QServe's current coverage
- Your organization requires a supported production library with SLA-backed NVIDIA engineering support
- You are serving smaller models (≤8B) where the QServe throughput delta is 1.2×–1.4× and deployment friction may not justify the switch
Choose another stack when...
- GPU vendor diversity or non-NVIDIA hardware is required
- Dynamic batching and continuous batching with minimal configuration overhead is the priority
When TensorRT-LLM is still the safer choice
Raw benchmark wins do not transfer to production when the serving stack cannot cover the model's operator graph. TensorRT-LLM has been tuned and validated across a broad range of NVIDIA GPU SKUs, driver versions, and model architectures. NVIDIA's benchmarking guidance covers latency, throughput, tokens/sec/user, and concurrency planning via trtllm-bench across a documented hardware matrix — a reproducibility and support surface that QServe, as an MIT research system, does not currently match.
Operators that fall outside the QServe fused-kernel library will degrade performance or fail silently. TensorRT-LLM's compilation model catches most of these at build time. For a production deployment where a silent fallback to an unfused path would cause a capacity incident, the earlier failure mode is preferable.
Watch Out: QServe's fused kernels are currently designed for specific GPU architectures (A100, L40S) and tested model families. Deploying on H100, RTX-class consumer GPUs, or models with nonstandard architectures without first confirming kernel coverage risks running the quantized weights through an unfused path — which may actually reduce throughput compared to TensorRT-LLM's FP8 path on the same hardware.
Where QServe is most compelling for constrained hardware
The A100-vs-L40S cross-GPU result is the clearest cost-efficiency argument for QServe. An L40S GPU runs at a lower per-hour datacenter cost than an A100, yet under QServe it can serve Qwen1.5-72B at higher throughput than TensorRT-LLM achieves on A100.
| GPU | Serving stack | Model | Relative throughput |
|---|---|---|---|
| NVIDIA L40S | QServe (W4A8KV4) | Qwen1.5-72B | Higher than TRT-LLM/A100 |
| NVIDIA A100 | TensorRT-LLM | Qwen1.5-72B | Baseline |
| NVIDIA A100 | QServe (W4A8KV4) | Qwen1.5-72B | 2.4×–3.5× above TRT-LLM/A100 |
| NVIDIA A100 / L40S | QServe (W4A8KV4) | Llama 3.1 8B | 1.2×–1.4× above TRT-LLM |
For an operator already running A100s at full utilization for large-model serving, the argument is even more direct: QServe on the same hardware raises throughput, which reduces the number of nodes required at a given request rate.
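A rough cost-per-token comparison shows how the hardware price and the throughput multiple compound. The hourly rates and the L40S throughput multiple below are placeholder assumptions for illustration only, not vendor pricing or benchmark figures; substitute your own cloud rates and measured throughput.

```python
def cost_per_token(gpu_hourly_cost, relative_throughput):
    """Relative cost per token: hourly cost divided by throughput (lower is better)."""
    return gpu_hourly_cost / relative_throughput

# Placeholder hourly rates and throughput multiples, NOT vendor pricing or benchmark data.
# The L40S multiple just encodes "somewhat above the TRT-LLM/A100 baseline" per the paper's claim.
configs = {
    "TensorRT-LLM on A100": cost_per_token(gpu_hourly_cost=3.0, relative_throughput=1.0),
    "QServe on A100":       cost_per_token(gpu_hourly_cost=3.0, relative_throughput=2.4),
    "QServe on L40S":       cost_per_token(gpu_hourly_cost=1.5, relative_throughput=1.1),
}
baseline = configs["TensorRT-LLM on A100"]
for name, cost in configs.items():
    print(f"{name}: {cost / baseline:.2f}x baseline cost per token")
```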
Implications for applied ML engineers and chip teams
The QServe benchmark evidence supports a specific claim: quantization format selection is a kernel execution path decision, not just a model compression decision. The OmniServe architecture makes this explicit by combining QServe (W4A8KV4) and LServe (sparse attention, hierarchical KV cache) into a unified system — neither component alone achieves the full throughput gain.
For applied ML engineers, the decision framework maps to three variables: model size, GPU hardware, and operator coverage.
| Condition | Recommended approach |
|---|---|
| Large model (≥70B), A100/L40S, standard transformer architecture | QServe W4A8KV4 — strongest throughput gain |
| Large model (≥70B), H100, nonstandard ops | TensorRT-LLM FP8 — broader kernel coverage, lower risk |
| Small model (≤8B), any NVIDIA GPU | TensorRT-LLM or vLLM — QServe gain is modest; maturity wins |
| Long-context serving (>32K tokens), A100/L40S | OmniServe (QServe + LServe) — KV cache management is the bottleneck |
| Non-NVIDIA GPU target | Neither — use hardware-native stack |
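The table can be read as a small policy function. The sketch below is just an illustrative encoding of those rows; the threshold values, argument names, and return strings are heuristics, not recommendations from the paper.

```python
def pick_serving_stack(model_params_b: float, gpu: str, long_context: bool,
                       standard_ops: bool) -> str:
    """Illustrative encoding of the decision table; thresholds and strings are heuristics."""
    if gpu.upper() not in {"A100", "L40S", "H100"}:
        return "hardware-native stack (non-NVIDIA or untested GPU target)"
    if gpu.upper() == "H100" or not standard_ops:
        return "TensorRT-LLM FP8: broader kernel coverage, lower risk"
    if model_params_b >= 70 and long_context:
        return "OmniServe (QServe + LServe): KV cache management is the bottleneck"
    if model_params_b >= 70:
        return "QServe W4A8KV4: strongest throughput gain"
    return "TensorRT-LLM or vLLM: QServe gain is modest at this size"

print(pick_serving_stack(72, "L40S", long_context=False, standard_ops=True))
print(pick_serving_stack(8, "A100", long_context=False, standard_ops=True))
```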
For chip teams evaluating new silicon, QServe's results are evidence for a specific design hypothesis: INT8 Tensor Core paths with in-kernel INT4 dequantization extract more deployed throughput than INT4-only paths that push dequantization to CUDA cores. New architectures should expose dedicated low-latency dequantization units or ensure that INT8 Tensor Core bandwidth is sufficient to absorb the W4→A8 conversion without a separate scalar stage.
The broader engineering lesson from OmniServe is architectural: quantization alone is not enough. The runtime must co-design around the quantization format — handling KV cache layout, attention tiling, weight reordering, and register scheduling as a coherent system rather than bolting quantization onto an existing FP16 pipeline.
FAQ on QServe, W4A8KV4, and low-bit serving
What is W4A8KV4 quantization?
W4A8KV4 stores model weights at 4-bit precision, runs activations at 8-bit precision, and stores the KV cache at 4-bit precision. As the QServe paper defines it: "4-bit weights, 8-bit activations and 4-bit KV caches." (arXiv:2405.04532) The 8-bit activation choice is deliberate — it preserves INT8 Tensor Core execution for the GEMM while still benefiting from 4-bit weight storage.
Is QServe faster than TensorRT-LLM?
On the benchmarked configurations — A100 and L40S GPUs, Llama 3.1 8B and Qwen1.5-72B — yes, by 1.2×–1.4× and 2.4×–3.5× respectively. The gain is not universal; it depends on fused-kernel support for your model architecture and GPU.
How does low-bit quantization affect LLM accuracy?
W4A8KV4 introduces KV4 as the riskiest accuracy trade-off. QServe addresses this with SmoothAttention, which the paper claims effectively mitigates the accuracy degradation from 4-bit KV quantization. Exact perplexity or task-accuracy numbers are in the full paper tables but not exposed in the public abstract snippets — treat quality preservation as a design goal validated by SmoothAttention, and run your own evaluations for production use.
What are the limits of GPU inference optimization?
The primary limit is operator coverage: fused kernels only accelerate operators they are written for. Nonstandard architectures, custom attention variants, or GPU SKUs outside the tested matrix will fall back to unfused paths, erasing the throughput gain. Secondary limits are memory capacity (you need enough HBM to load the quantized model and its activation buffers) and version-sensitive software dependencies (strict PyTorch/CUDA wheel matching in the OmniServe setup).
Watch Out: Benchmarks for QServe are reported for specific GPU/model combinations. Do not extrapolate the 3.5× figure to all models or all GPUs. A workload on a GPU not in the tested matrix, or a model with operators outside QServe's kernel library, may see no improvement or a regression relative to TensorRT-LLM.
Sources & References
- QServe: W4A8KV4 Quantization for Efficient LLM Serving (arXiv:2405.04532) — Primary paper defining QoQ/W4A8KV4 and reporting throughput benchmarks vs. TensorRT-LLM
- QServe arXiv PDF — Full paper PDF with aggregate throughput figures
- QServe / OmniServe OpenReview PDF — System design details: compute-aware weight reordering, register-level parallelism, dequantization latency
- OmniServe GitHub Repository (mit-han-lab/omniserve) — Source code, setup instructions, and per-model throughput benchmarks for QServe and LServe
- OmniServe README — Per-model benchmark table (Llama 3.1 8B, Qwen1.5-72B) and environment setup requirements
- NVIDIA Developer Blog: LLM Inference Benchmarking and Performance Tuning with TensorRT-LLM — TensorRT-LLM benchmark methodology, trtllm-bench tooling, and deployment planning guidance
Keywords: QServe, W4A8KV4, QoQ, TensorRT-LLM, OmniServe, LServe, Llama 3.1 8B, Qwen1.5-72B, NVIDIA A100, NVIDIA L40S, CUDA, PyTorch 2.5, flash-attn, KV cache, W4A8 GEMM


