How we compared GPTQ, AWQ, and SmoothQuant
The decision between GPTQ, AWQ, and SmoothQuant is not a quality horse race — it is a constraints problem. Each method targets a different quantization regime, and the right choice follows directly from your bit-width requirement, your GPU fleet, and your calibration budget. LMDeploy TurboMind supports INT4 inference from both AWQ and GPTQ, while its SmoothQuant path covers INT8 and FP8. TensorRT-LLM similarly surfaces all three methods, with documented examples for INT4-AWQ, INT4-GPTQ, and SmoothQuant.
| Dimension | GPTQ | AWQ | SmoothQuant |
|---|---|---|---|
| Weight bits | INT4 (W4A16) | INT4 (W4A16) | INT8 / FP8 (W8A8) |
| Activation bits | FP16 | FP16 | INT8 or FP8 |
| Calibration cost | ~4 GPU-hours for 175B models | Fast (128 samples, wikitext2) | Moderate (per-channel scale search) |
| Accuracy mechanism | Layer-wise Hessian reweighting | Salient-weight protection via scaling | Activation migration via smoothing factor α |
| Primary back-end support (LMDeploy) | TurboMind INT4 | TurboMind INT4 | INT8/FP8 path |
| FP8 capable | No | No | Yes (H100, Ada Lovelace) |
Comparison criteria that matter in production
Search results for this topic are scattered across paper summaries, framework docs, and serving-framework comparisons that rarely put accuracy, calibration cost, kernel support, and GPU fit in the same place. That fragmentation is the gap this article closes.
Pro Tip: Evaluate these four criteria in order before choosing a method: (1) perplexity impact — does the quantized model pass your downstream eval threshold? (2) calibration time — can you afford a multi-GPU-hour calibration run per model revision? (3) kernel availability — does your serving engine (TurboMind, TensorRT-LLM, vLLM) ship a kernel for that format on your GPU? (4) GPU fit — INT4 weight-only is memory-bound; INT8/FP8 is compute-bound. Match the method to the bottleneck.
GPTQ and AWQ both land in the W4A16 regime — weights compressed to INT4, activations kept at FP16. That shared regime means they compete head-to-head for memory-bandwidth-bound deployments. SmoothQuant is orthogonal: it compresses both weights and activations to INT8 or FP8, making it relevant for compute-bound workloads on modern accelerators.
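The four-criteria checklist above can be expressed as a small decision helper. The function, its names, and its thresholds are illustrative only — a sketch of the criteria ordering, not an API from any of the frameworks discussed:

```python
# Hypothetical helper encoding the four Pro Tip criteria; all names and
# thresholds are illustrative, not part of LMDeploy or TensorRT-LLM.

def choose_quant_method(bottleneck: str, kernels: set,
                        calib_hours_budget: float,
                        accuracy_first: bool) -> str:
    # (4) GPU fit as a hard constraint: a compute-bound workload with a
    #     W8A8 kernel available is the case activation quantization serves.
    if bottleneck == "compute" and "w8a8" in kernels:
        return "SmoothQuant"
    # (3) kernel availability: without a W4A16 kernel, INT4 labels are moot.
    if "w4a16" not in kernels:
        return "SmoothQuant" if "w8a8" in kernels else "FP16 (no kernel fits)"
    # (2) calibration time: sub-hour budgets favor AWQ's 128-sample recipe.
    if calib_hours_budget < 1.0:
        return "AWQ"
    # (1) perplexity impact: accuracy-first teams with time pick GPTQ.
    return "GPTQ" if accuracy_first else "AWQ"
```

The ordering matters: criteria (3) and (4) are hard constraints that eliminate methods outright, while (1) and (2) are tradeoffs between the survivors.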
Test conditions and deployment assumptions
Numbers from these three methods are not portable across serving stacks. A GPTQ perplexity figure measured under AutoGPTQ with a Marlin kernel does not translate directly to a TurboMind or TensorRT-LLM deployment without re-validating on the target engine.
Watch Out: Three variables invalidate cross-stack comparisons more often than algorithm differences: (1) Model size vs. available VRAM — a 70B model at INT4 still requires ~35 GB; fitting it on a 40 GB A100 leaves almost no room for KV cache at larger batch sizes. (2) Batch size during calibration — LMDeploy's docs advise setting `--batch-size 1` under memory pressure; a higher batch size changes calibration statistics. (3) Kernel version — TensorRT-LLM ships kernel builds tied to specific CUDA and compute-capability targets; an INT4-AWQ kernel on sm86 is a different code path than the same-named method on sm75. Always pin your engine version when comparing.
GPTQ: when weight-only INT4 is the right call
GPTQ performs weight-only INT4 quantization using a layer-wise second-order approximation: it minimizes the reconstruction error for each layer's output by reweighting quantization error with the inverse Hessian of the layer inputs. The practical consequence is high accuracy preservation without touching activations, which run in FP16 throughout inference.
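The error-compensation idea can be sketched in a few lines of NumPy. This is a toy, single-scale illustration of the column-wise Hessian reweighting, not the blocked, lazily-updated algorithm from the paper:

```python
import numpy as np

def gptq_like_quantize(W, X, bits=4, damp=0.01):
    """Toy GPTQ-style quantization with Hessian error compensation.

    W: (out, in) layer weights; X: (in, n_tokens) calibration inputs.
    Columns are rounded left to right; each column's rounding error is
    spread over the not-yet-quantized columns via the inverse Hessian
    H^-1 of the layer inputs, so later rounding can absorb it.
    """
    H = X @ X.T
    H += damp * np.mean(np.diag(H)) * np.eye(H.shape[0])  # damping for stability
    Hinv = np.linalg.inv(H)

    W = W.copy()
    Q = np.zeros_like(W)
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max() / qmax  # one global scale for brevity (real: per group)

    for j in range(W.shape[1]):
        q = np.clip(np.round(W[:, j] / scale), -qmax - 1, qmax)
        Q[:, j] = q * scale
        err = (W[:, j] - Q[:, j]) / Hinv[j, j]
        # push the error onto remaining columns so later rounding absorbs it
        W[:, j + 1:] -= np.outer(err, Hinv[j, j + 1:])
    return Q
```

Against round-to-nearest on the same grid, the compensated version typically reduces the layer output error ||WX − QX|| — the same per-layer objective the paper's blocked algorithm optimizes at 175B scale.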
| Comparison factor | GPTQ | Why it matters |
|---|---|---|
| Calibration style | Layer-wise Hessian-guided weight rounding | Preserves sensitive weight directions without quantizing activations |
| Accuracy behavior | Minimal perplexity increase on tested 175B-scale models | Strong option when task evals punish small quality regressions |
| Runtime implications | On-the-fly INT4 → FP16 dequantization before GEMM | Keeps serving in W4A16 and fits memory-bandwidth-bound deployments |
| Calibration corpus | Small dataset such as wikitext2 or C4 | Keeps preparation lightweight relative to full fine-tuning |
| Supported back ends | TurboMind and TensorRT-LLM | Kernel coverage is available on documented GPU families |
Both LMDeploy TurboMind and TensorRT-LLM support GPTQ INT4 inference. The algorithm's accuracy behavior comes from its Hessian reweighting: salient weights that drive high output variance are rounded more carefully than weights with low sensitivity. This differs from AWQ's strategy of protecting only a small fixed fraction of salient weights via per-channel scaling.
Strengths for accuracy-sensitive serving
GPTQ's perplexity preservation is its defining characteristic. The paper reports quantizing OPT-175B and BLOOM-176B in approximately four GPU-hours with minimal perplexity increase on standard language modeling benchmarks. That makes it the method to reach for when accuracy-versus-bitwidth is the primary optimization axis and calibration time is not a bottleneck.
| Benchmark scenario | GPTQ tendency | Quality signal |
|---|---|---|
| Large models (70B+), single-pass calibration | Holds perplexity well | Paper-validated on 175B-scale models |
| Coding-heavy workloads | Competitive | Hessian weighting protects high-variance weight groups |
| Hard prompts with sharp token transitions | Often stable | Better fit than uniform scaling when a few channels dominate |
| Long-context inference | Risk of degradation | Context-length-matched evals still required |
| Models with activation outliers | Less sensitive than W8A8 methods | Activations run at FP16 |
The Hessian-based approach means GPTQ tends to handle weight distributions with high inter-layer variance better than methods that apply uniform per-group scaling. For coding and instruction-following models where certain weight channels carry disproportionate signal, this matters.
Where GPTQ can disappoint
GPTQ's accuracy claims are calibration-set and kernel-dependent. The paper's minimal-perplexity-increase result holds under its tested conditions; it does not extend unconditionally to all serving configurations.
Watch Out: Three failure modes are common in production GPTQ deployments: (1) Calibration sensitivity — the calibration corpus and sequence length affect quantization quality; using a dataset mismatched to your actual prompt distribution can increase perplexity on your eval set beyond acceptable thresholds. (2) Kernel dependence — TensorRT-LLM and LMDeploy ship different INT4 kernels for different compute capabilities; switching engines without re-validating accuracy can mask quality regressions. (3) Long-context degradation — W4A16 methods dequantize weights at inference time, and accumulated rounding error can compound over longer sequences; this is a deployment risk that requires context-length-matched evals, not just short-sequence perplexity.
AWQ: when faster calibration and broad INT4 serving support win
AWQ (Activation-aware Weight Quantization) reaches similar W4A16 accuracy to GPTQ through a different mechanism: it identifies the roughly 1% of weight channels most salient to output quality — determined by activation magnitude — and protects them from aggressive rounding via per-channel scaling, rather than computing full Hessian inverses. The result is a faster calibration procedure with competitive accuracy.
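The scaling trick is simple enough to show directly. The sketch below fixes the scaling exponent rather than searching for it as AWQ does, and all names are illustrative:

```python
import numpy as np

def awq_like_scale(W, act_mag, alpha=0.5):
    """Toy sketch of AWQ-style activation-aware scaling (no scale search).

    W: (out, in) weights; act_mag: (in,) mean |activation| per input channel.
    Channels with large activations get scaled up in the weights before INT4
    rounding, shrinking their relative rounding error; the inverse scale is
    folded into the activations, so the layer output is unchanged.
    """
    s = act_mag ** alpha        # larger scale where activations are salient
    s = s / s.mean()            # keep overall weight magnitude comparable
    return W * s, s             # serve as quantize(W * s) @ (x / s)
```

AWQ's actual recipe grid-searches the exponent per layer to minimize output error; the identity (W·s)(x/s) = Wx is what makes the migration free at inference time once the inverse scale is fused into the preceding operator.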
LMDeploy documents concrete defaults for AWQ calibration:
| Comparison factor | AWQ default | Why it matters |
|---|---|---|
| Dataset | wikitext2 | Matches the documented calibration recipe |
| Calibration samples | 128 | Fast enough for iterative re-quantization |
| Sequence length | 2048 | Preserves the documented context window |
| Weight bits (w-bits) | 4 | Confirms the W4A16 deployment target |
| Group size (w-group-size) | 128 | Maps to the representative kernel layout |
| Batch size | 1 | Recommended under memory pressure |
LMDeploy TurboMind supports AWQ INT4 inference on NVIDIA V100 (sm70), Turing/T4 (sm75), Ampere A100/A10/A30/A16 and 30-series (sm80/sm86), and Ada Lovelace (sm89). This breadth of GPU coverage — from V100 to Ada Lovelace — makes AWQ a practical default for teams whose fleet spans multiple GPU generations.
Why AWQ is attractive for quick iteration
AWQ's calibration requires 128 samples at sequence length 2048 with a batch size of 1 — a configuration that completes in minutes on a single GPU rather than hours. That speed advantage directly reduces the cost of iterating: re-quantizing after a fine-tuning run, testing group-size variants, or validating a new model family does not require blocking a multi-GPU node for hours.
Production Note: LMDeploy ships AWQ/GPTQ quantization as a single-command workflow, so you can quantize and export without stitching together a custom calibration script. If accuracy drops after quantization, the documented fallback is to enable `search-scale`, which runs a scale-search pass to improve per-channel scaling factors. If memory is constrained during calibration, reduce `--calib-seqlen`, increase `--calib-samples` to maintain statistical coverage, and keep `--batch-size 1`; these knobs trade calibration time for memory headroom without discarding the quantized model.
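The seqlen-versus-samples tradeoff works because, at batch size 1, peak calibration memory scales with sequence length while statistical coverage scales with total tokens. A trivial sketch of that accounting (illustrative helper, not an LMDeploy API):

```python
def calib_tokens(samples: int, seqlen: int) -> int:
    """Total calibration tokens: a rough proxy for statistical coverage."""
    return samples * seqlen

# LMDeploy AWQ defaults: 128 samples at sequence length 2048.
assert calib_tokens(128, 2048) == 262_144
# Halving --calib-seqlen while doubling --calib-samples keeps total token
# coverage identical but roughly halves peak activation memory at batch 1.
assert calib_tokens(256, 1024) == calib_tokens(128, 2048)
```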
AWQ's per-channel scaling also makes its quantized weights more hardware-friendly: the INT4 layout with group size 128 maps cleanly to tensor core operations on Ampere and Ada Lovelace without requiring the full Hessian computation GPTQ performs.
Known limits and when AWQ is the wrong choice
AWQ is a weight-only INT4 method. It does not quantize activations. If your deployment target requires W8A8 — either because your backend is optimized for INT8 GEMM throughput or because you're targeting Hopper FP8 hardware paths — AWQ is structurally the wrong choice regardless of its calibration convenience.
Watch Out: Two backend-driven limits override AWQ's calibration advantages: (1) Activation quantization requirements — LMDeploy explicitly separates the AWQ/GPTQ INT4 path from the SmoothQuant INT8/FP8 path; if your serving target demands W8A8 (e.g., for TensorRT-LLM INT8 throughput optimizations or Hopper FP8 utilization), AWQ cannot fulfill that requirement. (2) Kernel availability — TensorRT-LLM and LMDeploy ship AWQ kernels for specific GPU families and engine builds; if your GPU or engine version lacks an AWQ kernel, the format label on your model weights does not help. Verify kernel availability against your exact engine release before committing to AWQ.
SmoothQuant: when INT8 or FP8 matters more than weight-only compression
SmoothQuant targets W8A8 quantization — compressing both weights and activations to INT8 — by migrating quantization difficulty from activations to weights. Transformer activations contain large outliers that make naive INT8 quantization lossy; SmoothQuant divides activation magnitudes by a per-channel smoothing factor α and multiplies the corresponding weight channel by the same factor, leaving the mathematical output unchanged while making both tensors easier to quantize at INT8 precision.
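The migration is a mathematical identity, which a short NumPy sketch makes concrete (toy illustration; the paper applies this per linear layer with calibrated statistics):

```python
import numpy as np

def smooth(X, W, alpha=0.8):
    """Toy per-channel SmoothQuant migration (alpha=0.8 as in the paper).

    X: (n_tokens, in) activations with outlier channels; W: (in, out).
    s_j = max|X_j|^alpha / max|W_j|^(1-alpha) divides each activation
    channel and multiplies the matching weight row, so X @ W is unchanged
    while the activation outliers that break naive INT8 are flattened.
    """
    s = np.abs(X).max(axis=0) ** alpha / np.abs(W).max(axis=1) ** (1 - alpha)
    return X / s, W * s[:, None], s
```

After smoothing, each activation channel's maximum becomes (max|X_j| · max|W_j|)^(1−α), so a 100× outlier channel collapses to a few-times spread — small enough for per-token INT8 quantization to hold.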
| Comparison factor | INT8 deployment target | FP8 deployment target |
|---|---|---|
| Core use case | W8A8 on INT8-capable back ends | FP8 inference on Hopper H100 and Ada Lovelace |
| Weight format | INT8 | FP8 |
| Activation format | INT8 | FP8 |
| LMDeploy support | Yes | Yes |
| Typical hardware fit | Ampere and later INT8 tensor core paths | H100 and Ada Lovelace FP8 tensor cores |
LMDeploy documents SmoothQuant as its dedicated INT8/FP8 inference path, separate from the AWQ/GPTQ INT4 route. The SmoothQuant paper validates W8A8 results on LLaMA-family models using WikiText-2 at sequence length 512, with per-token activation quantization and α=0.8 producing lossless perplexity relative to FP16 baselines.
Why SmoothQuant is the sensible option for activation quantization
The practical reason SmoothQuant exists is that pure weight-only INT4 does not help you on backends optimized for INT8 tensor core throughput. On Ampere and later hardware, INT8 GEMM throughput can significantly exceed FP16 GEMM throughput when both operands are quantized — but only if activations are also quantized. W4A16 methods like GPTQ and AWQ leave activations at FP16, so they do not unlock INT8 compute throughput. SmoothQuant does.
Pro Tip: SmoothQuant is purpose-built for W8A8. It is not a weight-only INT4 method, so if your serving objective is memory footprint reduction alone, GPTQ or AWQ will typically deliver better accuracy at INT4. If your objective is maximizing throughput on INT8-capable hardware — or accessing FP8 tensor cores on Hopper H100 — SmoothQuant is the method aligned to that hardware path. Framing it as "just another quantization option" misses why it was designed.
The smoothing factor α controls how much quantization difficulty is shifted from activations to weights. Higher α values shift more burden to weights; the paper's α=0.8 for LLaMA represents a tuned balance. In practice, different model families require different α values, which adds one calibration knob absent from the W4A16 methods.
GPU support and backend compatibility for SmoothQuant
SmoothQuant's INT8 path covers a broad set of NVIDIA GPU families in LMDeploy, while its FP8 path requires newer hardware.
| GPU family | Compute cap. | INT8 (SmoothQuant) | FP8 (SmoothQuant) | INT4 (AWQ/GPTQ) |
|---|---|---|---|---|
| V100 | sm70 | ✓ (LMDeploy) | ✗ | ✓ (LMDeploy) |
| Turing / T4 | sm75 | ✓ (LMDeploy) | ✗ | ✓ (LMDeploy) |
| Ampere (A100, A10, A30, A16, 30-series) | sm80/sm86 | ✓ (LMDeploy) | ✗ | ✓ (LMDeploy) |
| Ada Lovelace | sm89 | ✓ (LMDeploy) | ✓ (LMDeploy) | ✓ (LMDeploy) |
| Hopper H100 | sm90 | ✓ (LMDeploy) | ✓ (LMDeploy) | Not documented in LMDeploy w4a16 path |
TensorRT-LLM documents support for SmoothQuant alongside INT4-AWQ and INT4-GPTQ examples. FP8 support is specific to H100 and Ada Lovelace hardware; do not assume FP8 kernel availability on earlier GPU families even if the engine version claims FP8 support — verify against your specific engine build and GPU compute capability.
Benchmarks that separate quality from marketing
No single paper provides a controlled cross-method comparison of GPTQ, AWQ, and SmoothQuant on the same models with the same serving stack and the same hardware. The primary sources — the GPTQ paper, AWQ paper, SmoothQuant paper, and LMDeploy/TensorRT-LLM docs — each describe their own experimental setup. Treating their numbers as a unified leaderboard produces false conclusions.
| Method | Key accuracy result (per paper) | Calibration setup | Serving throughput data |
|---|---|---|---|
| GPTQ | Minimal perplexity increase; OPT-175B/BLOOM-176B quantized in ~4 GPU-hours | Layer-wise Hessian, wikitext2 | Not benchmarked in paper vs. AWQ/SmoothQuant |
| AWQ | Protects ~1% salient weights; group-128 INT4 representative setup | 128 samples, wikitext2, 2048 seqlen | Not directly compared to GPTQ in primary source |
| SmoothQuant | Lossless W8A8 on LLaMA/WikiText-2, seqlen 512, α=0.8 | Per-token activation quant, per-channel weight scale | Not normalized to INT4 methods; different bit regime |
The LMDeploy and TensorRT-LLM documentation provides deployment facts — which GPU families support which format, what calibration knobs exist — rather than model-by-model accuracy tables. That is the honest state of the primary documentation.
Calibration cost and re-quantization knobs
AWQ's calibration defaults in LMDeploy are concrete and fast: 128 samples from wikitext2 at sequence length 2048, group size 128, batch size 1. GPTQ's paper reports approximately four GPU-hours to quantize a 175B-scale model, though that figure covers the full quantization computation, not just calibration data loading.
| Parameter | AWQ (LMDeploy defaults) | GPTQ (paper reference) | SmoothQuant |
|---|---|---|---|
| Calibration samples | 128 | Not separately specified; full quantization ~4 GPU-hours for 175B | Per-channel scale search |
| Sequence length | 2048 (adjustable) | — | 512 (paper); deployment-specific |
| Batch size | 1 (OOM fallback: reduce seqlen, keep batch=1) | — | — |
| Accuracy fallback | search-scale re-quantization pass | Re-run with different calibration set | Adjust α per model family |
| Memory relief | Reduce --calib-seqlen; increase --calib-samples | — | — |
No verified direct timing comparison between AWQ and GPTQ calibration on identical hardware was retrieved from the primary sources. The inference should be directional: AWQ's 128-sample search is structurally lighter than GPTQ's Hessian inversion across all layers of a large model.
Kernel and GPU compatibility matrix
In LMDeploy and TensorRT-LLM, GPTQ and AWQ share the same INT4 GPU coverage. SmoothQuant extends INT8 support across the full documented range and adds FP8 on H100 and Ada Lovelace.
| GPU | AWQ INT4 | GPTQ INT4 | SmoothQuant INT8 | SmoothQuant FP8 |
|---|---|---|---|---|
| V100 (sm70) | ✓ | ✓ | ✓ | ✗ |
| Turing / T4 (sm75) | ✓ | ✓ | ✓ | ✗ |
| Ampere A100/A10/A30 (sm80/86) | ✓ | ✓ | ✓ | ✗ |
| Ada Lovelace (sm89) | ✓ | ✓ | ✓ | ✓ |
| Hopper H100 (sm90) | Not documented in LMDeploy w4a16 | Not documented in LMDeploy w4a16 | ✓ | ✓ |
Source: LMDeploy AWQ/GPTQ docs and LMDeploy SmoothQuant docs. TensorRT-LLM provides INT4-AWQ, INT4-GPTQ, and SmoothQuant examples; verify kernel availability against your specific TensorRT-LLM release version and target compute capability.
Decision matrix for choosing the right quantization path
For a concrete workload such as Llama 70B serving, the right method is conditional on three variables: your GPU hardware, whether you need activation quantization, and how often you re-quantize. There is no universal winner across the three methods; the choice follows from constraints, not leaderboard numbers.
| Workload / constraint | Recommended method | Reasoning |
|---|---|---|
| Llama 70B on A100 40GB, memory-limited | AWQ or GPTQ INT4 | Fits ~35GB quantized; leaves headroom for KV cache |
| Llama 70B on H100, throughput-first | SmoothQuant FP8 | FP8 tensor cores on H100; both W and A quantized |
| Frequent re-quantization after fine-tuning | AWQ | Faster calibration; single-command workflow |
| Accuracy-first, calibration time available | GPTQ | Hessian-based rounding; validated at 175B scale |
| V100 or T4 fleet, INT4 | AWQ or GPTQ | Both documented on sm70/sm75 in LMDeploy |
| Ada Lovelace, FP8 target | SmoothQuant | FP8 supported on sm89 per LMDeploy docs |
| Backend mandates W8A8 (e.g., INT8 GEMM) | SmoothQuant | Only method targeting activation quantization |
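The memory rows in the matrix above follow from back-of-envelope arithmetic; a throwaway helper (illustrative, decimal GB, weights only) makes the regimes explicit:

```python
def weight_gb(n_params_billion: float, bits: float) -> float:
    """Approximate weight-only footprint in decimal GB: params x bytes/param.
    Ignores quantization scales/zero-points, KV cache, and engine overhead."""
    return n_params_billion * bits / 8

# Llama-70B-class model across the regimes in the decision matrix:
assert weight_gb(70, 16) == 140.0  # FP16: does not fit a single 80 GB GPU
assert weight_gb(70, 8) == 70.0    # W8A8: fits an 80 GB H100 with headroom
assert weight_gb(70, 4) == 35.0    # W4A16: fits a 40 GB A100, KV cache tight
```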
Choose GPTQ when…
GPTQ is the right choice when accuracy preservation under INT4 is non-negotiable and you can afford the calibration compute. The method's Hessian-guided weight rounding produces minimal perplexity increase even at 175B scale, which matters for production deployments where quality regressions are caught by downstream evals rather than just perplexity. GPTQ is documented across TensorRT-LLM and LMDeploy TurboMind, so kernel coverage is not a limiting factor on the GPU families both backends support.
Choose GPTQ when: your team runs perplexity and task evals after every quantization run; calibration time (hours on a single GPU for 70B-scale models) fits your workflow; and you need the highest accuracy available at INT4 weight-only precision for hard prompts or coding tasks.
Choose AWQ when…
AWQ wins on iteration speed and GPU breadth. Its 128-sample wikitext2 calibration at sequence length 2048 completes fast enough to re-quantize after fine-tuning runs without blocking GPU capacity. LMDeploy documents AWQ INT4 inference from V100 through Ada Lovelace, making it the practical default for teams whose GPU fleet spans multiple generations.
Choose AWQ when: you need to quantize frequently or quickly; you are fitting a model onto a smaller GPU (A10, T4, 30-series consumer) where INT4 is the only option that fits; and your backend — LMDeploy TurboMind or TensorRT-LLM — has a verified AWQ kernel for your compute capability.
Choose SmoothQuant when…
SmoothQuant is the only sensible option when your deployment requires activation quantization. W4A16 methods like GPTQ and AWQ cannot unlock INT8 or FP8 compute throughput because activations remain at FP16. If you are targeting Hopper H100 FP8 tensor cores, running an INT8-optimized backend, or serving at batch sizes where compute rather than memory bandwidth is the bottleneck, SmoothQuant is structurally the correct method.
Choose SmoothQuant when: your hardware is H100 or Ada Lovelace and FP8 utilization matters; your backend requires W8A8 rather than W4A16; or your model's activation distribution responds well to the smoothing factor migration (validated for LLaMA-family models in the paper).
FAQ
Pro Tip: Before finalizing your deployment choice, verify kernel support and calibration defaults against the latest release of your chosen backend — LMDeploy w4a16 docs, LMDeploy w8a8 docs, or the TensorRT-LLM quantization docs. Kernel support evolves with engine releases; a GPU family listed in docs from six months ago may have additional formats in the current release.
Is AWQ faster than GPTQ for serving?
Faster calibration does not imply faster inference. AWQ's ~128-sample calibration is structurally lighter than GPTQ's layer-wise Hessian computation, but serving throughput depends on the kernel implementation in your serving engine, not the calibration algorithm. GPTQ's paper reports roughly four GPU-hours for a 175B-scale quantization run — a calibration-time figure, not a serving-time throughput number. No authoritative cross-stack serving throughput benchmark comparing AWQ and GPTQ was available from the primary sources at the time of publication. Both formats produce W4A16 weights; on the same GPU with the same kernel, throughput differences come from weight layout and kernel implementation, not algorithm labels. See the Kernel and GPU Compatibility Matrix for the GPU families where both are supported.
Do GPTQ and AWQ need the same kernel support?
In LMDeploy TurboMind, GPTQ and AWQ share the same documented INT4 GPU coverage: V100 (sm70), Turing/T4 (sm75), Ampere (sm80/sm86), and Ada Lovelace (sm89). In TensorRT-LLM, both appear as documented INT4 weight-only quantization paths. However, the kernel implementations differ — AWQ uses per-channel scaling; GPTQ uses Hessian-guided rounding — so their internal kernel code paths are distinct even when the supported GPU list overlaps. An engine build that supports INT4-AWQ does not automatically support INT4-GPTQ and vice versa; check your specific engine version. See the Kernel and GPU Compatibility Matrix.
Is SmoothQuant only for INT8?
No. SmoothQuant covers both INT8 (W8A8) and FP8 deployment targets. Its core mechanism — migrating activation quantization difficulty to weights via per-channel scaling — applies to both precision levels. LMDeploy documents SmoothQuant as its INT8/FP8 inference path, with FP8 support on Hopper H100 and Ada Lovelace. The SmoothQuant paper's primary results use W8A8 with per-token INT8 activation quantization on LLaMA models. SmoothQuant is not an INT4 weight-only method; if your target is W4A16 compression, use GPTQ or AWQ.
Sources and references
- LMDeploy AWQ/GPTQ Quantization Documentation (w4a16) — Primary source for calibration defaults, GPU family support, and TurboMind INT4 inference configuration
- LMDeploy SmoothQuant Documentation (w8a8) — Primary source for INT8/FP8 deployment targets, Hopper H100 FP8 support, and GPU family coverage
- LMDeploy CLI Reference — Calibration sequence length defaults and memory-constrained batch size guidance
- TensorRT-LLM Quantization Blog — Backend support for INT4-AWQ, INT4-GPTQ, and SmoothQuant examples
- GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers (arXiv 2210.17323) — Original GPTQ paper; source for Hessian-based quantization method, ~4 GPU-hour calibration result, perplexity claims on OPT-175B and BLOOM-176B
- AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration (arXiv 2306.00978) — Original AWQ paper; source for salient-weight protection mechanism, hardware-friendly INT4 layout, group-size-128 representative setup
- SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models (arXiv 2211.10438) — Original SmoothQuant paper; source for W8A8 activation migration mechanism, α=0.8 for LLaMA, WikiText-2 sequence-512 results
Keywords
GPTQ, AWQ, SmoothQuant, TensorRT-LLM, LMDeploy TurboMind, INT4, INT8, FP8, Wikitext2, Marlin kernel, NVIDIA V100, NVIDIA T4, NVIDIA A100, NVIDIA H100, Ada Lovelace GPUs