GPTQ vs AWQ vs SmoothQuant for LLM serving: which quantization method should you choose?

GPTQ is strongest for high-accuracy weight-only INT4, AWQ is typically faster to calibrate and often competitive on quality, and SmoothQuant is the method purpose-built for W8A8 — but the best choice hinges on whether you need weight-only compression, activation quantization, or the broadest kernel support.

How we compared GPTQ, AWQ, and SmoothQuant

The decision between GPTQ, AWQ, and SmoothQuant is not a quality horse race — it is a constraints problem. Each method targets a different quantization regime, and the right choice follows directly from your bit-width requirement, your GPU fleet, and your calibration budget. LMDeploy TurboMind supports INT4 inference from both AWQ and GPTQ, while its SmoothQuant path covers INT8 and FP8. TensorRT-LLM similarly surfaces all three methods, with documented examples for INT4-AWQ, INT4-GPTQ, and SmoothQuant.

| Dimension | GPTQ | AWQ | SmoothQuant |
|---|---|---|---|
| Weight bits | INT4 (W4A16) | INT4 (W4A16) | INT8 / FP8 (W8A8) |
| Activation bits | FP16 | FP16 | INT8 or FP8 |
| Calibration cost | ~4 GPU-hours for 175B models | Fast (128 samples, wikitext2) | Moderate (per-channel scale search) |
| Accuracy mechanism | Layer-wise Hessian reweighting | Salient-weight protection via scaling | Activation migration via smoothing factor α |
| Primary back-end support (LMDeploy) | TurboMind INT4 | TurboMind INT4 | INT8/FP8 path |
| FP8 capable | No | No | Yes (H100, Ada Lovelace) |

Comparison criteria that matter in production

Search results for this topic are scattered across paper summaries, framework docs, and serving-framework comparisons that rarely put accuracy, calibration cost, kernel support, and GPU fit side by side. That fragmentation is the gap this article closes.

Pro Tip: Evaluate these four criteria in order before choosing a method: (1) perplexity impact — does the quantized model pass your downstream eval threshold? (2) calibration time — can you afford a multi-GPU-hour calibration run per model revision? (3) kernel availability — does your serving engine (TurboMind, TensorRT-LLM, vLLM) ship a kernel for that format on your GPU? (4) GPU fit — INT4 weight-only is memory-bound; INT8/FP8 is compute-bound. Match the method to the bottleneck.

GPTQ and AWQ both land in the W4A16 regime — weights compressed to INT4, activations kept at FP16. That shared regime means they compete head-to-head for memory-bandwidth-bound deployments. SmoothQuant is orthogonal: it compresses both weights and activations to INT8 or FP8, making it relevant for compute-bound workloads on modern accelerators.

Test conditions and deployment assumptions

Numbers from these three methods are not portable across serving stacks. A GPTQ perplexity figure measured under AutoGPTQ with a Marlin kernel does not translate directly to a TurboMind or TensorRT-LLM deployment without re-validating on the target engine.

Watch Out: Three variables invalidate cross-stack comparisons more often than algorithm differences: (1) Model size vs. available VRAM — a 70B model at INT4 still requires ~35 GB; fitting it on a 40 GB A100 leaves almost no room for KV cache at larger batch sizes. (2) Batch size during calibration — LMDeploy's docs advise setting --batch-size 1 under memory pressure; a higher batch size changes calibration statistics. (3) Kernel version — TensorRT-LLM ships kernel builds tied to specific CUDA and compute-capability targets; an INT4-AWQ kernel on sm86 is a different code path than the same-named method on sm75. Always pin your engine version when comparing.
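As a rough sanity check on point (1), the W4A16 weight footprint can be estimated with back-of-the-envelope arithmetic. The sketch below is a hypothetical helper (decimal GB, ignoring KV cache, activations, and engine workspace) that shows why a 70B model sits near the 40 GB ceiling of an A100:

```python
def w4a16_weight_gb(n_params: float, group_size: int = 128, scale_bytes: int = 2) -> float:
    """Rough weight-only INT4 footprint in decimal GB: 0.5 byte per parameter
    plus one FP16 scale per quantization group. KV cache, activations, and
    engine workspace come on top of this."""
    weight_bytes = n_params * 0.5
    scale_overhead = (n_params / group_size) * scale_bytes
    return (weight_bytes + scale_overhead) / 1e9

print(f"{w4a16_weight_gb(70e9):.1f} GB")  # ~36 GB of weights alone for a 70B model
```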

GPTQ: when weight-only INT4 is the right call

GPTQ performs weight-only INT4 quantization using a layer-wise second-order approximation: it minimizes the reconstruction error for each layer's output by reweighting quantization error with the inverse Hessian of the layer inputs. The practical consequence is high accuracy preservation without touching activations, which run in FP16 throughout inference.
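In symbols, the per-layer problem GPTQ solves can be written as follows; this is a standard formulation of the layer-wise reconstruction objective, with notation chosen for clarity rather than copied from the paper:

```latex
% Per-layer reconstruction objective over calibration inputs X:
\arg\min_{\widehat{W}} \;\bigl\| W X - \widehat{W} X \bigr\|_2^2 ,
\qquad H = 2 X X^{\top}
% H is the Hessian of this quadratic objective; GPTQ rounds weights one
% column at a time and uses H^{-1} to spread each column's rounding error
% onto the weights that have not been quantized yet.
```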

| Comparison factor | GPTQ | Why it matters |
|---|---|---|
| Calibration style | Layer-wise Hessian-guided weight rounding | Preserves sensitive weight directions without quantizing activations |
| Accuracy behavior | Minimal perplexity increase on tested 175B-scale models | Strong option when task evals punish small quality regressions |
| Runtime implications | On-the-fly INT4 → FP16 dequantization before GEMM | Keeps serving in W4A16 and fits memory-bandwidth-bound deployments |
| Calibration corpus | Small dataset such as wikitext2 or C4 | Keeps preparation lightweight relative to full fine-tuning |
| Supported back ends | TurboMind and TensorRT-LLM | Kernel coverage is available on documented GPU families |

Both LMDeploy TurboMind and TensorRT-LLM support GPTQ INT4 inference. The algorithm's accuracy behavior comes from its Hessian reweighting: salient weights that drive high output variance are rounded more carefully than weights with low sensitivity. This differs from AWQ's strategy of protecting only a small fixed fraction of salient weights via per-channel scaling.

Strengths for accuracy-sensitive serving

GPTQ's perplexity preservation is its defining characteristic. The paper reports quantizing OPT-175B and BLOOM-176B in approximately four GPU-hours with minimal perplexity increase on standard language modeling benchmarks. That makes it the method to reach for when accuracy-versus-bitwidth is the primary optimization axis and calibration time is not a bottleneck.

| Benchmark scenario | GPTQ tendency | Quality signal |
|---|---|---|
| Large models (70B+), single-pass calibration | Holds perplexity well | Paper-validated on 175B-scale models |
| Coding-heavy workloads | Competitive | Hessian weighting protects high-variance weight groups |
| Hard prompts with sharp token transitions | Often stable | Better fit than uniform scaling when a few channels dominate |
| Long-context inference | Risk of degradation | Context-length-matched evals still required |
| Models with activation outliers | Less sensitive than W8A8 methods | Activations run at FP16 |

The Hessian-based approach means GPTQ tends to handle weight distributions with high inter-layer variance better than methods that apply uniform per-group scaling. For coding and instruction-following models where certain weight channels carry disproportionate signal, this matters.

Where GPTQ can disappoint

GPTQ's accuracy claims are calibration-set and kernel-dependent. The paper's minimal-perplexity-increase result holds under its tested conditions; it does not extend unconditionally to all serving configurations.

Watch Out: Three failure modes are common in production GPTQ deployments: (1) Calibration sensitivity — the calibration corpus and sequence length affect quantization quality; using a dataset mismatched to your actual prompt distribution can increase perplexity on your eval set beyond acceptable thresholds. (2) Kernel dependence — TensorRT-LLM and LMDeploy ship different INT4 kernels for different compute capabilities; switching engines without re-validating accuracy can mask quality regressions. (3) Long-context degradation — W4A16 methods dequantize weights at inference time, and accumulated rounding error can compound over longer sequences; this is a deployment risk that requires context-length-matched evals, not just short-sequence perplexity.

AWQ: when faster calibration and broad INT4 serving support win

AWQ (Activation-aware Weight Quantization) reaches similar W4A16 accuracy to GPTQ through a different mechanism: it identifies the roughly 1% of weight channels most salient to output quality — determined by activation magnitude — and protects them from aggressive rounding via per-channel scaling, rather than computing full Hessian inverses. The result is a faster calibration procedure with competitive accuracy.
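The sketch below illustrates that idea in toy form. It is not the reference AutoAWQ implementation (which searches the scaling exponent per layer and fuses the inverse scales into preceding operations); all function names are hypothetical, and the INT4 step is a simulated round-trip:

```python
import numpy as np

def awq_style_scales(act_stats: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Per-input-channel scales in the spirit of AWQ: channels with larger
    activation magnitudes get larger scales, which protects them from INT4
    rounding error once the scale is folded into the weights."""
    s = np.power(act_stats + 1e-8, alpha)
    return s / np.sqrt(s.max() * s.min())   # keep scales centered around 1

def fake_int4_roundtrip(w: np.ndarray, group_size: int = 128) -> np.ndarray:
    """Simulated symmetric INT4 quantize/dequantize per group of input channels."""
    out = np.empty_like(w)
    for g in range(0, w.shape[1], group_size):
        block = w[:, g:g + group_size]
        scale = np.abs(block).max(axis=1, keepdims=True) / 7.0 + 1e-12
        out[:, g:g + group_size] = np.clip(np.round(block / scale), -8, 7) * scale
    return out

# Usage: scale weights up on salient channels, quantize, divide activations by
# the same scales at runtime (real engines fuse that division elsewhere).
rng = np.random.default_rng(0)
W = rng.normal(size=(256, 512)).astype(np.float32)           # [out, in] weights
act_mag = np.abs(rng.normal(size=512)).astype(np.float32)    # calibration stats
s = awq_style_scales(act_mag)
W_q = fake_int4_roundtrip(W * s)
x = rng.normal(size=512).astype(np.float32)
y = W_q @ (x / s)                                            # W4A16-style matvec
```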

LMDeploy documents concrete defaults for AWQ calibration:

| Comparison factor | AWQ default | Why it matters |
|---|---|---|
| Dataset | wikitext2 | Matches the documented calibration recipe |
| Calibration samples | 128 | Fast enough for iterative re-quantization |
| Sequence length | 2048 | Preserves the documented context window |
| Weight bits (w-bits) | 4 | Confirms the W4A16 deployment target |
| Group size (w-group-size) | 128 | Maps to the representative kernel layout |
| Batch size | 1 | Recommended under memory pressure |

LMDeploy TurboMind supports AWQ INT4 inference on NVIDIA V100 (sm70), Turing/T4 (sm75), Ampere A100/A10/A30/A16 and 30-series (sm80/sm86), and Ada Lovelace (sm89). This breadth of GPU coverage — from V100 to Ada Lovelace — makes AWQ a practical default for teams whose fleet spans multiple GPU generations.

Why AWQ is attractive for quick iteration

AWQ's calibration requires 128 samples at sequence length 2048 with a batch size of 1 — a configuration that completes in minutes on a single GPU rather than hours. That speed advantage directly reduces the cost of iterating: re-quantizing after a fine-tuning run, testing group-size variants, or validating a new model family does not require blocking a multi-GPU node for hours.

Production Note: LMDeploy ships AWQ/GPTQ quantization as a single-command workflow, so you can quantize and export without stitching together a custom calibration script. If accuracy drops after quantization, the documented fallback is to enable search-scale, which runs a scale-search pass to improve per-channel scaling factors. If memory is constrained during calibration, reduce --calib-seqlen, increase --calib-samples to maintain statistical coverage, and keep --batch-size 1; these knobs trade calibration time for memory headroom without discarding the quantized model.

AWQ's per-channel scaling also makes its quantized weights more hardware-friendly: the INT4 layout with group size 128 maps cleanly to tensor core operations on Ampere and Ada Lovelace without requiring the full Hessian computation GPTQ performs.

Known limits and when AWQ is the wrong choice

AWQ is a weight-only INT4 method. It does not quantize activations. If your deployment target requires W8A8 — either because your backend is optimized for INT8 GEMM throughput or because you're targeting Hopper FP8 hardware paths — AWQ is structurally the wrong choice regardless of its calibration convenience.

Watch Out: Two backend-driven limits override AWQ's calibration advantages: (1) Activation quantization requirements: LMDeploy explicitly separates the AWQ/GPTQ INT4 path from the SmoothQuant INT8/FP8 path; if your serving target demands W8A8 (e.g., for TensorRT-LLM INT8 throughput optimizations or Hopper FP8 utilization), AWQ cannot fulfill that requirement. (2) Kernel availability: TensorRT-LLM and LMDeploy ship AWQ kernels for specific GPU families and engine builds; if your GPU or engine version lacks an AWQ kernel, the format label on your model weights does not help. Verify kernel availability against your exact engine release before committing to AWQ.

SmoothQuant: when INT8 or FP8 matters more than weight-only compression

SmoothQuant targets W8A8 quantization — compressing both weights and activations to INT8 — by migrating quantization difficulty from activations to weights. Transformer activations contain large outliers that make naive INT8 quantization lossy; SmoothQuant divides activation magnitudes by a per-channel smoothing factor α and multiplies the corresponding weight channel by the same factor, leaving the mathematical output unchanged while making both tensors easier to quantize at INT8 precision.
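The per-channel migration is usually written as below (this is the commonly cited form from the SmoothQuant paper, with j indexing input channels):

```latex
% Equivalent rewrite of a linear layer Y = X W with per-channel scales s:
Y = \bigl(X \,\operatorname{diag}(s)^{-1}\bigr)\,\bigl(\operatorname{diag}(s)\, W\bigr),
\qquad
s_j = \frac{\max\bigl(\lvert X_j \rvert\bigr)^{\alpha}}{\max\bigl(\lvert W_j \rvert\bigr)^{1-\alpha}}
% Larger alpha migrates more quantization difficulty from activations to
% weights; both factors are then quantized to INT8 (or FP8) separately.
```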

| Comparison factor | INT8 deployment target | FP8 deployment target |
|---|---|---|
| Core use case | W8A8 on INT8-capable back ends | FP8 inference on Hopper H100 and Ada Lovelace |
| Weight format | INT8 | FP8 |
| Activation format | INT8 | FP8 |
| LMDeploy support | Yes | Yes |
| Typical hardware fit | Ampere and later INT8 tensor core paths | H100 and Ada Lovelace FP8 tensor cores |

LMDeploy documents SmoothQuant as its dedicated INT8/FP8 inference path, separate from the AWQ/GPTQ INT4 route. The SmoothQuant paper validates W8A8 results on LLaMA-family models using WikiText-2 at sequence length 512, with per-token activation quantization and α=0.8 producing lossless perplexity relative to FP16 baselines.

Why SmoothQuant is the sensible option for activation quantization

The practical reason SmoothQuant exists is that pure weight-only INT4 does not help you on backends optimized for INT8 tensor core throughput. On Ampere and later hardware, INT8 GEMM throughput can significantly exceed FP16 GEMM throughput when both operands are quantized — but only if activations are also quantized. W4A16 methods like GPTQ and AWQ leave activations at FP16, so they do not unlock INT8 compute throughput. SmoothQuant does.

Pro Tip: SmoothQuant is purpose-built for W8A8. It is not a weight-only INT4 method, so if your serving objective is memory footprint reduction alone, GPTQ or AWQ will typically deliver better accuracy at INT4. If your objective is maximizing throughput on INT8-capable hardware — or accessing FP8 tensor cores on Hopper H100 — SmoothQuant is the method aligned to that hardware path. Framing it as "just another quantization option" misses why it was designed.

The smoothing factor α controls how much quantization difficulty is shifted from activations to weights. Higher α values shift more burden to weights; the paper's α=0.8 for LLaMA represents a tuned balance. In practice, different model families require different α values, which adds one calibration knob absent from the W4A16 methods.

GPU support and backend compatibility for SmoothQuant

SmoothQuant's INT8 path covers a broad set of NVIDIA GPU families in LMDeploy, while its FP8 path requires newer hardware.

| GPU family | Compute cap. | INT8 (SmoothQuant) | FP8 (SmoothQuant) | INT4 (AWQ/GPTQ) |
|---|---|---|---|---|
| V100 | sm70 | ✓ (LMDeploy) | — | ✓ (LMDeploy) |
| Turing / T4 | sm75 | ✓ (LMDeploy) | — | ✓ (LMDeploy) |
| Ampere (A100, A10, A30, A16, 30-series) | sm80/sm86 | ✓ (LMDeploy) | — | ✓ (LMDeploy) |
| Ada Lovelace | sm89 | ✓ (LMDeploy) | ✓ (LMDeploy) | ✓ (LMDeploy) |
| Hopper H100 | sm90 | ✓ (LMDeploy) | ✓ (LMDeploy) | Not documented in LMDeploy w4a16 path |

TensorRT-LLM documents support for SmoothQuant alongside INT4-AWQ and INT4-GPTQ examples. FP8 support is specific to H100 and Ada Lovelace hardware; do not assume FP8 kernel availability on earlier GPU families even if the engine version claims FP8 support — verify against your specific engine build and GPU compute capability.

Benchmarks that separate quality from marketing

No single paper provides a controlled cross-method comparison of GPTQ, AWQ, and SmoothQuant on the same models with the same serving stack and the same hardware. The primary sources — the GPTQ paper, AWQ paper, SmoothQuant paper, and LMDeploy/TensorRT-LLM docs — each describe their own experimental setup. Treating their numbers as a unified leaderboard produces false conclusions.

| Method | Key accuracy result (per paper) | Calibration setup | Serving throughput data |
|---|---|---|---|
| GPTQ | Minimal perplexity increase; OPT-175B/BLOOM-176B quantized in ~4 GPU-hours | Layer-wise Hessian, wikitext2 | Not benchmarked in paper vs. AWQ/SmoothQuant |
| AWQ | Protects ~1% salient weights; group-128 INT4 representative setup | 128 samples, wikitext2, 2048 seqlen | Not directly compared to GPTQ in primary source |
| SmoothQuant | Lossless W8A8 on LLaMA/WikiText-2, seqlen 512, α=0.8 | Per-token activation quant, per-channel weight scale | Not normalized to INT4 methods; different bit regime |

The LMDeploy and TensorRT-LLM documentation provides deployment facts — which GPU families support which format, what calibration knobs exist — rather than model-by-model accuracy tables. That is the honest state of the primary documentation.

Calibration cost and re-quantization knobs

AWQ's calibration defaults in LMDeploy are concrete and fast: 128 samples from wikitext2 at sequence length 2048, group size 128, batch size 1. GPTQ's paper reports approximately four GPU-hours to quantize a 175B-scale model, though that figure covers the full quantization computation, not just calibration data loading.

| Parameter | AWQ (LMDeploy defaults) | GPTQ (paper reference) | SmoothQuant |
|---|---|---|---|
| Calibration samples | 128 | Not separately specified; full quantization ~4 GPU-hours for 175B | Per-channel scale search |
| Sequence length | 2048 (adjustable) | — | 512 (paper); deployment-specific |
| Batch size | 1 (OOM fallback: reduce seqlen, keep batch=1) | — | — |
| Accuracy fallback | search-scale re-quantization pass | Re-run with different calibration set | Adjust α per model family |
| Memory relief | Reduce --calib-seqlen; increase --calib-samples | — | — |

No verified head-to-head timing of AWQ and GPTQ calibration on identical hardware appears in the primary sources, so treat the comparison as directional: AWQ's 128-sample scale search is structurally lighter than GPTQ's Hessian inversion across every layer of a large model.

Kernel and GPU compatibility matrix

In LMDeploy and TensorRT-LLM, GPTQ and AWQ share the same INT4 GPU coverage. SmoothQuant extends INT8 support across the full documented range and adds FP8 on H100 and Ada Lovelace.

| GPU | AWQ INT4 | GPTQ INT4 | SmoothQuant INT8 | SmoothQuant FP8 |
|---|---|---|---|---|
| V100 (sm70) | ✓ | ✓ | ✓ | — |
| Turing / T4 (sm75) | ✓ | ✓ | ✓ | — |
| Ampere A100/A10/A30 (sm80/86) | ✓ | ✓ | ✓ | — |
| Ada Lovelace (sm89) | ✓ | ✓ | ✓ | ✓ |
| Hopper H100 (sm90) | Not documented in LMDeploy w4a16 | Not documented in LMDeploy w4a16 | ✓ | ✓ |

Source: LMDeploy AWQ/GPTQ docs and LMDeploy SmoothQuant docs. TensorRT-LLM provides INT4-AWQ, INT4-GPTQ, and SmoothQuant examples; verify kernel availability against your specific TensorRT-LLM release version and target compute capability.

Decision matrix for choosing the right quantization path

For a Llama 70B serving deployment, the answer is conditional on three variables: your GPU hardware, whether you need activation quantization, and how often you re-quantize. There is no universal winner across all three methods — the right choice follows from constraints, not benchmarks alone.

| Workload / constraint | Recommended method | Reasoning |
|---|---|---|
| Llama 70B on A100 40GB, memory-limited | AWQ or GPTQ INT4 | Fits in ~35 GB quantized; leaves headroom for KV cache |
| Llama 70B on H100, throughput-first | SmoothQuant FP8 | FP8 tensor cores on H100; both W and A quantized |
| Frequent re-quantization after fine-tuning | AWQ | Faster calibration; single-command workflow |
| Accuracy-first, calibration time available | GPTQ | Hessian-based rounding; validated at 175B scale |
| V100 or T4 fleet, INT4 | AWQ or GPTQ | Both documented on sm70/sm75 in LMDeploy |
| Ada Lovelace, FP8 target | SmoothQuant | FP8 supported on sm89 per LMDeploy docs |
| Backend mandates W8A8 (e.g., INT8 GEMM) | SmoothQuant | Only method targeting activation quantization |
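The same constraint-first logic can be captured in a few lines. The helper below is a hedged sketch of the table above, not a policy engine; the function and argument names are invented for illustration, and kernel availability still has to be verified against your engine release:

```python
def pick_quant_method(need_w8a8: bool, fp8_gpu: bool,
                      requant_often: bool, accuracy_first: bool) -> str:
    """Toy encoding of the decision matrix above. Names and thresholds are
    illustrative; always verify kernel support for your exact engine build."""
    if need_w8a8:
        # Backend or hardware path demands activation quantization (W8A8).
        return "SmoothQuant FP8" if fp8_gpu else "SmoothQuant INT8"
    if accuracy_first and not requant_often:
        # Calibration hours are acceptable; maximize accuracy at INT4.
        return "GPTQ INT4 (W4A16)"
    # Default: fast 128-sample calibration and broad INT4 GPU coverage.
    return "AWQ INT4 (W4A16)"


print(pick_quant_method(need_w8a8=False, fp8_gpu=True,
                        requant_often=True, accuracy_first=False))
# Prints: AWQ INT4 (W4A16)
```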

Choose GPTQ when…

GPTQ is the right choice when accuracy preservation under INT4 is non-negotiable and you can afford the calibration compute. The method's Hessian-guided weight rounding produces minimal perplexity increase even at 175B scale, which matters for production deployments where quality regressions are caught by downstream evals rather than just perplexity. GPTQ is documented across TensorRT-LLM and LMDeploy TurboMind, so kernel coverage is not a limiting factor on the GPU families both backends support.

Choose GPTQ when: your team runs perplexity and task evals after every quantization run; calibration time (hours on a single GPU for 70B-scale models) fits your workflow; and you need the highest accuracy available at INT4 weight-only precision for hard prompts or coding tasks.

Choose AWQ when…

AWQ wins on iteration speed and GPU breadth. Its 128-sample wikitext2 calibration at sequence length 2048 completes fast enough to re-quantize after fine-tuning runs without blocking GPU capacity. LMDeploy documents AWQ INT4 inference from V100 through Ada Lovelace, making it the practical default for teams whose GPU fleet spans multiple generations.

Choose AWQ when: you need to quantize frequently or quickly; you are fitting a model onto a smaller GPU (A10, T4, 30-series consumer) where INT4 is the only option that fits; and your backend — LMDeploy TurboMind or TensorRT-LLM — has a verified AWQ kernel for your compute capability.

Choose SmoothQuant when…

SmoothQuant is the only sensible option when your deployment requires activation quantization. W4A16 methods like GPTQ and AWQ cannot unlock INT8 or FP8 compute throughput because activations remain at FP16. If you are targeting Hopper H100 FP8 tensor cores, running an INT8-optimized backend, or serving at batch sizes where compute rather than memory bandwidth is the bottleneck, SmoothQuant is structurally the correct method.

Choose SmoothQuant when: your hardware is H100 or Ada Lovelace and FP8 utilization matters; your backend requires W8A8 rather than W4A16; or your model's activation distribution responds well to the smoothing factor migration (validated for LLaMA-family models in the paper).

FAQ

Pro Tip: Before finalizing your deployment choice, verify kernel support and calibration defaults against the latest release of your chosen backend — LMDeploy w4a16 docs, LMDeploy w8a8 docs, or the TensorRT-LLM quantization docs. Kernel support evolves with engine releases; a GPU family listed in docs from six months ago may have additional formats in the current release.

Is AWQ faster than GPTQ for serving?

Faster calibration does not imply faster inference. AWQ's ~128-sample calibration is structurally lighter than GPTQ's layer-wise Hessian computation, but serving throughput depends on the kernel implementation in your serving engine, not the calibration algorithm. GPTQ's paper reports roughly four GPU-hours for a 175B-scale quantization run — a calibration-time figure, not a serving-time throughput number. No authoritative cross-stack serving throughput benchmark comparing AWQ and GPTQ was available from the primary sources at the time of publication. Both formats produce W4A16 weights; on the same GPU with the same kernel, throughput differences come from weight layout and kernel implementation, not algorithm labels. See the Kernel and GPU Compatibility Matrix for the GPU families where both are supported.

Do GPTQ and AWQ need the same kernel support?

In LMDeploy TurboMind, GPTQ and AWQ share the same documented INT4 GPU coverage: V100 (sm70), Turing/T4 (sm75), Ampere (sm80/sm86), and Ada Lovelace (sm89). In TensorRT-LLM, both appear as documented INT4 weight-only quantization paths. However, the kernel implementations differ — AWQ uses per-channel scaling; GPTQ uses Hessian-guided rounding — so their internal kernel code paths are distinct even when the supported GPU list overlaps. An engine build that supports INT4-AWQ does not automatically support INT4-GPTQ and vice versa; check your specific engine version. See the Kernel and GPU Compatibility Matrix.

Is SmoothQuant only for INT8?

No. SmoothQuant covers both INT8 (W8A8) and FP8 deployment targets. Its core mechanism — migrating activation quantization difficulty to weights via per-channel scaling — applies to both precision levels. LMDeploy documents SmoothQuant as its INT8/FP8 inference path, with FP8 support on Hopper H100 and Ada Lovelace. The SmoothQuant paper's primary results use W8A8 with per-token INT8 activation quantization on LLaMA models. SmoothQuant is not an INT4 weight-only method; if your target is W4A16 compression, use GPTQ or AWQ.
