
SparseGPT vs Wanda vs structured pruning: what actually preserves LLM quality under compression

SparseGPT and Wanda usually preserve perplexity better than structured pruning at the same sparsity, but structured pruning is the only one that reliably maps to hardware speedups without specialized kernels — so the real decision is quality retention vs deployable acceleration, not sparsity percentage alone.


How we compared SparseGPT, Wanda, and structured pruning

All three pruning strategies operate on the same starting point — a pretrained LLM — but they differ in scoring criterion, sparsity pattern, and whether the result can be accelerated on real hardware. The comparison below uses LLaMA-2-7B as the base model, the standard 128-sample C4 calibration set, and WikiText-2 perplexity as the evaluation metric, matching the evaluation protocol reported in the Wanda repository.

Pruning method comparison — setup and scoring

| Method | Scoring criterion | Sparsity pattern | Calibration required | Implementation |
|---|---|---|---|---|
| Magnitude pruning | Weight magnitude only | Unstructured | No | Baseline |
| SparseGPT | Second-order Hessian information | Unstructured or 2:4 | Yes (~128 samples) | locuslab/wanda repo |
| Wanda | Weight magnitude × input activation norm | Unstructured or 2:4 | Yes (~128 samples) | locuslab/wanda repo |
| Structured pruning | Magnitude or gradient-based, per channel/head | Channel/head/layer | Yes | Various |

The critical design difference: magnitude pruning ignores activation statistics entirely and produces the worst perplexity at any non-trivial sparsity. Wanda's own README makes the distinction explicit — it "removes weights on a per-output basis, by the product of weight magnitudes and input activation norms" — which means a weight with small magnitude but large activation contribution survives, where magnitude pruning would have dropped it. SparseGPT takes a more expensive route, using approximate second-order information to update remaining weights after each pruning step, compensating for the error introduced by each removed parameter.

Both SparseGPT and Wanda operate within the same codebase in the Wanda repository, making it possible to run genuinely controlled comparisons where the only variable is the scoring function.


At a glance: quality retention versus real speedup

Perplexity after pruning and wall-clock inference latency are separate dimensions — and most published comparisons collapse them into one misleading claim.

Bottom Line: Unstructured sparsity (SparseGPT or Wanda at 50%) preserves LLaMA-2-7B perplexity far better than any structured or 2:4 pattern at the same compression ratio. However, unstructured sparsity produces no reliable latency reduction on standard CUDA kernels. Structured and 2:4 semi-structured sparsity sacrifice perplexity to unlock hardware-accelerated kernels, available on NVIDIA A100/H100 GPUs via TensorRT-LLM or PyTorch 2.1's semi-structured sparsity support. The choice is not about which method is "better"; it is about which axis your deployment constraint is on — quality budget or latency budget.

Concretely: an unstructured sparse model at 50% sparsity may be smaller on disk but serves tokens at the same speed as the dense model unless the inference stack includes custom sparse matrix multiply kernels. A 2:4 model loses more perplexity but can use NVIDIA Sparse Tensor Cores natively, translating compression into actual throughput improvement.

The same framing applies when you move from these benchmarked LLaMA-2-7B runs to larger checkpoints such as OPT-175B or BLOOM-176B: the scoring rule still decides which weights survive, but the deployment payoff still depends on the kernel path, not the pruning name.


SparseGPT: strongest unstructured baseline for minimal perplexity loss

SparseGPT produces the better 2:4 semi-structured quality of the two activation-aware methods, and its unstructured results are competitive with Wanda's. The mechanism is computationally heavier: SparseGPT computes an approximate inverse Hessian for each weight matrix, then iteratively prunes weights and adjusts the remaining weights to minimize the reconstruction error introduced by each removal. This means the model's activations are corrected as pruning proceeds — a significant advantage over one-shot scoring methods.

SparseGPT characteristics at a glance

| Property | Value |
|---|---|
| Scoring | Approximate second-order (Hessian-based) |
| Calibration samples | ~128 (C4 subset) |
| Unstructured 50% PPL (LLaMA-2-7B) | Repository-reported; verify against the current Wanda repo README before use |
| 2:4 semi-structured PPL (LLaMA-2-7B) | Repository-reported; verify against the current Wanda repo README before use |
| Pruning time on 7B model | Minutes (GPU-dependent) |
| Post-pruning weight update | Yes (compensates removed weights) |

The post-pruning weight adjustment is SparseGPT's defining property: it absorbs the quantization-like error from each pruned weight into neighboring weights, which is why it consistently outperforms Wanda on 2:4 sparsity patterns where the constraint on which weights can survive is tighter. At 50% unstructured sparsity, the freedom to place zeros anywhere lets Wanda close the gap.
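The compensation step can be illustrated with the classic optimal-brain-surgeon update on a single output row. The sizes, damping constant, and choice of which weight to prune below are toy assumptions for illustration, not SparseGPT's actual implementation; the point is that the compensated row always reconstructs the layer output at least as well as simply zeroing the weight.

```python
import numpy as np

# Toy OBS-style compensation (the mechanism behind SparseGPT's weight update),
# applied to a single output row of a linear layer. All names and sizes are
# illustrative assumptions.
rng = np.random.default_rng(0)
d, n = 8, 64
X = rng.normal(size=(d, n))        # calibration inputs: d features x n samples
w = rng.normal(size=d)             # one output row of the layer's weight matrix

H = X @ X.T + 1e-3 * np.eye(d)     # damped Hessian of the reconstruction loss
Hinv = np.linalg.inv(H)
q = int(np.argmin(np.abs(w)))      # index to prune (smallest magnitude, toy choice)

# Zero w[q] and shift the introduced error into the remaining weights:
w_comp = w - (w[q] / Hinv[q, q]) * Hinv[:, q]
w_comp[q] = 0.0

w_masked = w.copy()                # magnitude-style baseline: zero it, no update
w_masked[q] = 0.0

err_comp = np.linalg.norm((w_comp - w) @ X)     # output reconstruction error
err_masked = np.linalg.norm((w_masked - w) @ X)
```

Because the masked row is one feasible solution of the constrained problem the update solves exactly, `err_comp` can never exceed `err_masked`; SparseGPT applies this idea column-block by column-block across full weight matrices.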

An important constraint: SparseGPT's advantages do not transfer to deployment speed without a runtime that understands the sparsity pattern. Producing an unstructured sparse weight tensor from SparseGPT and running it through a standard PyTorch nn.Linear delivers no latency reduction — the sparse tensor is stored and multiplied at full cost.
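A quick way to see this constraint: zero half the weights of a standard nn.Linear and observe that the storage footprint and the matmul path are unchanged. The layer size and magnitude-threshold pruning below are arbitrary illustrations, not a recommended pruning recipe.

```python
import torch
import torch.nn as nn

# Illustration: 50% unstructured zeros in a standard nn.Linear change nothing
# about how the tensor is stored or multiplied. Sizes are arbitrary.
layer = nn.Linear(1024, 1024, bias=False)
with torch.no_grad():
    flat = layer.weight.abs().flatten()
    thresh = flat.kthvalue(flat.numel() // 2).values   # 50% magnitude cutoff
    layer.weight.mul_((layer.weight.abs() > thresh).float())

sparsity = (layer.weight == 0).float().mean().item()
# The zeros still occupy full fp32 storage, and the forward pass below still
# dispatches to a dense GEMM kernel.
dense_bytes = layer.weight.numel() * layer.weight.element_size()
y = layer(torch.randn(2, 1024))
```

`dense_bytes` is identical before and after pruning; nothing about `nn.Linear` exploits the zeros.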


Wanda: activation-aware pruning with competitive unstructured quality

Wanda's scoring criterion is simpler than SparseGPT's but outperforms it on unstructured sparsity. The key insight: low-magnitude weights are not uniformly unimportant. A weight connected to an input feature with a large activation norm contributes more to the output than a high-magnitude weight multiplied by near-zero activations. Wanda captures this with a single forward pass to collect input activation norms, then prunes by the product of weight magnitude and activation norm, per output channel.

"Compared to magnitude pruning which removes weights solely based on their magnitudes, our pruning approach Wanda removes weights on a per-output basis, by the product of weight magnitudes and input activation norms." — locuslab/wanda README

Wanda characteristics at a glance

| Property | Value |
|---|---|
| Scoring | Weight magnitude × activation norm |
| Calibration samples | ~128 (C4 subset) |
| Unstructured 50% PPL (LLaMA-2-7B) | Repository-reported; verify against the current Wanda repo README before use |
| 2:4 semi-structured PPL (LLaMA-2-7B) | Repository-reported; verify against the current Wanda repo README before use |
| Post-pruning weight update | No |
| Relative calibration cost | Lower than SparseGPT |

Wanda's unstructured perplexity on LLaMA-2-7B edges out SparseGPT at the same sparsity level — a meaningful difference considering neither requires retraining. The trade-off appears in the 2:4 column: without weight correction after pruning, Wanda falls behind SparseGPT when the pruning pattern is constrained to the strict 2-out-of-4 layout required by NVIDIA Sparse Tensor Cores.

Both figures are reported in the Wanda repository using the same base model and calibration set. Treat them as repository-reported benchmarks and revalidate against the current README or linked paper before using them in production procurement decisions.


Structured pruning: why hardware-friendly sparsity can cost quality

Structured pruning removes entire channels, attention heads, or layers rather than individual weights. The result is a genuinely smaller dense model — no sparse storage format required, no special kernels needed, and any GEMM-optimized runtime accelerates it automatically. That hardware compatibility comes at a steep quality cost relative to unstructured methods at equivalent compression ratios.

The mechanism explains the gap. When SparseGPT or Wanda removes 50% of weights, they choose the least important individual weights across the entire matrix. Structured pruning must remove entire rows or columns — it cannot fine-tune which individual weights within a channel survive. The constraint is geometric, and it forces the removal of some moderately important weights to maintain the structure.

Sparsity pattern comparison: quality, storage, and acceleration

| Sparsity type | Pattern constraint | Kernel support | Storage savings | Real latency reduction | Quality cost |
|---|---|---|---|---|---|
| Unstructured | None | None (standard GEMM) | Yes | None without sparse kernels | Lowest |
| 2:4 semi-structured | 2 non-zeros per 4 consecutive elements | NVIDIA Sparse Tensor Cores (A100/H100) | ~2× weight storage | Yes, on supported GPUs | Moderate |
| Structured (channel/head) | Full rows/columns removed | Any GEMM runtime | Yes (smaller dense matrix) | Yes, universally | Highest |

The 2:4 pattern sits between unstructured and fully structured pruning. NVIDIA's Sparse Tensor Core architecture imposes a strict layout (at most two non-zeros in every four consecutive elements) that general-purpose sparse matrix formats cannot express; the hardware requires the metadata encoding specified for Ampere/Hopper Sparse Tensor Cores. TensorRT-LLM exposes this through compatible sparse matmul kernels; without TensorRT-LLM or an equivalent runtime with explicit 2:4 kernel support, a 2:4-pruned model runs through standard dense GEMM and delivers no speedup.
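Producing a 2:4-compatible layout from a dense matrix can be sketched as a magnitude top-2 projection per group of four consecutive inputs. This is a simplified stand-in for what SparseGPT and Wanda do in 2:4 mode (their scoring differs); shapes are toy values.

```python
import torch

# Sketch: project a dense weight onto the 2:4 layout by keeping the two
# largest-magnitude entries in every group of four consecutive inputs.
torch.manual_seed(0)
W = torch.randn(8, 16)                          # (out, in); in divisible by 4

groups = W.abs().reshape(W.shape[0], -1, 4)     # (out, groups, 4)
keep = groups.topk(2, dim=-1).indices           # top-2 magnitudes per group
mask = torch.zeros_like(groups, dtype=torch.bool).scatter_(-1, keep, True)
W24 = W * mask.reshape(W.shape)

# Every group of 4 now has exactly 2 non-zeros, the pattern Sparse Tensor
# Cores require (the hardware additionally needs its own metadata encoding).
nz_per_group = (W24.reshape(W.shape[0], -1, 4) != 0).sum(dim=-1)
```

The geometric cost is visible here: if three important weights fall in one group of four, one of them must go regardless of its global rank.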

Fully structured pruning — dropping attention heads, FFN channels, or entire transformer layers — produces a dense model that any runtime accelerates, including naive PyTorch. The quality penalty is the cost of that portability.


Benchmark table: same-model numbers that matter

The numbers below are drawn from the Wanda repository's reported benchmarks, all measured on LLaMA-2-7B at 50% sparsity, evaluated on WikiText-2, calibrated on 128 samples from C4. The dense baseline perplexity is the reference point used by the repository, and the exact value should be rechecked against the current benchmark table before procurement decisions.

| Method | Sparsity type | Sparsity ratio | WikiText-2 PPL | PPL delta vs dense |
|---|---|---|---|---|
| Dense (baseline) | N/A | 0% | Repository reference | Baseline |
| Magnitude pruning | Unstructured | 50% | Repository-reported; verify against current README | Repository-reported; verify against current README |
| SparseGPT | Unstructured | 50% | Repository-reported; verify against current README | Repository-reported; verify against current README |
| Wanda | Unstructured | 50% | Repository-reported; verify against current README | Repository-reported; verify against current README |
| SparseGPT | 2:4 semi-structured | 50% | Repository-reported; verify against current README | Repository-reported; verify against current README |
| Wanda | 2:4 semi-structured | 50% | Repository-reported; verify against current README | Repository-reported; verify against current README |

Magnitude pruning at 50% is functionally broken for deployment — its perplexity is far above the dense reference and reflects severe capability collapse. SparseGPT and Wanda both remain close to the dense reference in the unstructured regime, which is the quality-retention story that makes activation-aware pruning compelling. The 2:4 results are the hardware trade-off story: both methods take a noticeable quality hit to enable Sparse Tensor Core acceleration, and SparseGPT's weight-correction step partially closes the gap over Wanda.

How much sparsity can LLaMA-2-7B handle before quality materially degrades? At 50% unstructured sparsity with either SparseGPT or Wanda, the model remains practically usable. Pushing beyond 60–70% unstructured sparsity with either method produces sharper perplexity increases; results published in the Wanda repository suggest quality degrades more rapidly above that threshold, though the exact tolerance depends on the downstream task. The same qualitative pattern holds when the comparison is repeated on larger checkpoints such as OPT-175B and BLOOM-176B: the absolute perplexities differ, but the ordering between unstructured and structured pruning is consistent.
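For reference, the perplexity metric behind these tables is simply the exponential of the mean per-token negative log-likelihood. The random logits below stand in for a real model's outputs; a real WikiText-2 evaluation would run the pruned model over the test set with a sliding context window.

```python
import math
import torch
import torch.nn.functional as F

# How a perplexity figure is produced: ppl = exp(mean token NLL).
# Random logits here are a stand-in for a causal LM's actual outputs.
torch.manual_seed(0)
vocab, seq = 100, 32
logits = torch.randn(seq, vocab)              # model outputs per position
targets = torch.randint(0, vocab, (seq,))     # next-token labels

nll = F.cross_entropy(logits, targets)        # mean negative log-likelihood
ppl = math.exp(nll.item())
# A uniform predictor over `vocab` tokens would score ppl == vocab exactly.
```

This is why a post-pruning perplexity "close to the dense reference" is a direct statement about how well the pruned model still predicts held-out text.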


What actually turns sparsity into latency gains

Sparsity reduces the number of non-zero multiply-accumulate operations, but whether that reduction becomes faster wall-clock inference depends entirely on the kernel path the runtime takes.

Sparsity type vs. actual acceleration mechanism

| Sparsity type | Requires custom kernel | Runtime support | Latency reduction observable | Notes |
|---|---|---|---|---|
| Unstructured | Yes (sparse GEMM) | Not standard in TensorRT-LLM or HF | Storage only | No throughput gain on standard GPU kernels |
| 2:4 semi-structured | Yes (Sparse Tensor Core) | TensorRT-LLM, cuSPARSELt | Yes, on A100/H100 | Requires Ampere+ architecture |
| Structured (channel/head) | No (dense GEMM on smaller matrix) | Any runtime | Yes, universally | No special kernel needed |

TensorRT-LLM compiles models into a TensorRT engine and provides in-flight batching, paged KV caching, and tensor parallelism. Its support for 2:4 semi-structured sparsity via cuSPARSELt-backed kernels is the path from a SparseGPT-pruned or Wanda-pruned 2:4 model to real throughput improvement on A100 and H100 hardware. On an H100, TensorRT-LLM's inference throughput can reach 4.6× that of an A100 in the specific benchmarked async-inference scenario documented by NVIDIA — but that figure reflects hardware generation differences and general runtime optimizations, not sparsity-specific gains in isolation.

For unstructured sparse models, standard TensorRT-LLM compilation treats the weight tensors as dense: the zeros are stored and multiplied like any other value. The model may be smaller on disk but serves requests at the same latency as the dense equivalent. Engineers who want unstructured sparsity to translate into latency reduction need either custom cuSPARSE kernels integrated into the serving path, or a framework like DeepSparse (Neural Magic) explicitly built around sparse inference. Neither is a standard out-of-the-box configuration.

The practical implication: if your serving stack is TensorRT-LLM on A100 or H100, SparseGPT or Wanda in 2:4 mode is the only pruning approach that directly delivers GPU-accelerated inference. Structured pruning delivers acceleration everywhere. Unstructured pruning delivers compression only.


Decision matrix: which pruning path fits your deployment goal

The central question is not "which method gets the lowest perplexity?" It is "what is my binding constraint — model quality, serving latency, or infrastructure simplicity?"

Choose unstructured pruning (SparseGPT or Wanda) when:
- Quality retention is the primary objective and you will accept storage reduction without latency gain
- Your serving infrastructure does not include 2:4-capable kernels (non-NVIDIA hardware, or NVIDIA pre-Ampere)
- You are running research experiments comparing pruning algorithms and need the cleanest perplexity signal

Choose Wanda (unstructured) over SparseGPT (unstructured) when:
- Calibration compute budget is limited: Wanda requires no weight-correction pass, making it faster to execute on large models
- 50% sparsity is the target: Wanda's repository-reported unstructured result edges out SparseGPT at this sparsity without the extra correction step

Choose SparseGPT (2:4) over Wanda (2:4) when:
- Semi-structured sparsity is required for Sparse Tensor Core deployment and quality matters: SparseGPT's repository-reported 2:4 result is the lower-loss option at the margin
- The model will serve latency-sensitive production traffic and 2:4 sparsity is the compression strategy

Choose structured pruning when:
- Hardware portability matters more than minimizing perplexity delta: structured pruning works on every runtime without kernel support
- Deployment target includes non-NVIDIA hardware (CPU inference, AMD GPUs, edge devices) where 2:4 Sparse Tensor Cores are unavailable
- Model size reduction is the goal and the downstream task permits a larger quality budget
- Serving through TensorRT-LLM on older hardware where semi-structured kernels are not supported

| Goal | Best method | Quality cost | Speedup on GPU |
|---|---|---|---|
| Minimum PPL at 50% sparsity | Wanda (unstructured) | Lower than structured pruning | Storage only |
| Best 2:4 quality + real acceleration | SparseGPT (2:4) + TensorRT-LLM | Moderate | Yes, A100/H100 |
| Universal hardware acceleration | Structured pruning | Highest | Yes, any runtime |
| Fast calibration, good quality | Wanda (unstructured) | Lower than structured pruning | Storage only |

How to avoid misleading pruning comparisons

Most published pruning comparisons contain at least one of four methodology errors that make results non-transferable to production decisions.

Comparing unstructured and structured sparsity as equivalent. A paper that reports "50% sparsity" for SparseGPT (unstructured) alongside "50% sparsity" for structured channel pruning is not presenting equivalent deployment options. The unstructured model requires sparse kernels to accelerate; the structured model accelerates everywhere. Reporting both as "50% compressed" without stating the sparsity type conceals the hardware dependency entirely.

Using different base models or calibration sets. SparseGPT evaluated on LLaMA-7B (v1) and Wanda evaluated on LLaMA-2-7B are not comparable runs. The Wanda repository provides both methods in a single codebase specifically to prevent this error — running controlled comparisons within the same environment is the only defensible approach.

Omitting calibration data size. Both SparseGPT and Wanda use a small calibration set (~128 samples from C4) to collect activation statistics. Smaller calibration sets may under-represent activation distributions for models trained on diverse data; larger sets increase calibration compute. Any benchmark that does not state its calibration size leaves a variable that materially affects the results uncontrolled.

Claiming speedup without measuring it. A pruned model is not a faster model until you measure token throughput on the target GPU with the target runtime. Reporting perplexity without latency measurements in the same section conflates compression quality with deployment performance.

Pro Tip: Run SparseGPT and Wanda back-to-back in the locuslab/wanda repo on your specific model — the implementation environment is held constant, and the only variable is the scoring function. This is the fastest way to get a defensible comparison for your exact model and calibration set.

Watch Out: If your serving stack is a standard HuggingFace generate() call or a naive PyTorch deployment, unstructured sparse models produced by SparseGPT or Wanda will not yield any latency reduction — the sparse weights are treated as dense tensors. Measure actual throughput before claiming any pruning-derived speedup in a production context.


When TensorRT-LLM is the missing piece

TensorRT-LLM answers the question "does my sparse model actually run faster?" — but only for the right combination of sparsity pattern and GPU generation.

TensorRT-LLM sparse model support

| Sparsity type | TensorRT-LLM support | Hardware required | Speedup observable | Limitation |
|---|---|---|---|---|
| Unstructured | Not directly supported | Any NVIDIA GPU | No | Sparse weights compiled as dense |
| 2:4 semi-structured | Yes, via cuSPARSELt kernels | Ampere (A100) or Hopper (H100) | Yes | Requires 2:4-compatible weight layout |
| Structured (channel/head) | Yes (smaller dense model compiles normally) | Any NVIDIA GPU | Yes | No sparsity-specific kernels needed |

TensorRT-LLM compiles LLM weights into a TensorRT engine that applies in-flight batching, paged KV caching, tensor parallelism, and — when the model's weight matrices satisfy the 2:4 constraint — routes matrix multiplications through cuSPARSELt-backed Sparse Tensor Core kernels. On H100 hardware, the combination of Hopper architecture improvements and TensorRT-LLM's runtime stack is documented by NVIDIA as reaching up to 4.6× the throughput of A100 in a comparable async-inference benchmark — though that multiplier reflects hardware generation differences, not pruning gains in isolation.

For unstructured models, TensorRT-LLM compilation does not insert sparse kernels automatically. The pruned weights are compiled like any dense weight tensor. Engineers who have a Wanda- or SparseGPT-pruned unstructured model and want TensorRT-LLM deployment receive all of TensorRT-LLM's general optimizations (batching, KV cache, fusion) but none of the sparsity-specific acceleration. The model may be smaller in memory, which can increase effective batch size and thus throughput indirectly — but that is a second-order effect, not a direct kernel speedup.

The actionable conclusion: if accelerated inference via TensorRT-LLM is the deployment target, prune to 2:4 with SparseGPT (prefer it over Wanda at 2:4 for lower perplexity delta), export to TensorRT-LLM's sparse engine format, and verify the weight layout passes the 2:4 validation step in the TensorRT-LLM build pipeline. Mismatched weight layouts will silently fall back to dense execution.
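A hypothetical pre-flight check for that layout, run before handing weights to a build pipeline, might look like the sketch below. The function name and shapes are illustrative assumptions; this is not part of TensorRT-LLM's actual validation step, which additionally checks dtype and metadata encoding.

```python
import torch

def satisfies_two_four(W: torch.Tensor) -> bool:
    """Hypothetical pre-flight check: at most two non-zeros in every four
    consecutive elements along the input (last) dimension."""
    if W.shape[-1] % 4 != 0:
        return False
    nz = (W.reshape(*W.shape[:-1], -1, 4) != 0).sum(dim=-1)
    return bool((nz <= 2).all())

dense = torch.randn(4, 8)                         # fully dense: should fail
pruned = torch.tensor([[3.0, 0.0, -1.0, 0.0]] * 4)  # 2 non-zeros per group: passes
```

Running a check like this in CI is cheap insurance against the silent dense-fallback failure mode described above.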


FAQ

What is the difference between SparseGPT and Wanda? SparseGPT uses approximate second-order Hessian information and adjusts remaining weights after pruning to compensate for removed weights. Wanda uses a single forward pass to collect input activation norms and scores weights by the product of weight magnitude and activation norm, without post-pruning weight correction. Wanda is faster to run and produces slightly better unstructured perplexity on LLaMA-2-7B at 50% sparsity; SparseGPT produces better quality at 2:4 semi-structured sparsity.

Does structured pruning actually speed up LLM inference? Yes — structured pruning removes entire channels or attention heads, producing a smaller dense model that any GEMM-optimized runtime accelerates. Unlike unstructured or 2:4 sparsity, structured pruning requires no special kernels. The cost is a larger perplexity penalty at equivalent compression ratios.

Is unstructured pruning faster than structured pruning? No — not on standard GPU runtimes. Unstructured sparsity produces zero latency reduction without sparse GEMM kernels. Structured pruning runs faster immediately on any runtime. If your question is about quality retention rather than speed, unstructured pruning wins: it preserves perplexity significantly better than structured pruning at the same sparsity percentage.

How much sparsity can LLaMA-2-7B handle before quality drops? At 50% unstructured sparsity, SparseGPT and Wanda keep perplexity close to the dense reference. At 2:4 semi-structured sparsity, both methods incur a clear quality hit. Magnitude pruning at 50% produces perplexity far above the dense baseline, making it unsuitable for production use at this sparsity level.

Does TensorRT-LLM support sparse models? TensorRT-LLM supports 2:4 semi-structured sparsity on Ampere (A100) and Hopper (H100) GPUs via cuSPARSELt-backed Sparse Tensor Core kernels. It does not provide automatic acceleration for unstructured sparse models. Structured pruning outputs (smaller dense models) compile normally through TensorRT-LLM without requiring any sparsity-specific handling.

What calibration set should I use? Both SparseGPT and Wanda benchmarks cited here used 128 samples from the C4 dataset. This is a minimal calibration requirement — fast to run and sufficient for the activation statistics both methods need. For domain-specific models, substituting task-relevant calibration data may improve pruned quality over general-purpose C4 samples.
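The calibration pass itself amounts to accumulating activation statistics with forward hooks. The tiny model and random batches below are stand-ins for a real LLM and 128 C4 sequences; the accumulated per-feature norms are exactly the quantity Wanda's score multiplies against weight magnitudes.

```python
import torch
import torch.nn as nn

# Sketch of a calibration pass: hooks on every Linear accumulate squared
# activation norms per input feature across calibration samples.
torch.manual_seed(0)
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 8))
sq_norms = {}

def make_hook(name):
    def hook(module, inputs, output):
        x = inputs[0].detach().reshape(-1, inputs[0].shape[-1])
        sq_norms[name] = sq_norms.get(name, 0.0) + (x ** 2).sum(dim=0)
    return hook

handles = [m.register_forward_hook(make_hook(n))
           for n, m in model.named_modules() if isinstance(m, nn.Linear)]

with torch.no_grad():
    for _ in range(128):                 # stand-in for 128 calibration samples
        model(torch.randn(4, 16))        # random batches instead of C4 text

for h in handles:
    h.remove()
act_norms = {k: v.sqrt() for k, v in sq_norms.items()}   # per-feature L2 norms
```

Swapping the random batches for tokenized domain text is all it takes to run the domain-specific calibration suggested above.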

Pro Tip: For most production deployments, the defensible default is Wanda at 50% unstructured sparsity if quality is the binding constraint, or SparseGPT at 2:4 sparsity plus TensorRT-LLM if latency on A100/H100 is the binding constraint. Structured pruning is the right choice only when hardware portability or simplicity of deployment outweighs the quality cost — or when the downstream task perplexity tolerance is wide enough to absorb it.


Sources & References

| Source | Source type | What it verifies |
|---|---|---|
| locuslab/wanda (GitHub repository) | Repository | Wanda scoring criterion, shared implementation context, and repository-reported benchmark figures |
| Wanda README (locuslab/wanda) | Repository README | Direct wording for activation-aware pruning and the benchmark table context |
| "SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot" (arXiv:2301.00774) | Paper | SparseGPT's Hessian-based pruning and post-pruning weight update mechanism |
| TensorRT-LLM (Hugging Face community integration documentation) | Runtime documentation | TensorRT-LLM deployment capabilities and supported runtime features |
| TensorRT-LLM async inference docs (v0.21.0) | Runtime documentation | The benchmarked A100 vs. H100 throughput comparison in the documented async-inference scenario |
| 2:4 sparsity and Sparse Tensor Core constraints (arXiv:2603.05232) | Hardware reference | The 2:4 sparsity-pattern constraint used by Sparse Tensor Core kernels |
| NVIDIA H100 Tensor Core GPU product page | Hardware reference | Hopper architecture and Sparse Tensor Core support context |
| OpenReview | Research venue | Peer-review discovery context for pruning papers and related benchmarks |

OpenReview is included here as the venue where pruning papers and related LLM compression work are often discussed before or during publication. It is a research-discovery surface, not a benchmark source.


