How we compared SparseGPT, Wanda, and structured pruning
All three pruning strategies operate on the same starting point — a pretrained LLM — but they differ in scoring criterion, sparsity pattern, and whether the result can be accelerated on real hardware. The comparison below uses LLaMA-2-7B as the base model, the standard 128-sample C4 calibration set, and WikiText-2 perplexity as the evaluation metric, matching the evaluation protocol reported in the Wanda repository.
Pruning method comparison — setup and scoring
| Method | Scoring criterion | Sparsity pattern | Calibration required | Implementation |
|---|---|---|---|---|
| Magnitude pruning | Weight magnitude only | Unstructured | No | Baseline |
| SparseGPT | Second-order Hessian information | Unstructured or 2:4 | Yes (~128 samples) | locuslab/wanda repo |
| Wanda | Weight magnitude × input activation norm | Unstructured or 2:4 | Yes (~128 samples) | locuslab/wanda repo |
| Structured pruning | Magnitude or gradient-based, per channel/head | Channel/head/layer | Yes | Various |
The critical design difference: magnitude pruning ignores activation statistics entirely and produces the worst perplexity at any non-trivial sparsity. Wanda's own README makes the distinction explicit — it "removes weights on a per-output basis, by the product of weight magnitudes and input activation norms" — which means a weight with small magnitude but large activation contribution survives, where magnitude pruning would have dropped it. SparseGPT takes a more expensive route, using approximate second-order information to update remaining weights after each pruning step, compensating for the error introduced by each removed parameter.
Both SparseGPT and Wanda operate within the same codebase in the Wanda repository, making it possible to run genuinely controlled comparisons where the only variable is the scoring function.
At a glance: quality retention versus real speedup
Perplexity after pruning and wall-clock inference latency are separate dimensions — and most published comparisons collapse them into one misleading claim.
Bottom Line: Unstructured sparsity (SparseGPT or Wanda at 50%) preserves LLaMA-2-7B perplexity far better than any structured or 2:4 pattern at the same compression ratio. However, unstructured sparsity produces no reliable latency reduction on standard CUDA kernels. Structured and 2:4 semi-structured sparsity sacrifice perplexity to unlock hardware-accelerated kernels — on NVIDIA A100/H100 GPUs via TensorRT-LLM or PyTorch 2.1's semi-structured sparsity support. The choice is not about which method is "better"; it is about which axis your deployment constraint is on — quality budget or latency budget.
Concretely: an unstructured sparse model at 50% sparsity may be smaller on disk but serves tokens at the same speed as the dense model unless the inference stack includes custom sparse matrix multiply kernels. A 2:4 model loses more perplexity but can use NVIDIA Sparse Tensor Cores natively, translating compression into actual throughput improvement.
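To see what "kernel path" means in practice, here is a minimal sketch of PyTorch 2.1's semi-structured sparsity API, assuming a CUDA GPU with Ampere or newer Sparse Tensor Cores and fp16 weights. The layer size and the magnitude-based masking are illustrative placeholders, not the Wanda or SparseGPT scoring rule.

```python
import torch
from torch.sparse import to_sparse_semi_structured

# Illustrative fp16 linear layer on an Ampere-or-newer GPU.
linear = torch.nn.Linear(4096, 4096, bias=False).half().cuda()

# For the sketch, impose a 2:4 pattern by keeping the 2 largest-magnitude
# weights in every group of 4 consecutive elements along each row
# (a stand-in for an actual 2:4 pruning pass).
w = linear.weight.detach()
groups = w.view(w.shape[0], -1, 4)
smallest = groups.abs().argsort(dim=-1)[..., :2]      # 2 smallest per group
mask = torch.ones_like(groups, dtype=torch.bool)
mask.scatter_(-1, smallest, False)
w_24 = (groups * mask).reshape_as(w)

# Swap in the semi-structured representation; the forward pass can now
# route through Sparse Tensor Core kernels instead of dense GEMM.
linear.weight = torch.nn.Parameter(to_sparse_semi_structured(w_24))

x = torch.rand(64, 4096, dtype=torch.float16, device="cuda")
print(linear(x).shape)   # torch.Size([64, 4096])
```

Running the same weights through a plain `nn.Linear` without the conversion step gives exactly the dense-speed behavior described above, which is the whole point of the distinction.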
The same framing applies when you move from these benchmarked LLaMA-2-7B runs to larger checkpoints such as OPT-175B or BLOOM-176B: the scoring rule still decides which weights survive, but the deployment payoff still depends on the kernel path, not the pruning name.
SparseGPT: strongest unstructured baseline for minimal perplexity loss
SparseGPT produces the better 2:4 semi-structured quality of the two methods, and its unstructured results are competitive with Wanda. The mechanism is computationally heavier: SparseGPT computes an approximate inverse Hessian for each weight matrix, then iteratively prunes weights and adjusts the remaining weights to minimize the reconstruction error introduced by each removal. This means each layer's outputs are corrected as pruning proceeds — a significant advantage over one-shot scoring methods.
SparseGPT characteristics at a glance
| Property | Value |
|---|---|
| Scoring | Approximate second-order (Hessian-based) |
| Calibration samples | ~128 (C4 subset) |
| Unstructured 50% PPL (LLaMA-2-7B) | Repository-reported; verify against the current Wanda README |
| 2:4 semi-structured PPL (LLaMA-2-7B) | Repository-reported; verify against the current Wanda README |
| Pruning time on 7B model | Minutes (GPU-dependent) |
| Post-pruning weight update | Yes — compensates removed weights |
The post-pruning weight adjustment is SparseGPT's defining property: it absorbs the quantization-like error from each pruned weight into neighboring weights, which is why it consistently outperforms Wanda on 2:4 sparsity patterns where the constraint on which weights can survive is tighter. At 50% unstructured sparsity, the freedom to place zeros anywhere lets Wanda close the gap.
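As a rough illustration of that compensation step, the sketch below applies the classic OBS-style update that SparseGPT builds on: zero the weight with the smallest saliency, then shift the surviving weights along the corresponding inverse-Hessian column. This is a toy, single-row version under the assumption that an approximate inverse Hessian is already available; the actual implementation prunes in column blocks with Cholesky factorizations and dampening, so treat it as the idea rather than the algorithm.

```python
import torch

def obs_prune_row(w_row: torch.Tensor, hinv: torch.Tensor, sparsity: float = 0.5) -> torch.Tensor:
    """Greedy OBS-style pruning of a single weight row.

    w_row: (d,) weights feeding one output neuron.
    hinv:  (d, d) approximate inverse Hessian built from calibration inputs.
    Toy version: real SparseGPT prunes in column blocks and reuses a
    Cholesky factorization of hinv instead of rescoring every step.
    """
    w = w_row.clone()
    pruned = torch.zeros_like(w, dtype=torch.bool)
    diag = torch.diagonal(hinv)
    for _ in range(int(sparsity * w.numel())):
        # Saliency: reconstruction error incurred by zeroing each weight.
        score = w.pow(2) / diag
        score[pruned] = float("inf")
        q = int(score.argmin())
        # Compensate the surviving weights for removing w[q].
        w = w - (w[q] / diag[q]) * hinv[:, q]
        pruned[q] = True
        w[pruned] = 0.0          # keep already-pruned weights at zero
    return w
```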
An important constraint: SparseGPT's advantages do not transfer to deployment speed without a runtime that understands the sparsity pattern. Producing an unstructured sparse weight tensor from SparseGPT and running it through a standard PyTorch nn.Linear delivers no latency reduction — the sparse tensor is stored and multiplied at full cost.
Wanda: activation-aware pruning with competitive unstructured quality
Wanda's scoring criterion is simpler than SparseGPT's but outperforms it on unstructured sparsity. The key insight: low-magnitude weights are not uniformly unimportant. A weight connected to an input feature with a large activation norm contributes more to the output than a high-magnitude weight multiplied by near-zero activations. Wanda captures this with a single forward pass to collect input activation norms, then prunes by the product of weight magnitude and activation norm, per output channel.
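The whole criterion fits in a few lines. Below is a hedged sketch of the scoring and per-row pruning for one linear layer, assuming the calibration activations for that layer have already been captured (for example with a forward hook); variable names are illustrative, and the actual repository code handles layers sequentially so that pruned outputs feed the next layer's calibration.

```python
import torch

def wanda_prune_layer(weight: torch.Tensor, calib_x: torch.Tensor, sparsity: float = 0.5) -> torch.Tensor:
    """Wanda-style scoring for one linear layer.

    weight:  (out_features, in_features) weight matrix.
    calib_x: (n_tokens, in_features) calibration activations entering this layer.
    """
    # One statistic per input feature: L2 norm across the calibration tokens.
    act_norm = calib_x.norm(p=2, dim=0)                 # (in_features,)
    score = weight.abs() * act_norm.unsqueeze(0)        # |W_ij| * ||X_j||_2

    # Compare scores within each output row and drop the lowest ones,
    # so every output keeps the same fraction of its incoming weights.
    n_prune = int(sparsity * weight.shape[1])
    drop = score.argsort(dim=1)[:, :n_prune]
    mask = torch.ones_like(weight, dtype=torch.bool)
    mask.scatter_(1, drop, False)
    return weight * mask
```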
Wanda characteristics at a glance
| Property | Value |
|---|---|
| Scoring | Weight magnitude × activation norm |
| Calibration samples | ~128 (C4 subset) |
| Unstructured 50% PPL (LLaMA-2-7B) | Repository-reported; verify against the current Wanda README |
| 2:4 semi-structured PPL (LLaMA-2-7B) | Repository-reported; verify against the current Wanda README |
| Post-pruning weight update | No |
| Relative calibration cost | Lower than SparseGPT |
Wanda's unstructured perplexity on LLaMA-2-7B edges out SparseGPT at the same sparsity level — a meaningful difference considering neither requires retraining. The trade-off appears in the 2:4 column: without weight correction after pruning, Wanda falls behind SparseGPT when the pruning pattern is constrained to the strict 2-out-of-4 layout required by NVIDIA Sparse Tensor Cores.
Both figures are reported in the Wanda repository using the same base model and calibration set. Treat them as repository-reported benchmarks and revalidate against the current README or linked paper before using them in production procurement decisions.
Structured pruning: why hardware-friendly sparsity can cost quality
Structured pruning removes entire channels, attention heads, or layers rather than individual weights. The result is a genuinely smaller dense model — no sparse storage format required, no special kernels needed, and any GEMM-optimized runtime accelerates it automatically. That hardware compatibility comes at a steep quality cost relative to unstructured methods at equivalent compression ratios.
The mechanism explains the gap. When SparseGPT or Wanda removes 50% of the weights, it selects the least important individual weights across the entire matrix. Structured pruning must remove entire rows or columns — it cannot fine-tune which individual weights within a channel survive. The constraint is geometric, and it forces the removal of some moderately important weights to maintain the structure.
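A small sketch makes the "genuinely smaller dense model" point concrete: dropping whole output channels of one linear layer shrinks the matrix itself, and the matching input columns of the next layer must be removed too. Layer names and the plain L2-norm importance score are placeholders; real structured pruning also has to repair residual streams, layer norms, gated FFN projections, and attention-head groupings.

```python
import torch
import torch.nn as nn

def prune_ffn_channels(fc1: nn.Linear, fc2: nn.Linear, keep_ratio: float = 0.5):
    """Remove whole hidden channels from an fc1 -> activation -> fc2 block.

    Channel importance is a plain L2 norm per output row of fc1,
    standing in for whatever magnitude/gradient criterion is used.
    """
    importance = fc1.weight.norm(p=2, dim=1)                  # one score per channel
    n_keep = int(keep_ratio * fc1.out_features)
    keep = importance.topk(n_keep).indices.sort().values

    new_fc1 = nn.Linear(fc1.in_features, n_keep, bias=fc1.bias is not None)
    new_fc1.weight.data = fc1.weight.data[keep].clone()
    if fc1.bias is not None:
        new_fc1.bias.data = fc1.bias.data[keep].clone()

    # The next layer loses the matching *input* columns.
    new_fc2 = nn.Linear(n_keep, fc2.out_features, bias=fc2.bias is not None)
    new_fc2.weight.data = fc2.weight.data[:, keep].clone()
    if fc2.bias is not None:
        new_fc2.bias.data = fc2.bias.data.clone()
    return new_fc1, new_fc2   # genuinely smaller dense matrices, no sparse format

fc1, fc2 = nn.Linear(4096, 11008), nn.Linear(11008, 4096)
small_fc1, small_fc2 = prune_ffn_channels(fc1, fc2, keep_ratio=0.5)
print(small_fc1.weight.shape, small_fc2.weight.shape)   # (5504, 4096), (4096, 5504)
```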
Sparsity pattern comparison: quality, storage, and acceleration
| Sparsity type | Pattern constraint | Kernel support | Storage savings | Real latency reduction | Quality cost |
|---|---|---|---|---|---|
| Unstructured | None | None (standard GEMM) | Yes | None without sparse kernels | Lowest |
| 2:4 semi-structured | 2 non-zeros per 4 consecutive elements | NVIDIA Sparse Tensor Cores (A100/H100) | ~2× reduction in weight storage | Yes, on supported GPUs | Moderate |
| Structured (channel/head) | Full rows/columns removed | Any GEMM runtime | Yes (smaller dense matrix) | Yes, universally | Highest |
The 2:4 pattern sits between unstructured and fully structured pruning. NVIDIA's Sparse Tensor Core architecture imposes a strict 2:4 layout (at most two non-zeros in every group of four consecutive elements), and general-purpose sparse matrix formats do not satisfy it: the hardware requires the compact metadata encoding specified for Ampere/Hopper Sparse Tensor Cores. TensorRT-LLM exposes this through compatible sparse matmul kernels; without TensorRT-LLM or an equivalent runtime with explicit 2:4 kernel support, a 2:4-pruned model runs through standard dense GEMM and delivers no speedup.
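Because a near-miss layout silently falls back to dense execution, it is worth validating the pattern before handing weights to a 2:4 runtime. A short sketch, assuming the common convention of grouping four consecutive elements along the last weight dimension:

```python
import torch

def satisfies_2_to_4(weight: torch.Tensor) -> bool:
    """True if every group of 4 consecutive elements along the last
    dimension contains at most 2 non-zeros (the 2:4 hardware constraint)."""
    if weight.shape[-1] % 4 != 0:
        return False
    groups = weight.reshape(*weight.shape[:-1], -1, 4)
    return bool(((groups != 0).sum(dim=-1) <= 2).all())
```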
Fully structured pruning — dropping attention heads, FFN channels, or entire transformer layers — produces a dense model that any runtime accelerates, including naive PyTorch. The quality penalty is the cost of that portability.
Benchmark table: same-model numbers that matter
The four numbers below are drawn from the Wanda repository's reported benchmarks, all measured on LLaMA-2-7B at 50% sparsity, evaluated on WikiText-2, calibrated on 128 samples from C4. The dense baseline perplexity is the reference point used by the repository, and the exact value should be rechecked against the current benchmark table before procurement decisions.
| Method | Sparsity type | Sparsity ratio | WikiText-2 PPL | PPL delta vs dense |
|---|---|---|---|---|
| Dense (baseline) | — | 0% | Repository reference value | — |
| Magnitude pruning | Unstructured | 50% | Repository-reported; verify against current README | Repository-reported |
| SparseGPT | Unstructured | 50% | Repository-reported; verify against current README | Repository-reported |
| Wanda | Unstructured | 50% | Repository-reported; verify against current README | Repository-reported |
| SparseGPT | 2:4 semi-structured | 50% | Repository-reported; verify against current README | Repository-reported |
| Wanda | 2:4 semi-structured | 50% | Repository-reported; verify against current README | Repository-reported |
Magnitude pruning at 50% is functionally broken for deployment — its perplexity is far above the dense reference and reflects severe capability collapse. SparseGPT and Wanda both remain close to the dense reference in the unstructured regime, which is the quality-retention story that makes activation-aware pruning compelling. The 2:4 results are the hardware trade-off story: both methods take a noticeable quality hit to enable Sparse Tensor Core acceleration, and SparseGPT's weight-correction step partially closes the gap over Wanda.
How much sparsity can LLaMA-2-7B handle before quality materially degrades? At 50% unstructured sparsity with either SparseGPT or Wanda, the model remains practically usable. Pushing beyond 60–70% unstructured sparsity with either method produces sharper perplexity increases; published results in the Wanda repository suggest quality degrades more rapidly above that threshold, though the exact tolerance depends on the downstream task. The same qualitative pattern has been observed when the experiments are repeated on OPT-175B and BLOOM-176B: the absolute perplexities differ, but the ordering between unstructured and structured pruning is consistent.
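To revalidate the quality numbers on your own pruned checkpoint rather than quoting the README, a simple non-overlapping chunked WikiText-2 perplexity pass is usually enough for a sanity check. The sketch below assumes a Hugging Face causal-LM checkpoint; the model path and 2048-token context are illustrative, and the Wanda repository's own evaluation script may differ in windowing details.

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"     # or the path to your pruned checkpoint
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tok(text, return_tensors="pt").input_ids
seq_len = 2048

nlls, n_tokens = [], 0
for start in range(0, ids.shape[1], seq_len):
    chunk = ids[:, start:start + seq_len].to(model.device)
    if chunk.shape[1] < 2:
        break
    with torch.no_grad():
        loss = model(chunk, labels=chunk).loss    # mean NLL over predicted tokens
    nlls.append(loss.float() * (chunk.shape[1] - 1))
    n_tokens += chunk.shape[1] - 1

print(f"WikiText-2 perplexity: {torch.exp(torch.stack(nlls).sum() / n_tokens).item():.2f}")
```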
What actually turns sparsity into latency gains
Sparsity reduces the number of non-zero multiply-accumulate operations, but whether that reduction becomes faster wall-clock inference depends entirely on the kernel path the runtime takes.
Sparsity type vs. actual acceleration mechanism
| Sparsity type | Requires custom kernel | Runtime support | Latency reduction observable | Notes |
|---|---|---|---|---|
| Unstructured | Yes — sparse GEMM | Not standard in TensorRT-LLM or Hugging Face Transformers | No (storage savings only) | No throughput gain on standard GPU kernels |
| 2:4 semi-structured | Yes — Sparse Tensor Core | TensorRT-LLM, cuSPARSELt | Yes on A100/H100 | Requires Ampere+ architecture |
| Structured (channel/head) | No — dense GEMM on smaller matrix | Any runtime | Yes, universally | No special kernel needed |
TensorRT-LLM compiles models into a TensorRT engine and provides in-flight batching, paged KV caching, and tensor parallelism. Its support for 2:4 semi-structured sparsity via cuSPARSELt-backed kernels is the path from a SparseGPT-pruned or Wanda-pruned 2:4 model to real throughput improvement on A100 and H100 hardware. On an H100, TensorRT-LLM's inference throughput can reach 4.6× that of an A100 in the specific benchmarked async-inference scenario documented by NVIDIA — but that figure reflects hardware generation differences and general runtime optimizations, not sparsity-specific gains in isolation.
For unstructured sparse models, standard TensorRT-LLM compilation treats the weight tensors as dense — the zeros are stored and multiplied like any other value. The model may be smaller on disk but serves requests at the same latency as the dense equivalent. Engineers who want unstructured sparsity to translate into latency reduction need either custom cuSPARSE kernels integrated into the serving path, or a framework like DeepSparse (Neural Magic's CPU-focused sparse inference runtime) explicitly built around sparse execution. Neither is a standard out-of-the-box configuration.
The practical implication: if your serving stack is TensorRT-LLM on A100 or H100, SparseGPT or Wanda in 2:4 mode is the only pruning approach that directly delivers GPU-accelerated inference. Structured pruning delivers acceleration everywhere. Unstructured pruning delivers compression only.
Decision matrix: which pruning path fits your deployment goal
The central question is not "which method gets the lowest perplexity?" It is "what is my binding constraint — model quality, serving latency, or infrastructure simplicity?"
Choose unstructured pruning (SparseGPT or Wanda) when:
- Quality retention is the primary objective and you will accept storage reduction without latency gain
- Your serving infrastructure does not include 2:4-capable kernels (non-NVIDIA hardware, or NVIDIA pre-Ampere)
- You are running research experiments comparing pruning algorithms and need the cleanest perplexity signal

Choose Wanda (unstructured) over SparseGPT (unstructured) when:
- Calibration compute budget is limited — Wanda requires no weight-correction pass, making it faster to execute on large models
- 50% sparsity is the target — Wanda's repository-reported unstructured result edges out SparseGPT at this sparsity without the extra correction step

Choose SparseGPT (2:4) over Wanda (2:4) when:
- Semi-structured sparsity is required for Sparse Tensor Core deployment and quality matters — SparseGPT's repository-reported 2:4 result is the lower-loss option at the margin
- The model will serve latency-sensitive production traffic and 2:4 sparsity is the compression strategy

Choose structured pruning when:
- Hardware portability matters more than minimizing perplexity delta — structured pruning works on every runtime without kernel support
- Deployment target includes non-NVIDIA hardware (CPU inference, AMD GPUs, edge devices) where 2:4 Sparse Tensor Cores are unavailable
- Model size reduction is the goal and the downstream task permits a larger quality budget
- Serving through TensorRT-LLM on older hardware where semi-structured kernels are not supported
| Goal | Best method | Quality cost | Speedup on GPU |
|---|---|---|---|
| Minimum PPL at 50% sparsity | Wanda (unstructured) | Lower than structured pruning | None (storage savings only) |
| Best 2:4 quality + real acceleration | SparseGPT (2:4) + TensorRT-LLM | Moderate | Yes, A100/H100 |
| Universal hardware acceleration | Structured pruning | Highest | Yes, any runtime |
| Fast calibration, good quality | Wanda (unstructured) | Lower than structured pruning | None (storage savings only) |
How to avoid misleading pruning comparisons
Most published pruning comparisons contain at least one of four methodology errors that make results non-transferable to production decisions.
Comparing unstructured and structured sparsity as equivalent. A paper that reports "50% sparsity" for SparseGPT (unstructured) alongside "50% sparsity" for structured channel pruning is not presenting equivalent deployment options. The unstructured model requires sparse kernels to accelerate; the structured model accelerates everywhere. Reporting both as "50% compressed" without stating the sparsity type conceals the hardware dependency entirely.
Using different base models or calibration sets. SparseGPT evaluated on LLaMA-7B (v1) and Wanda evaluated on LLaMA-2-7B are not comparable runs. The Wanda repository provides both methods in a single codebase specifically to prevent this error — running controlled comparisons within the same environment is the only defensible approach.
Omitting calibration data size. Both SparseGPT and Wanda use a small calibration set (~128 samples from C4) to collect activation statistics. Smaller calibration sets may under-represent activation distributions for models trained on diverse data; larger sets increase calibration compute. Any benchmark that does not state calibration size leaves an interpretable variable uncontrolled.
Claiming speedup without measuring it. A pruned model is not a faster model until you measure token throughput on the target GPU with the target runtime. Reporting perplexity without latency measurements in the same section conflates compression quality with deployment performance.
Pro Tip: Run SparseGPT and Wanda back-to-back in the locuslab/wanda repo on your specific model — the implementation environment is held constant, and the only variable is the scoring function. This is the fastest way to get a defensible comparison for your exact model and calibration set.
Watch Out: If your serving stack is a standard HuggingFace `generate()` call or a naive PyTorch deployment, unstructured sparse models produced by SparseGPT or Wanda will not yield any latency reduction — the sparse weights are treated as dense tensors. Measure actual throughput before claiming any pruning-derived speedup in a production context.
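Measuring is straightforward. A bare-bones sketch for comparing dense and pruned checkpoints on the same GPU follows; the checkpoint path, prompt, and generation length are placeholders, and a production measurement should also sweep batch sizes and sequence lengths.

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "path/to/dense-or-pruned-checkpoint"   # run once per build and compare
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)
inputs = tok("The capital of France is", return_tensors="pt").to(model.device)
new_tokens = 256

# Warm-up so one-time CUDA initialization is not counted.
model.generate(**inputs, max_new_tokens=new_tokens, do_sample=False)

torch.cuda.synchronize()
start = time.perf_counter()
out = model.generate(**inputs, max_new_tokens=new_tokens, do_sample=False)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

generated = out.shape[1] - inputs.input_ids.shape[1]
print(f"{generated / elapsed:.1f} tokens/sec, {elapsed:.2f}s total")
```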
When TensorRT-LLM is the missing piece
TensorRT-LLM answers the question "does my sparse model actually run faster?" — but only for the right combination of sparsity pattern and GPU generation.
TensorRT-LLM sparse model support
| Sparsity type | TensorRT-LLM support | Hardware required | Speedup observable | Limitation |
|---|---|---|---|---|
| Unstructured | Not directly supported | Any NVIDIA GPU | No | Sparse weights compiled as dense |
| 2:4 semi-structured | Yes, via cuSPARSELt kernels | Ampere (A100) or Hopper (H100) | Yes | Requires 2:4-compatible weight layout |
| Structured (channel/head) | Yes — smaller dense model compiles normally | Any NVIDIA GPU | Yes | No sparsity-specific kernels needed |
TensorRT-LLM compiles LLM weights into a TensorRT engine that applies in-flight batching, paged KV caching, tensor parallelism, and — when the model's weight matrices satisfy the 2:4 constraint — routes matrix multiplications through cuSPARSELt-backed Sparse Tensor Core kernels. On H100 hardware, the combination of Hopper architecture improvements and TensorRT-LLM's runtime stack is documented by NVIDIA as reaching up to 4.6× the throughput of A100 in a comparable async-inference benchmark — though that multiplier reflects hardware generation differences, not pruning gains in isolation.
For unstructured models, TensorRT-LLM compilation does not insert sparse kernels automatically. The pruned weights are compiled like any dense weight tensor. Engineers who have a Wanda- or SparseGPT-pruned unstructured model and want TensorRT-LLM deployment receive all of TensorRT-LLM's general optimizations (batching, KV cache, fusion) but none of the sparsity-specific acceleration. The model may be smaller in memory, which can increase effective batch size and thus throughput indirectly — but that is a second-order effect, not a direct kernel speedup.
The actionable conclusion: if accelerated inference via TensorRT-LLM is the deployment target, prune to 2:4 with SparseGPT (prefer it over Wanda at 2:4 for lower perplexity delta), export to TensorRT-LLM's sparse engine format, and verify the weight layout passes the 2:4 validation step in the TensorRT-LLM build pipeline. Mismatched weight layouts will silently fall back to dense execution.
FAQ
What is the difference between SparseGPT and Wanda? SparseGPT uses approximate second-order Hessian information and adjusts remaining weights after pruning to compensate for removed weights. Wanda uses a single forward pass to collect input activation norms and scores weights by the product of weight magnitude and activation norm, without post-pruning weight correction. Wanda is faster to run and produces slightly better unstructured perplexity on LLaMA-2-7B at 50% sparsity; SparseGPT produces better quality at 2:4 semi-structured sparsity.
Does structured pruning actually speed up LLM inference? Yes — structured pruning removes entire channels or attention heads, producing a smaller dense model that any GEMM-optimized runtime accelerates. Unlike unstructured or 2:4 sparsity, structured pruning requires no special kernels. The cost is a larger perplexity penalty at equivalent compression ratios.
Is unstructured pruning faster than structured pruning? No — not on standard GPU runtimes. Unstructured sparsity produces zero latency reduction without sparse GEMM kernels. Structured pruning runs faster immediately on any runtime. If your question is about quality retention rather than speed, unstructured pruning wins: it preserves perplexity significantly better than structured pruning at the same sparsity percentage.
How much sparsity can LLaMA-2-7B handle before quality drops? At 50% unstructured sparsity, SparseGPT and Wanda keep perplexity close to the dense reference. At 2:4 semi-structured sparsity, both methods incur a clear quality hit. Magnitude pruning at 50% produces perplexity far above the dense baseline, making it unsuitable for production use at this sparsity level.
Does TensorRT-LLM support sparse models? TensorRT-LLM supports 2:4 semi-structured sparsity on Ampere (A100) and Hopper (H100) GPUs via cuSPARSELt-backed Sparse Tensor Core kernels. It does not provide automatic acceleration for unstructured sparse models. Structured pruning outputs (smaller dense models) compile normally through TensorRT-LLM without requiring any sparsity-specific handling.
What calibration set should I use? Both SparseGPT and Wanda benchmarks cited here used 128 samples from the C4 dataset. This is a minimal calibration requirement — fast to run and sufficient for the activation statistics both methods need. For domain-specific models, substituting task-relevant calibration data may improve pruned quality over general-purpose C4 samples.
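For reference, here is a hedged sketch of sampling 128 calibration windows from streamed C4 in the same spirit as the repository's loader; the dataset config, sequence length, and sampling scheme are assumptions, so check the repository's data-loading code for the exact recipe.

```python
import random
from datasets import load_dataset
from transformers import AutoTokenizer

def c4_calibration(tokenizer, n_samples: int = 128, seq_len: int = 2048, seed: int = 0):
    """Sample fixed-length calibration windows from streamed C4."""
    random.seed(seed)
    stream = load_dataset("allenai/c4", "en", split="train", streaming=True)
    samples = []
    for example in stream:
        ids = tokenizer(example["text"], return_tensors="pt").input_ids
        if ids.shape[1] <= seq_len:
            continue                       # skip documents shorter than one window
        start = random.randint(0, ids.shape[1] - seq_len - 1)
        samples.append(ids[:, start:start + seq_len])
        if len(samples) == n_samples:
            break
    return samples

# calib = c4_calibration(AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf"))
```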
Pro Tip: For most production deployments, the defensible default is Wanda at 50% unstructured sparsity if quality is the binding constraint, or SparseGPT at 2:4 sparsity plus TensorRT-LLM if latency on A100/H100 is the binding constraint. Structured pruning is the right choice only when hardware portability or simplicity of deployment outweighs the quality cost — or when the downstream task perplexity tolerance is wide enough to absorb it.
Sources & References
| Source | Source type | What it verifies |
|---|---|---|
| locuslab/wanda — GitHub Repository | Repository | Wanda scoring criterion, shared implementation context, and repository-reported benchmark figures |
| Wanda README — locuslab/wanda | Repository README | Direct wording for activation-aware pruning and the benchmark table context |
| SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot — arXiv:2301.00774 | Paper | SparseGPT Hessian-based pruning and post-pruning weight update mechanism |
| TensorRT-LLM — Hugging Face Community Integration Documentation | Runtime documentation | TensorRT-LLM deployment capabilities and supported runtime features |
| TensorRT-LLM Async Inference Docs (v0.21.0) | Runtime documentation | The benchmarked A100 vs. H100 throughput comparison in the documented async-inference scenario |
| 2:4 sparsity and Sparse Tensor Core constraints | Hardware reference | The 2:4 sparsity-pattern constraint used by Sparse Tensor Core kernels |
| NVIDIA H100 Tensor Core GPU Product Page | Hardware reference | Hopper architecture and Sparse Tensor Core support context |
| OpenReview | Research venue | Peer-review discovery context for pruning papers and related benchmarks |
OpenReview is included here as the venue where pruning papers and related LLM compression work are often discussed before or during publication. It is a research-discovery surface, not a benchmark source.
Keywords: SparseGPT, Wanda, LLaMA-2-7B, LLaMA-7B, TensorRT-LLM, PyTorch 2.1, NVIDIA A100, NVIDIA H100, 2:4 semi-structured sparsity, unstructured sparsity, structured pruning, perplexity, OPT-175B, BLOOM-176B, OpenReview


