Why benchmark contamination still distorts 2026 evaluation
Benchmark-based evaluation remains the standard method for comparing large language models, and its central weakness is simple to state: test samples, or close variants of them, leak into training data and artificially inflate reported performance. As Jianzhe Chai et al. state directly in When Benchmarks Leak: Inference-Time Decontamination for LLMs: "Benchmark-based evaluation is the de facto standard for comparing large language models (LLMs). However, its reliability is increasingly threatened by test set contamination, where test samples or their close variants leak into training data and artificially inflate reported performance."
The scope is not hypothetical. The paper cites prior findings documenting inflated benchmark performance across six popular benchmarks and over 15 LLMs, with estimated contamination rates ranging from 1% to 45%. At 45%, contamination is not a rounding error — it is a systematic bias that renders score comparisons unreliable. The core problem persists in 2026 because LLM pretraining corpora are large, heterogeneous, and difficult to audit; any benchmark that has been publicly released for more than a few months carries non-trivial risk of partial training exposure.
Bottom Line: Benchmark contamination inflates reported scores across the evaluation ecosystem. Prior contamination rates on widely-used benchmarks reach 45% in documented cases, meaning score comparisons between models can reflect leakage rather than generalization. Inference-time decontamination, specifically the DeconIEP framework from arXiv:2601.19334, addresses this without discarding benchmark data — but it is bounded by a perturbation budget and introduces a clean-accuracy trade-off that practitioners must account for.
How DeconIEP moves decontamination into inference time
DeconIEP changes the unit of intervention from the dataset to the forward pass. Instead of removing, rewriting, or quarantining benchmark examples before evaluation, DeconIEP applies small, ℓ∞-bounded perturbations to the input embeddings at inference time, steering the model's internal representations away from memorization-driven shortcut pathways without modifying the benchmark itself.
The architecture involves three interacting components: the evaluated model (which may be contaminated), a reference model (selected to be relatively less contaminated), and an instance-adaptive perturbation generator trained to minimize the divergence between the evaluated model's behavior and the reference model's behavior on each input. The benchmark prompt is preserved verbatim; only the embedding representation seen by the contaminated model changes.
```mermaid
flowchart LR
    A["Benchmark Prompt\n(unchanged text)"] --> B["Tokenizer / Embedding Layer"]
    B --> C["Perturbation Generator\n(instance-adaptive)"]
    C -->|"ℓ∞-bounded δ"| D["Perturbed Embedding"]
    D --> E["Evaluated Model\n(potentially contaminated)"]
    E --> F["Decontaminated Output / Score"]
    G["Reference Model\n(less-contaminated)"] -->|"Guidance signal\nduring training"| C
    style C fill:#f0e6ff,stroke:#7c4dff
    style G fill:#e6f4ea,stroke:#2e7d32
    style D fill:#fff3e0,stroke:#ef6c00
```
The key architectural choice — operating in embedding space rather than token space — preserves the semantic difficulty of the benchmark while disrupting the memorization pathway. A model that has seen a benchmark question verbatim during training has learned associations anchored in its embedding representation of that exact surface form; a bounded perturbation shifts that representation enough to break the shortcut without making the question semantically unrecognizable.
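The paper's implementation is not exposed in the abstract, so the following is a minimal PyTorch sketch of the constraint itself, not DeconIEP's code: a generator-proposed perturbation is projected onto the ℓ∞ ball by per-coordinate clamping before being added to the input embeddings. The function name and tensor shapes are illustrative assumptions.

```python
import torch

def perturb_embeddings(embeds: torch.Tensor,
                       delta: torch.Tensor,
                       eps: float) -> torch.Tensor:
    """Add a perturbation projected onto the l-infinity ball of radius eps.

    embeds: (batch, seq_len, d_model) input embeddings (illustrative shape)
    delta:  raw perturbation proposed by a generator, same shape as embeds
    eps:    per-dimension budget; no coordinate may deviate by more than eps
    """
    # l-inf projection: clamp each coordinate independently to [-eps, eps].
    delta = delta.clamp(min=-eps, max=eps)
    return embeds + delta
```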
Why the method uses a less-contaminated reference model
The reference model is the signal source that makes the perturbation generator's objective meaningful. Without a cleaner comparator, the perturbation generator has no target behavior to align toward — it would be optimizing without direction.
The paper's framing is precise: decontamination is relative rather than absolute. DeconIEP does not claim to restore a model to a hypothetical zero-contamination baseline; it claims to align the contaminated model's outputs closer to those of a reference that has experienced less leakage. This distinction matters for interpretation. If the reference model itself carries some contamination, the effective decontamination ceiling is bounded by the reference's own contamination level, not by zero.
| Reference-model property | What it contributes | Constraint implied |
|---|---|---|
| Less contaminated than evaluated model | Provides a cleaner behavioral target | The decontamination goal is relative, not absolute |
| Frozen or stable during guidance | Keeps the training signal consistent | Training details must be verified in the full paper |
| Benchmark-specific exclusion history | Strengthens trust in the guidance signal | Ambiguous provenance weakens the ceiling |
Pro Tip: When selecting a reference model for DeconIEP, prioritize models trained on corpora with documented exclusion of the target benchmarks — for example, models that explicitly held out MMLU or GSM8K from pretraining, or models trained before benchmark publication dates. A reference with ambiguous contamination history weakens the decontamination signal and narrows the performance gap the method can close.
The abstract does not specify how the reference model is selected, what contamination threshold qualifies it, or whether it is frozen during perturbation generator training. These details require verification in the full paper's implementation section before using the method in a production evaluation pipeline.
What bounded perturbations change in the model's input space
The ℓ∞ norm constraint on the perturbation is not an incidental implementation detail — it is the mechanism that controls the contamination-utility trade-off. An ℓ∞ bound constrains the maximum absolute deviation per embedding dimension, which limits how far any single dimension can drift from the original input representation. This is structurally the same class of constraint used in adversarial robustness research, applied here with the opposite intent: instead of maximizing model error, the perturbation minimizes memorization-driven advantage.
Practically, this means the benchmark question's surface text is unaltered and reproducible across evaluation runs. Two evaluators using DeconIEP with the same perturbation generator, reference model, and budget ε will produce consistent results — a property that detect-then-filter approaches based on probabilistic detection cannot guarantee because those approaches may disagree on which examples to remove.
Watch Out: The ℓ∞ bound also defines the method's ceiling. A heavily contaminated model whose memorization pathways are deeply encoded across many embedding dimensions may require a larger perturbation budget to disrupt effectively — but increasing ε to compensate degrades benign utility. The paper's snippets do not expose the exact ε values or the shape of the decontamination-vs-utility curve; practitioners should not assume a single budget works uniformly across contamination severities. Verify the ε sweep from the experimental ablations in the full paper before tuning the budget for a specific evaluation setting.
Why instance-adaptive perturbations matter for contaminated examples
A global, fixed perturbation applied uniformly to all benchmark examples would overcorrect on clean examples and under-correct on heavily memorized ones. Instance-adaptive perturbation generation addresses this directly: the perturbation magnitude and direction are conditioned on the specific input, allowing the generator to apply stronger corrections where memorization is more salient and lighter corrections where the model's response is closer to the reference behavior.
This design choice is also what allows the paper to claim minimal benign-utility degradation alongside strong decontamination effectiveness. The abstract states: "Across multiple open-weight LLMs and benchmarks, extensive empirical results show that DeconIEP achieves strong decontamination effectiveness while incurring only minimal degradation in benign utility." A uniform perturbation would impose unnecessary input-space distortion on examples that do not need correction, systematically suppressing performance on uncontaminated test items.
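The retrieved snippets do not include training code, so the sketch below is a hedged reconstruction of what an instance-adaptive, reference-guided generator could look like. The two-layer MLP, the tanh-times-ε bounding trick, the KL objective, the Hugging Face-style `inputs_embeds` interface, and the assumption that both models share a tokenizer (so their output distributions are directly comparable) are all assumptions, not details verified from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PerturbationGenerator(nn.Module):
    """Maps each token embedding to a bounded, input-conditioned perturbation.
    tanh keeps the raw output in (-1, 1); scaling by eps enforces the
    l-infinity budget by construction."""
    def __init__(self, d_model: int, eps: float):
        super().__init__()
        self.eps = eps
        self.net = nn.Sequential(
            nn.Linear(d_model, d_model), nn.GELU(), nn.Linear(d_model, d_model)
        )

    def forward(self, embeds: torch.Tensor) -> torch.Tensor:
        return self.eps * torch.tanh(self.net(embeds))

def training_step(generator, evaluated_model, reference_model,
                  input_ids, optimizer):
    """One hypothetical step: align the evaluated model's distribution on
    perturbed embeddings with the reference model's distribution on the
    clean input. Both models are assumed frozen; only the generator trains."""
    optimizer.zero_grad()
    # Embed the prompt with the evaluated model's own embedding table.
    embeds = evaluated_model.get_input_embeddings()(input_ids)
    delta = generator(embeds)  # instance-adaptive perturbation
    with torch.no_grad():  # the reference supplies a fixed guidance signal
        ref_logits = reference_model(input_ids=input_ids).logits
    eval_logits = evaluated_model(inputs_embeds=embeds + delta).logits
    loss = F.kl_div(F.log_softmax(eval_logits, dim=-1),
                    F.softmax(ref_logits, dim=-1),
                    reduction="batchmean")
    loss.backward()
    optimizer.step()
    return loss.item()
```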
Pro Tip: The instance-adaptive design implies that evaluating the perturbation generator's behavior on a held-out clean benchmark is a meaningful diagnostic. If the generator applies near-zero perturbations to inputs from a benchmark the model was not exposed to during training, that is indirect evidence the generator is learning contamination-specific signals rather than indiscriminate input distortion. The paper's ablations should document this; check the supplementary experiments for generator behavior on clean versus contaminated splits.
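A minimal version of that diagnostic, reusing the generator from the sketch above:

```python
import torch

def mean_perturbation_magnitude(generator, embeds: torch.Tensor) -> float:
    """Average absolute per-coordinate perturbation applied to a batch."""
    with torch.no_grad():
        delta = generator(embeds)
    return delta.abs().mean().item()

# clean_mag  = mean_perturbation_magnitude(gen, clean_split_embeds)
# contam_mag = mean_perturbation_magnitude(gen, contaminated_split_embeds)
# Near-zero clean_mag alongside a clearly larger contam_mag is indirect
# evidence the generator learns contamination-specific signals rather than
# indiscriminate input distortion.
```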
The exact utility-drop percentage and instance-adaptation ablation numbers are not recoverable from the abstract alone. The full results tables in the paper are required before making quantitative claims about the clean-accuracy cost.
Methodology, datasets, and evaluation setup in the paper
The paper evaluates DeconIEP across multiple open-weight LLMs and multiple benchmarks in an experimental setup designed to separate two measurement objectives: decontamination effectiveness (how much leakage-driven inflation is reduced) and benign utility (how much legitimate task performance is preserved). These are not the same axis, and conflating them would allow a method that simply degrades all model capability to look like it "reduces contamination."
The evaluation protocol defines contamination-inflated performance as the target for reduction, and holds benign utility as a constraint — not a secondary objective to be sacrificed freely.
| Evaluation Axis | What It Measures | Why It Matters |
|---|---|---|
| Decontamination effectiveness | Reduction in leakage-driven score inflation on contaminated examples | Validates that the method corrects the right signal |
| Benign utility | Performance on clean (non-contaminated) examples after perturbation | Confirms the method does not degrade legitimate capability |
| Dual-axis result | Both metrics reported simultaneously | Prevents gaming by one-sided optimization |
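The exact metric formulas are not recoverable from the snippets; one plausible way to operationalize the two axes, assuming scores are reported in accuracy points, is a simple paired delta. The function and field names below are illustrative, not the paper's definitions.

```python
def dual_axis_report(raw_contaminated: float, decon_contaminated: float,
                     raw_clean: float, decon_clean: float) -> dict:
    """Decontamination effectiveness = inflation removed on the
    contaminated split; benign utility drop = accuracy lost on the clean
    split. Both in score points (illustrative definitions)."""
    return {
        "inflation_reduction": raw_contaminated - decon_contaminated,
        "benign_utility_drop": raw_clean - decon_clean,
    }

# A method that "wins" by degrading all capability shows up immediately as
# a large benign_utility_drop, which single-axis reporting would hide.
```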
Which open-weight models and benchmarks the paper studies
The paper evaluates multiple open-weight LLMs. The abstract does not enumerate model names in the retrieved snippets, and the exact model families require verification from the experimental section of the full PDF. The benchmarks referenced explicitly in the paper include MMLU and GSM8K, two widely used benchmarks with documented contamination exposure across the broader LLM literature. The paper's keyword metadata also lists MATH as a benchmark in scope.
| Model / Benchmark scope | Verified fact | Numeric detail |
|---|---|---|
| Open-weight LLMs | Multiple open-weight models are evaluated | Count not enumerated in snippets |
| MMLU | Explicitly referenced benchmark | Public since 2020 |
| GSM8K | Explicitly referenced benchmark | Public since 2021 |
| MATH | Listed in keyword metadata | Exact experimental inclusion needs full-paper confirmation |
| Additional benchmarks | Not enumerated in snippets | Unknown from abstract alone |
Watch Out: The retrieved snippets do not enumerate all model families or all benchmarks used in the experiments. Any publication-facing claims about specific models (e.g., LLaMA, Mistral, Qwen) require verification from the paper's experimental section. Do not infer model coverage from keyword metadata alone.
What the evaluation protocol measures beyond raw score inflation
The paper frames decontamination effectiveness and benign utility as complementary but separable measurements. A method that eliminates contamination-driven inflation by making the model fail on all questions satisfies the first metric but fails the second. DeconIEP's dual-axis evaluation is structurally meaningful: it requires the method to reduce the gap between contaminated and reference performance on contaminated examples while preserving performance on examples that were never memorized.
| Protocol Component | Role in Evaluation | Constraint Source |
|---|---|---|
| Contaminated split | Measures inflation reduction | Reference model provides target |
| Clean (benign) split | Measures utility preservation | Perturbation must be bounded |
| ℓ∞ budget ε | Controls the contamination-utility frontier | Ablated in experiments |
| Instance-adaptive generator | Per-example perturbation strength | Trained against reference signal |
The specific metric formulas and threshold definitions are not recoverable from the abstract; the paper's evaluation protocol section contains the authoritative definitions.
What the results say about contamination mitigation and clean accuracy
The paper's abstract makes a directional claim supported by extensive empirical experiments: DeconIEP achieves strong decontamination effectiveness while incurring only minimal degradation in benign utility across multiple open-weight LLMs and benchmarks. This framing positions the method favorably on both axes simultaneously, but the claim is qualified — "strong" and "minimal" are relative terms whose magnitude requires the actual result tables to interpret.
| Reported result dimension | Verified direction | Quantitative detail in snippets |
|---|---|---|
| Decontamination effectiveness | Strong | No per-benchmark delta exposed |
| Benign utility | Minimal degradation | Exact utility-loss percentage not exposed |
| Experimental coverage | Multiple open-weight LLMs and benchmarks | Specific model list not enumerated |
The honest characterization from the available evidence: the paper demonstrates that it is possible to reduce contamination-driven score inflation at inference time without a catastrophic clean-accuracy drop. Whether "minimal" means 0.5 points or 5 points on a given benchmark is a consequential difference for practitioners, and that number requires the full results section.
Where DeconIEP appears strongest across contaminated benchmarks
The abstract's claim of strong decontamination effectiveness across multiple benchmarks is a broad positive result, but the search snippets do not provide per-benchmark rankings or effect sizes. The method is architecturally better positioned to perform well on benchmarks where contamination manifests as pattern-matching shortcuts — settings where the model is retrieving an answer from a memorized surface form rather than reasoning through the problem.
| Benchmark pattern | Expected DeconIEP strength | Verified ordering |
|---|---|---|
| High surface-form memorization | Strongest in principle | Strongest / weakest ordering not recoverable from snippets |
| Paraphrase-level contamination | Moderate in principle | Must be confirmed in the full paper |
| Semantic contamination | Weakest in principle | Must be confirmed in the full paper |
Per-benchmark strongest/weakest performance rankings are not recoverable from the abstract; the full results section provides the authoritative breakdown.
Where the perturbation budget starts to limit gains
The ℓ∞ constraint on perturbation magnitude is both the method's precision instrument and its hard ceiling. As contamination severity increases — meaning the model's memorization of benchmark examples is more deeply encoded across more embedding dimensions — a fixed budget ε becomes insufficient to redirect the model away from the memorized pathway without exceeding the bound.
Watch Out: DeconIEP is not a monotonically improving function of contamination severity. At moderate contamination levels, the bounded perturbation may be able to shift the model off its memorized trajectory while preserving benign utility. At severe contamination — where the model's responses are dominated by recall rather than reasoning — the required perturbation may exceed the budget, leaving residual inflation. The specific ε threshold at which gains taper is not stated in the retrieved snippets; this must be extracted from the experimental ablations. Do not apply a default budget from the paper's main results to a more severely contaminated model without running the ε sweep for that specific setting.
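A minimal sketch of that ε sweep, with placeholder callables standing in for full benchmark runs at each budget (names and budget values are illustrative):

```python
from typing import Callable, Dict, List

def epsilon_sweep(eps_values: List[float],
                  eval_contaminated: Callable[[float], float],
                  eval_clean: Callable[[float], float]) -> List[Dict]:
    """Rerun the dual-axis evaluation at each budget; look for the point
    where the contaminated score stops falling (inflation removed) while
    the clean score keeps degrading (budget too large)."""
    return [{"eps": eps,
             "contaminated_score": eval_contaminated(eps),
             "clean_score": eval_clean(eps)}
            for eps in eps_values]

# frontier = epsilon_sweep([0.0, 0.005, 0.01, 0.05], run_contam, run_clean)
```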
How DeconIEP differs from detect-then-filter evaluation
Detect-then-filter evaluation — the dominant prior approach — proceeds in two stages: first, identify benchmark examples that appear in the model's training data using string matching, n-gram overlap, or learned detectors; second, exclude or reweight those examples before computing the evaluation score. DeconIEP operates on a fundamentally different axis. It does not alter the benchmark composition; it alters the model's input representation during evaluation.
| Dimension | Detect-then-Filter | DeconIEP (Inference-Time) |
|---|---|---|
| Unit of intervention | Benchmark examples (pre-evaluation) | Input embeddings (at evaluation) |
| Benchmark integrity | Reduced — filtered examples excluded from scoring | Preserved — all examples scored |
| Detector requirement | Yes — requires a contamination classifier | No — requires a reference model |
| Reference model required | No | Yes — less-contaminated comparator |
| Error propagation | Detector false negatives leave contamination; false positives reduce valid coverage | Reference model contamination level sets the ceiling |
| Reproducibility | Depends on detector agreement across runs | Deterministic given fixed generator and budget |
The prior inference-time method most directly comparable to DeconIEP is the ITD framework (arXiv:2406.13990), which proposes detecting and rewriting leaked samples without altering their difficulty. ITD still requires a detection step and rewrites the benchmark text; DeconIEP skips detection entirely and operates purely in embedding space without rewriting the prompt.
Why contamination detectors fail under moderate contamination
Contamination detectors operate on statistical signals: n-gram overlap, perplexity differentials, or membership inference scores. These signals are strongest when contamination is extreme — an exact benchmark example reproduced verbatim in training data produces a strong overlap signal. At moderate contamination levels, where training data contains paraphrases, reformulations, or partial overlaps with benchmark items, detector precision and recall degrade. The detector may miss contaminated examples (false negatives that leave inflation uncorrected) or flag clean examples (false positives that remove valid evaluation coverage).
| Contamination level | Detector signal strength | Filter reliability | DeconIEP applicability |
|---|---|---|---|
| Extreme (exact duplicates) | High — strong n-gram overlap | Reliable, high recall | Applicable but may be overkill |
| Moderate (paraphrases, variants) | Weak — noisy overlap signal | Unreliable, high false-negative rate | Applicable — instance-adaptive correction |
| Low (thematic exposure only) | Minimal | Filtering removes valid examples | Limited value from either approach |
The paper's framing positions detection-based mitigation as one of the two prior lines of work and implicitly motivates the inference-time approach as more robust in the moderate-contamination regime where detection is unreliable.
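For concreteness, the toy word-level detector below shows why the overlap signal collapses under paraphrase: any rewording breaks exact n-gram matches. The 8-gram window is an arbitrary illustrative choice, not a threshold from any cited method.

```python
def ngram_set(text: str, n: int = 8) -> set:
    """Word-level n-grams of the input text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_ratio(benchmark_item: str, training_doc: str, n: int = 8) -> float:
    """Fraction of the benchmark item's n-grams found in a training document."""
    bench = ngram_set(benchmark_item, n)
    if not bench:
        return 0.0
    return len(bench & ngram_set(training_doc, n)) / len(bench)

# Verbatim duplication scores near 1.0; a light paraphrase breaks most
# 8-gram matches and the ratio collapses toward 0.0 -- the
# moderate-contamination blind spot described above.
```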
When each approach is the better evaluation choice
- Choose detect-then-filter when contamination is extreme and detectable with high confidence (exact or near-exact duplicates in training), when you have access to the training corpus for direct membership queries, and when excluding contaminated examples from the score does not reduce benchmark coverage below statistical significance thresholds.
- Choose DeconIEP-style inference-time decontamination when you cannot access the training corpus to run direct membership checks, when contamination is suspected at moderate levels where detectors are unreliable, when benchmark integrity (scoring all examples) matters more than minimizing any possibility of residual inflation, or when you need a reproducible, benchmark-preserving correction that does not vary with detector confidence thresholds.
- Choose ITD (detect-then-rewrite) when you need to reuse a known-contaminated benchmark but cannot discard examples, and when the rewriting process can be validated to preserve task difficulty — recognizing that any rewrite introduces its own distributional shift.
- Use no single approach in isolation when contamination rate is unknown; in that scenario, the correct practice is to report scores under multiple conditions (no correction, detect-then-filter, and inference-time decontamination) and present the spread as a bound on the true performance, as sketched after the table below.
| Situation | Preferred approach | Why |
|---|---|---|
| Extreme contamination with corpus access | Detect-then-filter | High-confidence detection is feasible |
| Moderate contamination without corpus access | DeconIEP | Benchmark-preserving correction with reference guidance |
| Known-contaminated benchmark requiring reuse | ITD or DeconIEP | Rewriting or inference-time correction may be appropriate |
| Unknown contamination rate | Report multiple conditions | The spread bounds the likely true performance |
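A minimal sketch of the multi-condition reporting practice recommended above, with placeholder evaluation functions for each regime:

```python
def multi_condition_scores(model_id: str,
                           eval_uncorrected, eval_filtered,
                           eval_deconiep) -> dict:
    """Score one model under each correction regime and report the spread
    as a bound on the plausible true performance. The three evaluation
    callables are placeholders for full benchmark runs."""
    scores = {
        "no_correction": eval_uncorrected(model_id),
        "detect_then_filter": eval_filtered(model_id),
        "inference_time_decon": eval_deconiep(model_id),
    }
    scores["spread"] = max(scores.values()) - min(scores.values())
    return scores
```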
Limitations, caveats, and what the paper does not prove
Watch Out: DeconIEP does not claim to fully eliminate benchmark contamination — and the abstract does not support reading it that way. The method applies small, ℓ∞-bounded perturbations constrained by a fixed budget. "Small" and "bounded" are architectural choices that directly limit the ceiling on contamination reduction. A model whose benchmark memorization is deeply encoded at the representation level may not be fully correctable within the perturbation budget. The paper claims strong effectiveness and minimal utility loss; it does not claim complete elimination of contamination effects or perfect preservation of clean accuracy. These are distinct and weaker claims, and they are the correct claims to cite.
The paper's abstract provides no adversarial or out-of-distribution failure cases, no lower bound on residual contamination after decontamination, and no formal proof of convergence to a contamination-free distribution. The empirical results are described as extensive, but their full scope — model count, benchmark count, contamination severity range — is not exposed in the retrieved snippets and requires the full paper to interpret.
Why the paper is not a universal fix for contaminated evaluation
The two structural constraints that limit universality are explicit in the method design. First, the perturbation budget ε sets a hard ceiling on how much the input representation can be shifted; contamination beyond that ceiling is not fully correctable without increasing ε to a level that degrades benign utility. Second, the reference model's own contamination level sets the ceiling on the decontamination target — a reference model with 20% contamination on a given benchmark cannot guide the evaluated model to a zero-contamination baseline, only to approximately the reference's contamination level.
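A back-of-envelope illustration of the second constraint; the numbers are hypothetical and the additive model is an assumption, not a result from the paper:

```python
def effective_ceiling(true_score: float, reference_inflation: float) -> float:
    """The decontaminated score converges toward the reference model's
    behavior, so roughly reference_inflation points of leakage-driven lift
    can remain after correction (simplified additive assumption)."""
    return true_score + reference_inflation

# Example: true generalization ~62.0 plus a reference carrying ~3 points of
# its own inflation -> the decontaminated score plateaus near 65.0, not 62.0.
```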
Watch Out: The claim of "minimal degradation in benign utility" does not specify the magnitude of that degradation in the abstract. For production evaluation pipelines where clean-benchmark scores are used for model selection, even a 1-2 point suppression of benign utility could affect decisions. Before deploying DeconIEP in a model comparison pipeline, verify the exact utility-loss numbers from the paper's results tables and assess whether the magnitude is acceptable for your evaluation stakes.
These constraints do not invalidate the method — they define its operating regime. DeconIEP is a bounded correction, not a benchmark replacement.
What still needs verification from the full implementation section
The abstract confirms evaluation across multiple open-weight LLMs and benchmarks but omits the implementation details needed for reproduction or cost estimation.
Production Note: Before adopting DeconIEP in an evaluation pipeline, verify the following from the paper's implementation and appendix sections: GPU type and count used for perturbation generator training; number of training steps and convergence criteria for the generator; inference overhead per example (forward-pass cost of the perturbation generator at evaluation time); the exact ε values used in the main experiments and the ε sweep range; and whether the perturbation generator must be retrained per model pair or generalizes across evaluated models with the same reference. The abstract-level source does not expose any of these values.
Implications for practitioners running benchmarked evaluations
The DeconIEP paper shifts the default assumption about where contamination mitigation belongs in an evaluation pipeline. Prior practice treated decontamination as a preprocessing problem — something done to the dataset before evaluation begins. Inference-time decontamination reframes it as an evaluation-time measurement correction, analogous to calibrating a biased sensor rather than replacing it.
For practitioners who evaluate open-weight models on standard benchmarks, the implication is that contamination mitigation can be decoupled from dataset curation decisions. A team that does not control the model's training corpus (the common case for evaluators using third-party models) now has a benchmark-preserving correction path that does not require training-data access.
Bottom Line (Practitioner Summary): If you evaluate open-weight models on benchmarks like MMLU or GSM8K and suspect leakage, DeconIEP provides a benchmark-preserving correction path that does not require training-corpus access or benchmark rewriting. The method is not free — it requires a reference model, a perturbation generator training step, and acceptance of a bounded (but non-zero) benign-utility cost. Report both contaminated and decontaminated scores, and disclose the reference model and perturbation budget in the evaluation methodology.
| Scenario | Recommended Action |
|---|---|
| Third-party open-weight model, no corpus access, standard benchmark | Consider DeconIEP with a documented reference model; report dual scores |
| Own model, full corpus access, standard benchmark | Run direct membership queries first; use detect-then-filter for confirmed duplicates |
| Known-contaminated benchmark, reuse required | Evaluate ITD (arXiv:2406.13990) for rewriting approach; compare with DeconIEP results |
| Low contamination suspected, clean corpus | Standard evaluation; document corpus provenance; no correction required |
How to decide between cleaning data, scoring examples, and decontaminating at test time
Three remediation strategies exist in the current literature, each appropriate to a different situation:
- Clean data upstream (pre-training deduplication): Appropriate when you control the training corpus and can run benchmark deduplication before training begins. This is the only approach that eliminates contamination structurally. Cost: high (requires corpus access and deduplication tooling); benefit: permanent.
- Score or reweight examples (detect-then-filter at evaluation): Appropriate when you have partial training corpus access or can run statistical membership inference. Reliable only at high contamination severity. Cost: moderate (detector training or n-gram scanning); benefit: moderate, degrades at moderate contamination.
- Decontaminate at test time (DeconIEP): Appropriate when you lack corpus access, contamination is suspected but not confirmed at extreme levels, or benchmark integrity must be preserved. Cost: requires reference model and perturbation generator; benefit: benchmark-preserving correction with quantified utility trade-off.
| Scenario | Corpus Access | Contamination Severity | Recommended Approach |
|---|---|---|---|
| Pre-training stage | Full | Any | Data cleaning (deduplication) |
| Post-training evaluation, corpus accessible | Full or partial | Extreme (exact duplicates) | Detect-then-filter |
| Post-training evaluation, corpus inaccessible | None | Unknown or moderate | DeconIEP |
| Benchmark reuse for known-contaminated model | None | Confirmed | ITD rewriting or DeconIEP |
How to report results honestly when contamination is suspected
When contamination is suspected but unconfirmed, the baseline reporting standard is to present the observed benchmark score alongside a contamination disclosure statement — the training corpus provenance, any known overlap with benchmark release dates, and any mitigation steps applied. If DeconIEP is applied, the report should include: the reference model identity and its own contamination status, the perturbation budget ε, both the raw (contaminated) and decontaminated scores, and the observed benign-utility delta on a clean held-out set.
Pro Tip: Treat the gap between raw and decontaminated scores as a diagnostic signal, not just a correction. A large gap suggests the model's benchmark performance is substantially contamination-driven and the raw score should not be used as evidence of generalization. A small gap suggests contamination is not the dominant driver of the observed performance, and the raw score is more defensible. Reporting both scores with the gap prominently disclosed is strictly more informative than reporting either alone.
Transparent reporting also means documenting benign utility: if DeconIEP suppresses clean-benchmark accuracy by a measurable margin, that suppression should appear in the evaluation report. Omitting the utility cost while advertising the decontamination effectiveness is a form of selective reporting that the paper's own dual-axis framing explicitly works against.
| Reporting element | Required when DeconIEP is used | Why it matters |
|---|---|---|
| Reference model identity | Yes | Contextualizes the guidance signal |
| Reference contamination status | Yes | Sets the effective ceiling |
| Perturbation budget ε | Yes | Defines the correction strength |
| Raw and decontaminated scores | Yes | Shows the effect size transparently |
| Clean-held-out utility delta | Yes | Reveals the cost of correction |
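A hedged disclosure-record sketch covering the elements in the table above; the field names are illustrative, not a standard schema:

```python
def deconiep_disclosure(raw_score: float, decon_score: float,
                        clean_utility_delta: float, reference_model: str,
                        reference_contamination_note: str,
                        eps: float) -> dict:
    """Bundle every required reporting element into one record."""
    return {
        "raw_score": raw_score,
        "decontaminated_score": decon_score,
        "contamination_gap": raw_score - decon_score,  # diagnostic signal
        "benign_utility_delta": clean_utility_delta,
        "reference_model": reference_model,
        "reference_contamination_status": reference_contamination_note,
        "perturbation_budget_eps": eps,
    }
```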
Frequently asked questions on benchmark contamination and DeconIEP
What is benchmark contamination in LLMs?
Benchmark contamination occurs when test samples from an evaluation benchmark — or close variants of those samples — appear in the model's training data. The model learns the correct answers during training rather than at evaluation time, inflating reported scores without reflecting genuine generalization capability. The paper cites prior findings of contamination rates from 1% to 45% across six popular benchmarks and over 15 LLMs.
How does inference-time decontamination work?
DeconIEP applies small, ℓ∞-bounded perturbations to input embeddings during evaluation. A perturbation generator, trained against the signal from a relatively less-contaminated reference model, produces instance-specific embedding shifts that steer the evaluated model away from memorization-driven shortcuts while preserving the benchmark prompt verbatim.
Does decontamination reduce model accuracy?
The paper reports minimal degradation in benign utility alongside strong decontamination effectiveness, but does not quantify the magnitude of "minimal" in the abstract. The utility cost is real and bounded by the perturbation budget; it is not zero. Exact numbers require the full results tables.
What is the difference between contamination detection and decontamination?
Detection identifies which benchmark examples are contaminated; filtering removes them from the evaluation set. Decontamination, as implemented in DeconIEP, corrects contamination effects during the evaluation forward pass without altering the benchmark's composition. Detection is upstream; decontamination is concurrent with evaluation.
Can benchmark contamination be fully removed?
Not by DeconIEP, and the paper does not claim otherwise. The method is bounded by the perturbation budget and the reference model's own contamination level. It reduces contamination-driven inflation substantially but does not restore the model to a provably zero-contamination state. Full removal requires training-data deduplication before pretraining — a fundamentally different intervention.
Bottom Line: DeconIEP is a practically deployable, benchmark-preserving decontamination tool for cases where training-corpus access is unavailable. It is not a universal fix, does not claim complete contamination elimination, and requires careful reference-model selection and perturbation-budget tuning to deploy responsibly. Its most significant contribution is shifting the intervention point from data curation to the evaluation forward pass — a meaningful change for evaluation practitioners who do not control the models they assess.
Sources & References
- When Benchmarks Leak: Inference-Time Decontamination for LLMs — Jianzhe Chai et al. (arXiv:2601.19334) — Primary source; introduces DeconIEP, the inference-time embedding-perturbation decontamination framework evaluated across multiple open-weight LLMs and benchmarks.
- When Benchmarks Leak — Full PDF — PDF version; contains implementation details, experimental setup, and result tables not fully exposed in the abstract.
- Inference-Time Decontamination: Reusing Leaked Benchmarks for Testing Large Language Models (arXiv:2406.13990) — Prior inference-time decontamination paper (ITD); proposes detecting and rewriting leaked samples without altering difficulty; baseline for comparison with DeconIEP.
Keywords: DeconIEP, When Benchmarks Leak: Inference-Time Decontamination for LLMs, benchmark contamination, test-set contamination, inference-time decontamination, detect-then-filter evaluation, open-weight LLMs, embedding-space perturbation, reference model, GSM8K, MATH, MMLU, arXiv 2601.19334, benchmark leakage, benign utility

