How We Compared Ambrosia and deduplicate-text-datasets
The comparison rests on four criteria that matter when building an LLM training pipeline: algorithmic exactness, scale limits, operational complexity, and reproducibility. A tool that runs cleanly on small corpora is a different product from one engineered for C4-scale jobs, and conflating them wastes engineering time in both directions.
Google's deduplicate-text-datasets is the implementation released with the paper "Deduplicating Training Data Makes Language Models Better". As the README states: "This repository contains the code and scripts used in the paper 'Deduplicating Training Data Makes Language Models Better.'" It ships two deduplication modes — ExactSubstr (exact substring, Rust-based) and NearDup (near-duplicate clustering) — and releases pre-computed document clusters for C4, RealNews, LM1B, and Wiki-4B-en. The repo is archived and not an officially supported Google product, but its methodology and released artifacts make it the reference implementation in this comparison.
Ambrosia is a Go package published under github.com/reactorsh/ambrosia. Its public interface exposes CLI-style flags (--baseurl, --instruction), signaling a workflow optimized for fast adoption over configurable algorithmic depth. No peer-reviewed paper backs it; no hardware benchmarks appear in its public docs. The comparison is therefore asymmetric by design: Ambrosia trades rigor for ergonomics, and the right choice depends entirely on where your dataset sits in the spectrum from prototype corpus to publication-grade training set.
| Criterion | Ambrosia | deduplicate-text-datasets |
|---|---|---|
| Algorithmic exactness | Not publicly documented | ExactSubstr (suffix-array, Rust) + NearDup clustering |
| Near-duplicate support | Not verified from source | Yes — pre-computed clusters for C4, RealNews, LM1B, Wiki-4B-en |
| Language / runtime | Go | Rust + C compiler required |
| Scale ceiling (documented) | Not published | ~300 GB with 96 cores, >600 GB RAM, >1 TB disk |
| Reproducibility | Not benchmarked publicly | Tied to a published paper + released cluster outputs |
| Operational complexity | Low (CLI-first, single binary) | High (Rust build chain, Python deps for demos) |
| Official support | Package on pkg.go.dev | Archived GitHub repo, no official Google support |
At-a-Glance Comparison Table
Exact and near-duplicate deduplication solve different problems. Exact substring deduplication removes passages that appear verbatim across multiple documents; the suffix-array-based ExactSubstr approach in Google's repo identifies these efficiently even at hundreds of GB. Near-duplicate deduplication clusters documents that overlap heavily but are not byte-identical; the NearDup component in the repo outputs those clusters for downstream filtering decisions. Most pipelines need both: exact dedup catches boilerplate, legal footers, and scraped duplicates, while near-dedup catches lightly edited or reformatted copies that inflate apparent dataset diversity without adding signal.
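To make the distinction concrete, here is a minimal sketch using only the Python standard library. It is not code from either tool: `exact_key`, `shingles`, and `jaccard` are hypothetical helper names, and word 3-grams are an arbitrary choice for illustration.

```python
import hashlib


def exact_key(text: str) -> str:
    """Fingerprint that matches only byte-identical text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def shingles(text: str, n: int = 3) -> set:
    """Set of word n-grams; lower() and split() hide case and whitespace noise."""
    words = text.lower().split()
    if len(words) < n:
        return {tuple(words)}
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: str, b: str, n: int = 3) -> float:
    """Jaccard similarity of the two shingle sets, in [0, 1]."""
    sa, sb = shingles(a, n), shingles(b, n)
    return len(sa & sb) / len(sa | sb)


original = "All rights reserved. Unauthorized copying of this page is prohibited."
reformatted = "All rights  reserved. Unauthorized copying of this page is PROHIBITED."
unrelated = "The suffix array stores sorted suffix positions for the corpus."

print(exact_key(original) == exact_key(reformatted))  # False
print(jaccard(original, reformatted))                 # 1.0
print(jaccard(original, unrelated))                   # 0.0
```

The reformatted footer escapes the exact fingerprint because of one extra space and a case change, yet its shingle set is identical to the original's. That gap is the reason the two classes of dedup are usually run together.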
Ambrosia operates as a lightweight Go package without published documentation of which deduplication class it implements or how it scales. Treating it as a substitute for ExactSubstr on research corpora requires assumptions the public docs do not support.
| Feature | Ambrosia | deduplicate-text-datasets |
|---|---|---|
| Exact substring dedup | Not documented | ✅ ExactSubstr in Rust |
| Near-duplicate clustering | Not documented | ✅ NearDup; clusters released for C4, RealNews, LM1B, Wiki-4B-en |
| Language / runtime | Go | Rust + C compiler |
| Small-corpus hardware | Not documented | ~16 GB RAM, few CPU cores |
| Large-corpus hardware | Not published | 96 cores, >600 GB RAM, >1 TB disk for ~300 GB data |
| Paper-backed methodology | No | Yes — arXiv:2107.06499 |
Ambrosia: where lightweight packaging helps
Ambrosia packages as a standard Go module (github.com/reactorsh/ambrosia), which means installation is a single go install or go get without managing a Rust toolchain, a C compiler, or a Python virtual environment with TensorFlow dependencies. Its CLI surface — flags like --baseurl and --instruction — signals that the intended workflow is scripted pipelines and quick iteration rather than configurable algorithmic depth.
For teams already running Go services or building data pipelines where the dedup step is a pre-filter rather than the primary quality gate, Ambrosia's packaging model eliminates meaningful friction. The tradeoff is that no public documentation quantifies its accuracy, memory footprint, or scale ceiling, so the operational envelope is inferred from its design rather than measured.
| Ambrosia characteristic | Detail |
|---|---|
| Runtime | Go — single binary, no Rust or C build step |
| Interface | CLI-centric (--baseurl, --instruction flags) |
| Packaging | Standard Go module on pkg.go.dev |
| Best-fit corpus size | Small to medium (exact ceiling not published) |
| Reproducibility | Not benchmarked against published baselines |
| Integration burden | Low — fits naturally into Go-based data pipelines |
Best-fit use cases for small and medium corpora
Ambrosia fits cleanly when the deduplication step is a sanity filter on a corpus you assembled yourself — internal documents, domain-specific crawls under a few GB, or synthetic instruction sets where you control the source. In these scenarios, catching obvious verbatim duplicates matters more than guaranteeing a provably complete near-duplicate sweep, and setup time is a real cost.
Pro Tip: For prototyping a new domain-specific fine-tune, a fast CLI dedup pass that removes blatant repeats is usually enough to unblock training. Invest in research-grade dedup only once you've confirmed the dataset composition, not before. Ambrosia's Go packaging means you can wire it into an existing pipeline in under an hour — use that advantage early, and switch tools when scale or reproducibility requirements harden.
Where Ambrosia starts to fall short
Ambrosia's limitations surface in two scenarios: corpora above the scale where its undocumented memory model breaks down, and pipelines where deduplication decisions must be auditable or reproducible across team members. Neither threshold is published in Ambrosia's public docs, which itself is the problem.
Watch Out: For corpora above ~10 GB, or any pipeline destined for a published model or dataset release, Ambrosia's lack of documented scale limits and published accuracy creates an unquantifiable risk. Near-duplicate clustering relies on approximate-similarity algorithms such as MinHashLSH or SimHash, and exact substring matching relies on suffix arrays; the memory and runtime behavior of all of them at scale depends heavily on the implementation. The Google repo's requirements (96 cores, >600 GB RAM, and >1 TB disk for ~300 GB of data) exist because the problem is genuinely hard at that scale; a lightweight Go binary with no published benchmarks should not be assumed to handle it differently without evidence.
The RAM question matters concretely: if your corpus is approaching C4 scale (~300 GB), you need a tool whose hardware requirements are documented and whose algorithm is auditable. Ambrosia provides neither.
google-research/deduplicate-text-datasets: the research-grade option
The google-research/deduplicate-text-datasets repository gives practitioners two things most dedup tools do not: a published algorithmic methodology tied to a peer-reviewed paper, and pre-computed clustering outputs for widely used corpora. The README is explicit: "We also release the document clusters resulting from running NearDup deduplication on C4, RealNews, LM1B, and Wiki-4B-en." Teams training on any of those corpora can consume the released clusters directly without re-running the full pipeline.
The repo's ExactSubstr component is implemented in Rust for memory safety and performance. Its NearDup component produces document-level cluster outputs, not just a deduplicated corpus, which means downstream teams can implement custom merge strategies rather than accepting a binary keep/drop decision. The build prerequisites — Rust toolchain, C compiler, and Python dependencies used for demos and dataset-generation scripts, including numpy, scipy, sentencepiece, and a TensorFlow requirements file — reflect the research provenance: this was built to be correct and reproducible, not to minimize setup time.
One material operational caveat: the repository is archived on GitHub and explicitly noted as not an officially supported Google product. Teams depending on it for production pipelines inherit maintenance responsibility.
| Component | Detail |
|---|---|
| ExactSubstr | Rust implementation; removes exact substring matches at corpus scale |
| NearDup | Produces document clusters; pre-computed outputs released for C4, RealNews, LM1B, Wiki-4B-en |
| Inspection scripts | Python scripts for inspecting and analyzing dedup results |
| Build requirements | Rust toolchain + C compiler; Python deps for demos and dataset generation (numpy, scipy, sentencepiece, requirements-tf.txt) |
| Repo status | Archived; not officially supported by Google |
| Paper | "Deduplicating Training Data Makes Language Models Better", arXiv:2107.06499 |
ExactSubstr and near-duplicate clustering in the repo
ExactSubstr locates all substrings above a configurable length threshold that appear in more than one document, then removes or flags them. The implementation uses a suffix-array construction over the concatenated corpus, which scales to hundreds of GB at the cost of proportionally large RAM requirements. NearDup operates at the document level, grouping documents by approximate similarity and releasing those cluster assignments for downstream use.
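The suffix-array idea can be sketched at toy scale. This is an illustration of the general technique, not the repo's Rust implementation: it sorts suffixes naively, and `repeated_substrings` and `min_len` are hypothetical names. It reports any substring that repeats anywhere in the concatenated text.

```python
def suffix_array(text: str) -> list:
    """Start offsets of all suffixes in sorted order (naive, toy scale only)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def common_prefix_len(a: str, b: str) -> int:
    """Length of the longest common prefix of a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def repeated_substrings(text: str, min_len: int) -> set:
    """Substrings of at least min_len characters that occur at least twice.

    Suffixes that share a long prefix sit next to each other in the sorted
    order, so comparing neighbours is enough to surface every repeat.
    """
    sa = suffix_array(text)
    found = set()
    for prev, cur in zip(sa, sa[1:]):
        k = common_prefix_len(text[prev:], text[cur:])
        if k >= min_len:
            found.add(text[prev:prev + k])
    return found


docs = ["terms of use apply. cats purr.", "dogs bark. terms of use apply."]
corpus = "\n".join(docs)
matches = repeated_substrings(corpus, min_len=15)
print(max(matches, key=len))  # terms of use apply.
```

The result set also contains the shorter tails of the same match (for example the match minus its first character), which a real pipeline would merge into a single span before removing anything.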
The distinction matters operationally: ExactSubstr is a filter (it removes content); NearDup is a clustering step (it groups documents and leaves the merge decision to the user). Most research pipelines use both: ExactSubstr to remove verbatim boilerplate and NearDup to collapse lightly edited or reformatted duplicates that inflate dataset size without adding diversity.
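As a sketch of what "leaving the merge decision to the user" can look like, the snippet below groups document ids with union-find and then applies one possible policy. The input format (a list of similar id pairs) and the `keep_longest` policy are assumptions for illustration; the format of the released cluster files is not described here.

```python
from collections import defaultdict


def cluster(n_docs, similar_pairs):
    """Group document ids into clusters with union-find over similar pairs."""
    parent = list(range(n_docs))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in similar_pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)

    groups = defaultdict(list)
    for doc_id in range(n_docs):
        groups[find(doc_id)].append(doc_id)
    return sorted(groups.values())


def keep_longest(clusters, docs):
    """One possible merge policy: keep the longest document of each cluster."""
    return sorted(max(c, key=lambda i: len(docs[i])) for c in clusters)


docs = ["a b c", "a b c d", "x y", "a b c d e", "x y z"]
pairs = [(0, 1), (1, 3), (2, 4)]  # hypothetical near-duplicate pairs

clusters = cluster(len(docs), pairs)
print(clusters)                      # [[0, 1, 3], [2, 4]]
print(keep_longest(clusters, docs))  # [3, 4]
```

Swapping `keep_longest` for a different policy (keep the earliest crawl, keep the highest-quality source) changes the surviving documents without re-running the clustering.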
Pro Tip: Exact substring removal alone is insufficient for scraped web corpora. A paragraph scraped from 500 mirror sites will have minor HTML-introduced variation — different whitespace, truncated sentences — that ExactSubstr will not catch as duplicates. NearDup clustering handles this class of near-identical content. On C4, the paper reports that near-duplicate removal substantially reduced train-test overlap; exact dedup alone would have missed a meaningful fraction of those cases.
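MinHash is a common way to catch this kind of variation. The sketch below is a generic standard-library version, not the repo's NearDup code; the 5-word shingles, 64 hash functions, and helper names are all assumptions chosen for illustration.

```python
import hashlib


def shingle_set(text: str, n: int = 5) -> set:
    """Word n-grams; split() collapses any run of whitespace."""
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}


def _hash64(item: str, seed: int) -> int:
    """64-bit hash of item, salted so each seed acts as a separate hash function."""
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "big")


def minhash(text: str, num_perm: int = 64, n: int = 5) -> list:
    """MinHash signature: the minimum hash over all shingles, once per seed."""
    items = shingle_set(text, n)
    return [min(_hash64(s, seed) for s in items) for seed in range(num_perm)]


def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of matching signature slots, an estimate of Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


page = "the quick brown fox jumps over the lazy dog near the river bank today"
mirror = "the quick  brown fox jumps over\tthe lazy dog near the river bank today"
truncated = "the quick brown fox jumps over the lazy dog near the river"
other = "suffix arrays index every position of a corpus in sorted order"

sig = minhash(page)
print(estimated_jaccard(sig, minhash(mirror)))     # identical shingle sets
print(estimated_jaccard(sig, minhash(truncated)))  # true Jaccard is 8/10
print(estimated_jaccard(sig, minhash(other)))      # no shared shingles
```

The whitespace-only variant produces the same shingle set and therefore the same signature, while the truncated copy scores close to its true Jaccard similarity of 0.8. A byte-exact comparison of whole documents would treat all three as distinct.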
What the README implies about scale and hardware
The README's hardware guidance is the clearest signal of the tool's target operating regime. Small datasets — on the order of 10 GB — run on a modern machine with roughly 16 GB RAM and a few CPU cores. Once the corpus approaches C4 scale, the requirements escalate sharply.
| Corpus size | CPU cores | RAM | Disk |
|---|---|---|---|
| ~10 GB (small) | A few cores | ~16 GB | Proportional to corpus |
| ~300 GB (C4 scale) | 96 cores (recommended) | >600 GB | >1 TB |
These numbers are not conservative estimates padded for safety; they reflect the setup the README describes for the C4 run. ExactSubstr builds a suffix array over the concatenated corpus, which means holding one sorted position for every byte of input; at 300 GB of raw text, the resulting data structures exceed what a standard cloud instance provides. Teams planning C4-scale jobs should budget accordingly, either for a high-memory bare-metal instance or a managed distributed processing service.
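A back-of-envelope estimate shows why the footprint grows faster than the raw text. The sketch assumes one stored offset per input byte and that each offset uses the smallest whole number of bytes able to address the corpus. Both are assumptions for illustration, not a description of how the repo lays out its index or splits it between RAM and disk.

```python
def suffix_array_bytes(corpus_bytes: int) -> int:
    """Estimated size of a suffix array holding one offset per input byte.

    Assumption: every offset is stored in the minimum whole number of bytes
    that can address any position in the corpus.
    """
    bits_needed = (corpus_bytes - 1).bit_length()
    pointer_bytes = max(1, (bits_needed + 7) // 8)
    return corpus_bytes * pointer_bytes


GB = 10**9
for size_gb in (10, 300):
    index_gb = suffix_array_bytes(size_gb * GB) // GB
    print(f"{size_gb} GB corpus -> ~{index_gb} GB index")
# 10 GB corpus -> ~50 GB index
# 300 GB corpus -> ~1500 GB index
```

Under these assumptions a 300 GB corpus needs 5-byte offsets, so the index alone is about five times the raw text. That is in the same range as the README's guidance of more than 1 TB of disk.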
Benchmarks and operational trade-offs that matter in practice
The core competitive gap between these tools is not throughput; it is algorithmic completeness and the downstream impact on model quality. The paper backing Google's repo provides the most direct evidence: "Deduplication allows us to train models that emit memorized text ten times less frequently and require fewer train steps to achieve the same or better accuracy." In the paper's reported results, deduplication also reduces perplexity by up to 10% in some cases, and it reduces train-test overlap, which the paper reports affects more than 4% of the validation set of standard datasets.
These numbers are not achievable with a dedup pass that misses near-duplicate content. A tool whose algorithm is undocumented cannot make reproducible guarantees about which of those gains it captures. Ambrosia's lighter packaging is a genuine advantage for fast iteration; it is not a substitute for ExactSubstr + NearDup when dataset quality is the primary variable being controlled.
| Metric | Google repo (paper-reported) | Ambrosia |
|---|---|---|
| Memorization reduction | 10× fewer memorized emissions | Not benchmarked |
| Perplexity improvement | Up to 10% in some cases | Not benchmarked |
| Train-test overlap | Reduced; overlap affects >4% of the validation set before dedup | Not benchmarked |
| C4-scale hardware (verified) | 96 cores, >600 GB RAM, >1 TB disk | Not documented |
| Small-corpus hardware | ~16 GB RAM, few cores | Not documented |
| Setup time | High (Rust + C + Python deps) | Low (Go module, single binary) |
No published head-to-head benchmark compares Ambrosia and the Google repo on the same corpus. The operational comparison is therefore anchored to verified constraints from each tool's public documentation rather than a controlled experiment.
Small-corpus workflow versus research-corpus workflow
The workflow divergence maps cleanly to corpus size.
| Decision dimension | Small corpus (≤~10 GB) | Research corpus (~300 GB, C4 scale) |
|---|---|---|
| Dataset type | Internal, self-assembled, prototype-scale | Public benchmark, release-scale, or model-training artifact |
| Reproducibility need | Team-internal only | External or publication-grade |
| Recommended tool | Ambrosia (or Google repo) | Google repo only |
| Setup effort | Minutes (Go install) vs. hours (Rust build + Python env) | Google repo; setup cost is fixed, not per-run |
| RAM requirement | ~16 GB sufficient for Google repo | >600 GB for Google repo |
| Disk requirement | Proportional to corpus | >1 TB |
| CPU requirement | A few cores | 96 cores recommended |
| Near-dedup support | Ambrosia: unknown; Google repo: yes | Google repo: yes |
| Reproducibility | Ambrosia: not documented; Google repo: paper-anchored | Google repo: paper-anchored |
| Publication-grade use | Usually unnecessary | Required |
For small corpora where speed of setup is the binding constraint and the corpus is self-assembled (not a public benchmark dataset), Ambrosia's Go packaging means a lower total setup cost. For anything requiring auditability, large scale, or both, the Google repo is the only verified option of the two.
Where exactness beats convenience
Research-grade dedup becomes mandatory — not optional — when the dataset is destined for a published model, a public benchmark release, or any training run where memorization and train-test contamination are evaluation criteria.
Pro Tip: If your model will be evaluated on benchmarks that share source material with your training corpus (a near-universal condition for web-scraped data), running only a lightweight dedup pass creates a contamination risk you cannot quantify post-hoc. The Google paper quantifies the impact directly: models trained on deduplicated data emit memorized content ten times less often and show measurably lower train-test overlap. That gap exists regardless of which convenience tool you used upstream — the contamination is already in the weights.
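Overlap can at least be measured before training starts. The following is a minimal sketch of one way to do that, assuming word n-grams as the unit of comparison; `overlap_fraction` and the window size are hypothetical choices, not the paper's methodology.

```python
def ngrams(text: str, n: int) -> set:
    """Set of word n-grams in text (empty if text has fewer than n words)."""
    words = text.split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_fraction(train_docs, eval_docs, n: int = 8) -> float:
    """Fraction of eval documents sharing at least one word n-gram with train."""
    if not eval_docs:
        return 0.0
    train_grams = set()
    for doc in train_docs:
        train_grams |= ngrams(doc, n)
    hits = sum(1 for doc in eval_docs if ngrams(doc, n) & train_grams)
    return hits / len(eval_docs)


train = [
    "the cat sat on the mat by the door",
    "completely different training text goes right here",
]
held_out = [
    "we saw that the cat sat on the mat yesterday",
    "nothing in common with the other set at all",
]

print(overlap_fraction(train, held_out, n=4))  # 0.5
```

Tracking this number before and after each dedup pass gives a concrete signal of how much contamination a lightweight tool leaves behind. At real corpus sizes the in-memory set would have to be replaced by something more scalable.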
The suffix-array approach in ExactSubstr finds duplicates that hash-based methods miss, particularly for short repeated substrings embedded in longer documents. When a dataset is the product — the artifact that determines the model's legal and empirical properties — algorithmic completeness outweighs setup friction by a wide margin.
Decision matrix: which pipeline should you choose?
| Scenario | Dataset size | Reproducibility | Operational burden | Choice |
|---|---|---|---|---|
| Internal prototype corpus | ≤ ~10 GB | Team-internal only | Low, Go-first integration | ✅ Ambrosia is sufficient |
| Domain crawl or private fine-tune set | ~10–100 GB | Helpful but not externally audited | Moderate; hardware requirements are unknown for Ambrosia | ⚠️ Google repo if auditability matters |
| Publication-grade training data | ~300 GB (C4 scale) | Required | High; 96 cores, >600 GB RAM, >1 TB disk | ✅ deduplicate-text-datasets |
| Public benchmark or dataset release | Any size with contamination risk | Required | High; released cluster outputs and paper trail matter | ✅ deduplicate-text-datasets |
| Rapid experimentation | Small to medium | Team-internal only | Low | ✅ Ambrosia |
| Model evaluation tied to memorization and overlap | Any corpus with shared source material | Required | High | ✅ deduplicate-text-datasets |
| Scenario | Ambrosia | deduplicate-text-datasets |
|---|---|---|
| Corpus size ≤ ~10 GB, internal use | ✅ Sufficient | ✅ Also viable; higher setup cost |
| Corpus size ~10–100 GB | ⚠️ Scale limits unknown | ✅ Viable on high-memory instance |
| Corpus size ~300 GB (C4 scale) | ❌ Not documented for this scale | ✅ Designed for this; 96 cores, >600 GB RAM, >1 TB disk |
| Reproducibility required | ❌ Not benchmarked publicly | ✅ Tied to published paper |
| Near-duplicate clustering needed | ❌ Not documented | ✅ NearDup with released cluster outputs |
| Publication or dataset release | ❌ Insufficient auditability | ✅ Mandatory |
| Prototype / rapid iteration | ✅ Fast adoption, low friction | ⚠️ High setup cost for prototyping |
| Go-based pipeline integration | ✅ Native | ❌ Requires Rust + C + Python |
| Team has Rust build capacity | Irrelevant | ✅ Prerequisite met |
Choose Ambrosia when speed of adoption matters more than maximal rigor
Ambrosia is the right choice when the dedup step is a pre-filter on a small, self-assembled corpus where no published baseline or external evaluation will scrutinize duplicate rates. Its Go packaging means a data engineer can wire it into an existing pipeline without acquiring Rust toolchain expertise or managing a Python environment with TensorFlow dependencies. For internal fine-tune experiments, domain adaptation on proprietary corpora under ~10 GB, or rapid prototyping where the dataset composition will change before training begins, Ambrosia's ergonomics deliver real time savings.
The key condition: your evaluation criteria do not depend on provably complete near-duplicate removal, and reproducibility is a team-internal concern rather than an external standard.
Choose google-research/deduplicate-text-datasets when the dataset is the product
When the deduplicated corpus is itself the artifact (a public release, a training set for a published model, or a benchmark dataset), the Google repo is the only verified option of the two. Its ExactSubstr implementation removes verbatim substring repetition with a documented algorithm; its NearDup component produces auditable cluster outputs that teams can inspect, reproduce, and cite. The paper "Deduplicating Training Data Makes Language Models Better" provides the empirical grounding: a 10× reduction in memorized emissions, up to 10% perplexity improvement, and reduced train-test overlap, which the paper reports affects more than 4% of the validation set of standard datasets.
The operational cost is real: Rust toolchain, C compiler, Python dependencies, and for C4-scale work, a 96-core machine with more than 600 GB of RAM and a terabyte of disk. Teams without that infrastructure need to provision it. The repo is also archived and carries no official Google support commitment — factor in that the maintenance burden falls on the consuming team.
FAQ
What is text deduplication in NLP?
Text deduplication identifies and removes repeated content from a corpus before it is used for model training. Repetitions range from exact byte-level matches (the same paragraph scraped from 200 mirror sites) to near-duplicate documents that differ only in whitespace, punctuation, or minor paraphrasing. Removing them prevents models from memorizing repeated passages, reduces training steps wasted on redundant data, and improves the validity of held-out evaluation sets.
Why is deduplication important for LLM training data?
Without deduplication, a language model disproportionately memorizes the most frequent content in its training set: boilerplate legal text, SEO-farmed articles, templated content. The paper backing Google's repo quantifies this: models trained on deduplicated data emit memorized text ten times less frequently and can reach the same or better accuracy in fewer training steps. Deduplication also reduces train-test overlap, where evaluation examples also appear in the training data; the paper reports that such overlap affects more than 4% of the validation set of standard datasets, so removing it directly improves evaluation integrity.
What is the difference between exact and near-duplicate text deduplication?
Exact deduplication (ExactSubstr) identifies substrings or documents that are byte-for-byte identical above a configurable length threshold. Near-duplicate deduplication (NearDup) clusters documents that are similar but not identical, catching lightly edited or reformatted copies. Most production pipelines apply both: exact dedup is faster and fully deterministic, while near-dedup requires approximate similarity computation (for example via MinHashLSH or SimHash) and involves a similarity threshold decision that affects recall.
How much RAM do I need for large-scale text deduplication?
For small corpora (~10 GB), roughly 16 GB of RAM and a few CPU cores suffice when using the Google repo's ExactSubstr. At C4 scale (~300 GB), the repo's README documents 96 cores, more than 600 GB of RAM, and more than 1 TB of disk. The suffix array holds one sorted position for every byte of the concatenated corpus, which drives the memory requirement up with corpus size. Ambrosia does not publish memory requirements for any corpus size.
Is MinHash or SimHash better for text deduplication?
Neither dominates across all regimes. MinHashLSH provides probabilistically bounded false-negative rates and is well-suited to document-level near-duplicate detection at scale. SimHash is faster to compute and works well for shorter documents and deduplication at the paragraph or sentence level. The choice depends on document length distribution, acceptable false-negative rate, and compute budget. No verified head-to-head benchmark between the two was retrieved for this article; consult the deduplication literature directly for corpus-specific guidance.
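For readers who have only seen MinHash, a minimal SimHash sketch may help with the comparison. It uses whole words as features with equal weight, which is a simplifying assumption; production variants typically weight features and use shingles. The function names are hypothetical.

```python
import hashlib


def simhash(text: str, bits: int = 64) -> int:
    """SimHash fingerprint: per-bit majority vote over hashed word features.

    bits must be a multiple of 8 and at most 512 (a blake2b digest limit).
    """
    counts = [0] * bits
    for word in text.lower().split():
        digest = hashlib.blake2b(word.encode("utf-8"), digest_size=bits // 8).digest()
        h = int.from_bytes(digest, "big")
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i, c in enumerate(counts) if c > 0)


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


doc = "Deduplication removes repeated passages before training begins"
same_words = "deduplication  removes repeated passages BEFORE training begins"
different = "Suffix arrays keep every corpus position in sorted order"

print(hamming(simhash(doc), simhash(same_words)))  # 0
print(hamming(simhash(doc), simhash(different)))
```

Each document collapses to a single integer, and similarity becomes a Hamming distance between two integers. That compactness is the main practical contrast with a multi-value MinHash signature.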
Pro Tip: Unsure which tool to pick? Return to the decision matrix above and answer two questions first: (1) Is this corpus destined for a published model or external evaluation? (2) Is the corpus larger than ~10 GB? If either answer is yes, the Google repo is the safer choice regardless of setup friction.
Sources & References
| Source | Use in article |
|---|---|
| google-research/deduplicate-text-datasets | Primary repository source for ExactSubstr, NearDup, released cluster outputs, and hardware guidance |
| "Deduplicating Training Data Makes Language Models Better" (arXiv:2107.06499) | Paper backing the memorization, overlap, and training-step claims |
| arXiv:2107.06499 PDF | Source for the perplexity result cited in the benchmarks section |
| Ambrosia Go package — pkg.go.dev | Official package documentation for Ambrosia and its CLI/package surface |
- google-research/deduplicate-text-datasets — Primary source: GitHub repository containing ExactSubstr and NearDup deduplication code, scripts, and released cluster outputs for C4, RealNews, LM1B, and Wiki-4B-en. Archived; not officially supported by Google.
- "Deduplicating Training Data Makes Language Models Better" (arXiv:2107.06499) — Peer-reviewed paper providing empirical grounding for the Google repo's methodology, including memorization reduction and perplexity results.
- arXiv:2107.06499 PDF — Source for the "up to 10% perplexity reduction" figure cited in the benchmarks section.
- Ambrosia Go package — pkg.go.dev — Official package documentation for Ambrosia (github.com/reactorsh/ambrosia), the source for CLI flag details and packaging model.