Ambrosia vs Google's deduplicate-text-datasets: choosing a text-dedup pipeline for LLM training data

Google’s deduplicate-text-datasets provides exact substring deduplication in Rust plus near-duplicate clustering for large corpora, while Ambrosia is a lightweight package aimed at ergonomics. The deciding constraint is scale and rigor: Google’s repo is built for research-grade dataset deduplication with very large-memory jobs, whereas simpler tools trade accuracy and reproducibility for convenience.

By AxiomLogica Editorial

Jun 1, 2026 · 19 min read

Reviewed by Editorial

How We Compared Ambrosia and deduplicate-text-datasets

The comparison rests on four criteria that matter when building an LLM training pipeline: algorithmic exactness, scale limits, operational complexity, and reproducibility. A tool that runs cleanly on small corpora is a different product from one engineered for C4-scale jobs, and conflating them wastes engineering time in both directions.

Google's deduplicate-text-datasets is the implementation released with the paper "Deduplicating Training Data Makes Language Models Better". As the README states: "This repository contains the code and scripts used in the paper 'Deduplicating Training Data Makes Language Models Better.'" It ships two deduplication modes — ExactSubstr (exact substring, Rust-based) and NearDup (near-duplicate clustering) — and releases pre-computed document clusters for C4, RealNews, LM1B, and Wiki-4B-en. The repo is archived and not an officially supported Google product, but its methodology and released artifacts make it the reference implementation in this comparison.

Ambrosia is a Go package published under github.com/reactorsh/ambrosia. Its public interface exposes CLI-style flags (--baseurl, --instruction), signaling a workflow optimized for fast adoption over configurable algorithmic depth. No peer-reviewed paper backs it; no hardware benchmarks appear in its public docs. The comparison is therefore asymmetric by design: Ambrosia trades rigor for ergonomics, and the right choice depends entirely on where your dataset sits in the spectrum from prototype corpus to publication-grade training set.

Criterion	Ambrosia	deduplicate-text-datasets
Algorithmic exactness	Not publicly documented	ExactSubstr (suffix-array, Rust) + NearDup clustering
Near-duplicate support	Not verified from source	Yes — pre-computed clusters for C4, RealNews, LM1B, Wiki-4B-en
Language / runtime	Go	Rust + C compiler required
Scale ceiling (documented)	Not published	~300 GB with 96 cores, >600 GB RAM, >1 TB disk
Reproducibility	Not benchmarked publicly	Tied to a published paper + released cluster outputs
Operational complexity	Low (CLI-first, single binary)	High (Rust build chain, Python deps for demos)
Official support	Package on pkg.go.dev	Archived GitHub repo, no official Google support

At-a-Glance Comparison Table

Exact and near-duplicate deduplication solve different problems. Exact substring deduplication removes passages that appear verbatim across multiple documents; the suffix-array-based ExactSubstr approach in Google's repo identifies these efficiently even at hundreds of GB. Near-duplicate deduplication clusters documents that overlap heavily in their text but are not byte-identical; the NearDup component in the repo outputs those clusters for downstream filtering decisions. Most pipelines need both: exact dedup catches boilerplate, legal footers, and scraped duplicates; near-dedup catches lightly edited or reformatted copies that inflate apparent dataset diversity without adding signal.
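To make the distinction concrete, here is a minimal Python sketch. It is not code from either tool: the function names, the 3-word shingle size, and the sample strings are illustrative assumptions. A content hash treats any byte-level difference as a different document, while a Jaccard score over word shingles still recognizes reformatted or lightly edited copies.

```python
import hashlib


def content_hash(text: str) -> str:
    """Exact matching: any byte-level difference yields a different hash."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def shingles(text: str, n: int = 3) -> set[str]:
    """Lowercased, whitespace-normalized word n-grams (n=3 is an arbitrary choice)."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Set overlap in [0, 1]; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


original = "the quick brown fox jumps over the lazy dog"
reformatted = "The quick brown fox  jumps over the lazy dog"  # case and spacing differ
edited = "the quick brown fox leaps over the lazy dog"        # one word differs

same_hash = content_hash(original) == content_hash(reformatted)
sim_reformatted = jaccard(shingles(original), shingles(reformatted))
sim_edited = jaccard(shingles(original), shingles(edited))

print(same_hash)        # False
print(sim_reformatted)  # 1.0
print(sim_edited)       # 0.4
```

The reformatted copy is invisible to the hash comparison but scores 1.0 on shingle overlap, and the one-word edit scores 0.4, so whether it counts as a duplicate depends entirely on the threshold chosen.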

Ambrosia operates as a lightweight Go package without published documentation of which deduplication class it implements or how it scales. Treating it as a substitute for ExactSubstr on research corpora requires assumptions the public docs do not support.

Feature	Ambrosia	deduplicate-text-datasets
Exact substring dedup	Not documented	✅ ExactSubstr in Rust
Near-duplicate clustering	Not documented	✅ NearDup; clusters released for C4, RealNews, LM1B, Wiki-4B-en
Language / runtime	Go	Rust + C compiler
Small-corpus hardware	Modern machine, low overhead	~16 GB RAM, few CPU cores
Large-corpus hardware	Not published	96 cores, >600 GB RAM, >1 TB disk for ~300 GB data
Paper-backed methodology	No	Yes — arXiv:2107.06499

Ambrosia: where lightweight packaging helps

Ambrosia packages as a standard Go module (github.com/reactorsh/ambrosia), which means installation is a single go install for the binary (or go get to add it as a module dependency), without managing a Rust toolchain, a C compiler, or a Python virtual environment with TensorFlow dependencies. Its CLI surface, with flags like --baseurl and --instruction, signals that the intended workflow is scripted pipelines and quick iteration rather than configurable algorithmic depth.

For teams already running Go services or building data pipelines where the dedup step is a pre-filter rather than the primary quality gate, Ambrosia's packaging model eliminates meaningful friction. The tradeoff is that no public documentation quantifies its accuracy, memory footprint, or scale ceiling, so the operational envelope is inferred from its design rather than measured.

Ambrosia characteristic	Detail
Runtime	Go — single binary, no Rust or C build step
Interface	CLI-centric (--baseurl, --instruction flags)
Packaging	Standard Go module on pkg.go.dev
Best-fit corpus size	Small to medium (exact ceiling not published)
Reproducibility	Not benchmarked against published baselines
Integration burden	Low — fits naturally into Go-based data pipelines

Best-fit use cases for small and medium corpora

Ambrosia fits cleanly when the deduplication step is a sanity filter on a corpus you assembled yourself — internal documents, domain-specific crawls under a few GB, or synthetic instruction sets where you control the source. In these scenarios, catching obvious verbatim duplicates matters more than guaranteeing a provably complete near-duplicate sweep, and setup time is a real cost.

Pro Tip: For prototyping a new domain-specific fine-tune, a fast CLI dedup pass that removes blatant repeats is usually enough to unblock training. Invest in research-grade dedup only once you've confirmed the dataset composition, not before. Ambrosia's Go packaging means you can wire it into an existing pipeline in under an hour — use that advantage early, and switch tools when scale or reproducibility requirements harden.

Where Ambrosia starts to fall short

Ambrosia's limitations surface in two scenarios: corpora above the scale where its undocumented memory model breaks down, and pipelines where deduplication decisions must be auditable or reproducible across team members. Neither threshold is published in Ambrosia's public docs, which itself is the problem.

Watch Out: For corpora above ~10 GB, or any pipeline destined for a published model or dataset release, Ambrosia's lack of documented scale limits and published accuracy creates an unquantifiable risk. Near-duplicate clustering in particular relies on approximate-similarity algorithms (such as MinHash LSH or SimHash) whose memory and runtime behavior at scale depends heavily on the implementation, while exact substring matching relies on suffix arrays that grow with the corpus. The Google repo's requirements (96 cores, >600 GB RAM, and >1 TB disk for ~300 GB of data) exist because the problem is genuinely hard at that scale; a lightweight Go binary with no published benchmarks should not be assumed to handle it differently without evidence.

The RAM question matters concretely: if your corpus is approaching C4 scale (~300 GB), you need a tool whose hardware requirements are documented and whose algorithm is auditable. Ambrosia provides neither.

google-research/deduplicate-text-datasets: the research-grade option

The google-research/deduplicate-text-datasets repository gives practitioners two things most dedup tools do not: a published algorithmic methodology tied to a peer-reviewed paper, and pre-computed clustering outputs for widely used corpora. The README is explicit: "We also release the document clusters resulting from running NearDup deduplication on C4, RealNews, LM1B, and Wiki-4B-en." Teams training on any of those corpora can consume the released clusters directly without re-running the full pipeline.

The repo's ExactSubstr component is implemented in Rust for memory safety and performance. Its NearDup component produces document-level cluster outputs, not just a deduplicated corpus, which means downstream teams can implement custom merge strategies rather than accepting a binary keep/drop decision. The build prerequisites — Rust toolchain, C compiler, and Python dependencies used for demos and dataset-generation scripts, including numpy, scipy, sentencepiece, and a TensorFlow requirements file — reflect the research provenance: this was built to be correct and reproducible, not to minimize setup time.

One material operational caveat: the repository is archived on GitHub and explicitly noted as not an officially supported Google product. Teams depending on it for production pipelines inherit maintenance responsibility.

Component	Detail
ExactSubstr	Rust implementation; removes verbatim repeated substrings at corpus scale
NearDup	Produces document clusters; pre-computed outputs released for C4, RealNews, LM1B, Wiki-4B-en
Inspection scripts	Python scripts for inspecting and analyzing dedup results
Build requirements	Rust toolchain + C compiler; Python deps for demos and dataset generation (numpy, scipy, sentencepiece, requirements-tf.txt)
Repo status	Archived; not officially supported by Google
Paper	"Deduplicating Training Data Makes Language Models Better", arXiv:2107.06499

ExactSubstr and near-duplicate clustering in the repo

ExactSubstr locates all substrings above a configurable length threshold that appear in more than one document, then removes or flags them. The implementation uses a suffix-array construction over the concatenated corpus, which scales to hundreds of GB at the cost of proportionally large RAM requirements. NearDup operates at the document level, grouping documents by approximate similarity and releasing those cluster assignments for downstream use.
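The suffix-array idea can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the repo's Rust implementation: the naive construction below is quadratic or worse, the NUL separator is assumed not to occur in the documents, and all function names are hypothetical. It relies on one property: once suffixes are sorted, any two occurrences of a repeated passage sit next to each other, so a single pass over adjacent pairs finds them.

```python
def suffix_array(text: str) -> list[int]:
    """Return suffix start positions in sorted suffix order (naive, toy-scale only)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def common_prefix_len(text: str, i: int, j: int, sep: str) -> int:
    """Length of the prefix shared by two suffixes, never crossing a document separator."""
    n = 0
    while i + n < len(text) and j + n < len(text):
        if text[i + n] != text[j + n] or text[i + n] == sep:
            break
        n += 1
    return n


def repeated_substrings(docs: list[str], min_len: int) -> set[str]:
    """Collect repeated passages of at least min_len characters."""
    sep = "\x00"  # assumption: this character never appears inside a document
    text = sep.join(docs)
    order = suffix_array(text)
    found = set()
    # Compare each suffix with its neighbor in sorted order.
    for prev, cur in zip(order, order[1:]):
        n = common_prefix_len(text, prev, cur, sep)
        if n >= min_len:
            found.add(text[cur:cur + n])
    return found


docs = [
    "apple pie recipe. All rights reserved 2021.",
    "bike repair notes. All rights reserved 2021.",
]
found = repeated_substrings(docs, min_len=10)
longest = max(found, key=len)
print(repr(longest))  # '. All rights reserved 2021.'
```

The result set also contains shorter tails of the same passage (each is itself a repeat above the threshold); a production pipeline would merge such overlapping ranges before deciding what to cut.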

The distinction matters operationally: ExactSubstr is a filter (it removes content); NearDup is a classifier (it groups documents and leaves the merge decision to the user). Most research pipelines use both — ExactSubstr to remove verbatim boilerplate and NearDup to collapse paraphrased or reformatted duplicates that inflate dataset size without diversity.
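The "classifier, not filter" behavior can be illustrated with a small clustering sketch. It is a simplified stand-in rather than the repo's NearDup code: it compares every pair of documents directly (which only works at toy scale), and the 0.7 threshold, 3-word shingles, and function names are assumptions chosen for the example. The output is a set of clusters, and what to keep from each cluster is left to the caller.

```python
from itertools import combinations


def shingles(text: str, n: int = 3) -> set[str]:
    """Lowercased word n-grams used as the unit of comparison."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_near_duplicates(docs: dict[str, str], threshold: float) -> list[set[str]]:
    """Group document ids whose shingle overlap meets the threshold (union-find)."""
    parent = {doc_id: doc_id for doc_id in docs}

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    sets = {doc_id: shingles(text) for doc_id, text in docs.items()}
    for a, b in combinations(docs, 2):
        if jaccard(sets[a], sets[b]) >= threshold:
            parent[find(a)] = find(b)

    groups: dict[str, set[str]] = {}
    for doc_id in docs:
        groups.setdefault(find(doc_id), set()).add(doc_id)
    return list(groups.values())


docs = {
    "d1": "breaking news the city council approved the new budget on monday",
    "d2": "breaking news the city council approved the new budget on tuesday",
    "d3": "recipe for sourdough bread with a long cold overnight proof",
}
clusters = cluster_near_duplicates(docs, threshold=0.7)
print(sorted(sorted(c) for c in clusters))  # [['d1', 'd2'], ['d3']]
```

A downstream step might keep the longest document per cluster, the earliest crawled copy, or one copy per source; the clustering output supports any of those policies.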

Pro Tip: Exact substring removal alone is insufficient for scraped web corpora. A paragraph scraped from 500 mirror sites will have minor HTML-introduced variation — different whitespace, truncated sentences — that ExactSubstr will not catch as duplicates. NearDup clustering handles this class of near-identical content. On C4, the paper reports that near-duplicate removal substantially reduced train-test overlap; exact dedup alone would have missed a meaningful fraction of those cases.

What the README implies about scale and hardware

The README's hardware guidance is the clearest signal of the tool's target operating regime. Small datasets — on the order of 10 GB — run on a modern machine with roughly 16 GB RAM and a few CPU cores. Once the corpus approaches C4 scale, the requirements escalate sharply.

As the README states: "If you want to deduplicate something the size of C4 (~300GB) you will want a machine with as many cores as you can get (we used 96 cores) and >600GB of RAM. You will also need >1TB hard drive space."

Corpus size	CPU cores	RAM	Disk
~10 GB (small)	A few cores	~16 GB	Proportional to corpus
~300 GB (C4 scale)	96 cores (recommended)	>600 GB	>1 TB

These numbers are not conservative estimates padded for safety; they reflect the actual hardware the Google Research team used. A suffix array indexes every position in the concatenated corpus in sorted suffix order, so the index is several times larger than the raw text; at 300 GB of input, the working data structures exceed what a standard cloud instance provides. Teams planning C4-scale jobs should budget accordingly, either for a high-memory bare-metal instance or a managed distributed processing service.
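A back-of-envelope calculation shows why the index alone outgrows the raw text. The sketch below assumes one pointer per input byte, packed into the smallest whole number of bytes that can address the corpus; that layout is an assumption for illustration, not a statement about the repo's actual on-disk format, and the function name is hypothetical.

```python
import math


def suffix_array_bytes(corpus_bytes: int) -> int:
    """Estimated index size: one packed pointer per byte of input (assumed layout)."""
    # Bits needed to address positions 0 .. corpus_bytes - 1.
    bits = max(1, (corpus_bytes - 1).bit_length())
    pointer_bytes = math.ceil(bits / 8)
    return corpus_bytes * pointer_bytes


c4_scale = 300 * 10**9  # ~300 GB of raw text
index_size = suffix_array_bytes(c4_scale)
print(f"{index_size / 1e12:.1f} TB")  # 1.5 TB
```

Under that assumption a ~300 GB corpus needs 5-byte pointers, which puts the index at roughly five times the input before any working space is counted, in line with the README's guidance of more than 1 TB of disk.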

Benchmarks and operational trade-offs that matter in practice

The core competitive gap between these tools is not throughput but algorithmic completeness and the downstream impact on model quality. The paper backing Google's repo provides the most direct evidence: "Deduplication allows us to train models that emit memorized text ten times less frequently and require fewer train steps to achieve the same or better accuracy." In the paper's reported results, deduplication also reduces perplexity by up to 10% in some cases, and it reduces train-test overlap, which the paper finds affects more than 4% of the validation set of standard datasets.

These numbers are not achievable with a dedup pass that misses near-duplicate content. A tool whose algorithm is undocumented cannot make reproducible guarantees about which of those gains it captures. Ambrosia's lighter packaging is a genuine advantage for fast iteration; it is not a substitute for ExactSubstr + NearDup when dataset quality is the primary variable being controlled.

Metric	Google repo (paper-reported)	Ambrosia
Memorization reduction	10× fewer memorized emissions	Not benchmarked
Perplexity improvement	Up to 10% in some cases	Not benchmarked
Train-test overlap	Reduced; overlap affects >4% of the validation set before dedup	Not benchmarked
C4-scale hardware (verified)	96 cores, >600 GB RAM, >1 TB disk	Not documented
Small-corpus hardware	~16 GB RAM, few cores	Not documented
Setup time	High (Rust + C + Python deps)	Low (Go module, single binary)

No published head-to-head benchmark compares Ambrosia and the Google repo on the same corpus. The operational comparison is therefore anchored to verified constraints from each tool's public documentation rather than a controlled experiment.

Small-corpus workflow versus research-corpus workflow

The workflow divergence maps cleanly to corpus size.

Decision dimension	Small corpus (≤~10 GB)	Research corpus (~300 GB, C4 scale)
Dataset size	Internal, self-assembled, prototype-scale	Public benchmark, release-scale, or model-training artifact
Reproducibility need	Team-internal only	External or publication-grade
Recommended tool	Ambrosia (or Google repo)	Google repo only
Setup effort	Minutes (Go install) vs. hours (Rust build + Python env)	Google repo; setup cost is fixed, not per-run
RAM requirement	~16 GB sufficient for Google repo	>600 GB for Google repo
Disk requirement	Proportional to corpus	>1 TB
CPU requirement	A few cores	96 cores recommended
Near-dedup support	Ambrosia: unknown; Google repo: yes	Google repo: yes
Reproducibility	Ambrosia: not documented; Google repo: paper-anchored	Google repo: paper-anchored
Publication-grade use	Usually unnecessary	Required

For small corpora where speed of setup is the binding constraint and the corpus is self-assembled (not a public benchmark dataset), Ambrosia's Go packaging means a lower total setup cost. For anything requiring auditability, large scale, or both, the Google repo is the only verified option of the two.

Where exactness beats convenience

Research-grade dedup becomes mandatory — not optional — when the dataset is destined for a published model, a public benchmark release, or any training run where memorization and train-test contamination are evaluation criteria.

Pro Tip: If your model will be evaluated on benchmarks that share source material with your training corpus (a near-universal condition for web-scraped data), running only a lightweight dedup pass creates a contamination risk you cannot quantify post-hoc. The Google paper quantifies the impact directly: models trained on deduplicated data emit memorized content ten times less often and show measurably lower train-test overlap. That gap exists regardless of which convenience tool you used upstream — the contamination is already in the weights.

The suffix-array approach in ExactSubstr finds duplicates that document-level hashing misses, particularly repeated passages embedded in otherwise distinct documents. When a dataset is the product (the artifact that determines the model's legal and empirical properties), algorithmic completeness outweighs setup friction by a wide margin.

Decision matrix: which pipeline should you choose?

Scenario	Dataset size	Reproducibility	Operational burden	Choice
Internal prototype corpus	≤ ~10 GB	Team-internal only	Low, Go-first integration	✅ Ambrosia is sufficient
Domain crawl or private fine-tune set	~10–100 GB	Helpful but not externally audited	Moderate; hardware requirements are unknown for Ambrosia	⚠️ Google repo if auditability matters
Publication-grade training data	~300 GB (C4 scale)	Required	High; 96 cores, >600 GB RAM, >1 TB disk	✅ deduplicate-text-datasets
Public benchmark or dataset release	Any size with contamination risk	Required	High; released cluster outputs and paper trail matter	✅ deduplicate-text-datasets
Rapid experimentation	Small to medium	Team-internal only	Low	✅ Ambrosia
Model evaluation tied to memorization and overlap	Any corpus with shared source material	Required	High	✅ deduplicate-text-datasets

Scenario	Ambrosia	deduplicate-text-datasets
Corpus size ≤ ~10 GB, internal use	✅ Sufficient	✅ Also viable; higher setup cost
Corpus size ~10–100 GB	⚠️ Scale limits unknown	✅ Viable on high-memory instance
Corpus size ~300 GB (C4 scale)	❌ Not documented for this scale	✅ Designed for this; 96 cores, >600 GB RAM, >1 TB disk
Reproducibility required	❌ Not benchmarked publicly	✅ Tied to published paper
Near-duplicate clustering needed	❌ Not documented	✅ NearDup with released cluster outputs
Publication or dataset release	❌ Insufficient auditability	✅ Mandatory
Prototype / rapid iteration	✅ Fast adoption, low friction	⚠️ High setup cost for prototyping
Go-based pipeline integration	✅ Native	❌ Requires Rust + C + Python
Team has Rust build capacity	Irrelevant	✅ Prerequisite met

Choose Ambrosia when speed of adoption matters more than maximal rigor

Ambrosia is the right choice when the dedup step is a pre-filter on a small, self-assembled corpus where no published baseline or external evaluation will scrutinize duplicate rates. Its Go packaging means a data engineer can wire it into an existing pipeline without acquiring Rust toolchain expertise or managing a Python environment with TensorFlow dependencies. For internal fine-tune experiments, domain adaptation on proprietary corpora under ~10 GB, or rapid prototyping where the dataset composition will change before training begins, Ambrosia's ergonomics deliver real time savings.

The key condition: your evaluation criteria do not depend on provably complete near-duplicate removal, and reproducibility is a team-internal concern rather than an external standard.

Choose google-research/deduplicate-text-datasets when the dataset is the product

When the deduplicated corpus is itself the artifact (a public release, a training set for a published model, or a benchmark dataset), the Google repo is the only verified option of the two. Its ExactSubstr implementation removes verbatim substring repetition with a documented algorithm; its NearDup component produces auditable cluster outputs that teams can inspect, reproduce, and cite. The paper "Deduplicating Training Data Makes Language Models Better" provides the empirical grounding: 10× reduction in memorized emissions, up to 10% perplexity improvement in some cases, and reduced train-test overlap, which affects more than 4% of the validation set of standard datasets.

The operational cost is real: Rust toolchain, C compiler, Python dependencies, and for C4-scale work, a 96-core machine with more than 600 GB of RAM and a terabyte of disk. Teams without that infrastructure need to provision it. The repo is also archived and carries no official Google support commitment — factor in that the maintenance burden falls on the consuming team.

FAQ

What is text deduplication in NLP?

Text deduplication identifies and removes repeated content from a corpus before it is used for model training. Repetitions range from exact byte-level matches (the same paragraph scraped from 200 mirror sites) to near-duplicate documents that differ only in whitespace, punctuation, or minor paraphrasing. Removing them prevents models from memorizing repeated passages, reduces training steps wasted on redundant data, and improves the validity of held-out evaluation sets.

Why is deduplication important for LLM training data?

Without deduplication, a language model disproportionately memorizes the most frequent content in its training set, such as boilerplate legal text, SEO-farmed articles, and templated content. The paper backing Google's repo quantifies this: models trained on deduplicated data emit memorized text ten times less frequently and reach the same or better accuracy in fewer training steps. Deduplication also reduces train-test overlap, where validation examples also appear in the training data; the paper finds this affects more than 4% of the validation set of standard datasets, so removing it directly improves evaluation integrity.

What is the difference between exact and near-duplicate text deduplication?

Exact deduplication (ExactSubstr) identifies substrings or documents that are byte-for-byte identical above a configurable length threshold. Near-duplicate deduplication (NearDup) clusters documents that are similar but not identical, catching lightly edited or reformatted copies. Most production pipelines apply both: exact dedup is faster and fully deterministic; near-dedup requires approximate similarity computation (for example MinHash LSH or SimHash) and involves a similarity threshold decision that affects recall.

How much RAM do I need for large-scale text deduplication?

For small corpora (~10 GB), roughly 16 GB of RAM and a few CPU cores suffice when using the Google repo's ExactSubstr. At C4 scale (~300 GB), the repo's README documents 96 cores, more than 600 GB of RAM, and more than 1 TB of disk. The suffix array indexes every position in the concatenated corpus, so its size, and the memory needed to build it, grows with corpus size. Ambrosia does not publish memory requirements for any corpus size.

Is MinHash or SimHash better for text deduplication?

Neither dominates across all regimes. MinHashLSH provides probabilistically bounded false-negative rates and is well-suited to document-level near-duplicate detection at scale. SimHash is faster to compute and works well for shorter documents and deduplication at the paragraph or sentence level. The choice depends on document length distribution, acceptable false-negative rate, and compute budget. No verified head-to-head benchmark between the two is cited in this article; consult the deduplication literature directly for corpus-specific guidance.
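For readers who have not seen MinHash in code, the following Python sketch shows the core estimator only. It leaves out the LSH banding step that makes lookup scale, and it is not taken from either tool: the seeded SHA-1 hashing, the 128-hash signature length, and all names are illustrative assumptions. Each document is reduced to a fixed-length signature, and the fraction of matching signature positions approximates the Jaccard similarity of the underlying shingle sets.

```python
import hashlib


def shingles(text: str, n: int = 3) -> set[str]:
    """Lowercased word n-grams."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def hash64(seed: int, item: str) -> int:
    """Seeded 64-bit hash; each seed stands in for one random permutation."""
    digest = hashlib.sha1(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(items: set[str], num_hashes: int = 128) -> list[int]:
    """Keep the minimum hash value per seed; items must be non-empty."""
    return [min(hash64(seed, item) for item in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of signature positions that agree."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


original = "the committee published its final report on dataset quality this week"
edited = "the committee published its final report on dataset quality last week"
unrelated = "a short guide to pruning apple trees in late winter"

sig_original = minhash_signature(shingles(original))
sig_edited = minhash_signature(shingles(edited))
sig_unrelated = minhash_signature(shingles(unrelated))

print(estimate_jaccard(sig_original, sig_original))  # 1.0
print(estimate_jaccard(sig_original, sig_edited))    # an estimate of the true overlap
print(estimate_jaccard(sig_original, sig_unrelated))
```

Longer signatures tighten the estimate at the cost of more hashing per document, which is one of the compute-budget trade-offs mentioned above.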

Pro Tip: Unsure which tool to pick? Return to the decision matrix above and answer two questions first: (1) Is this corpus destined for a published model or external evaluation? (2) Is the corpus larger than ~10 GB? If either answer is yes, the Google repo is the safer choice regardless of setup friction.

Sources & References

Source	Use in article
google-research/deduplicate-text-datasets	Primary repository source for ExactSubstr, NearDup, released cluster outputs, and hardware guidance
"Deduplicating Training Data Makes Language Models Better" (arXiv:2107.06499)	Paper backing the memorization, overlap, and training-step claims
arXiv:2107.06499 PDF	Source for the perplexity result cited in the benchmarks section
Ambrosia Go package — pkg.go.dev	Official package documentation for Ambrosia and its CLI/package surface

google-research/deduplicate-text-datasets — Primary source: GitHub repository containing ExactSubstr and NearDup deduplication code, scripts, and released cluster outputs for C4, RealNews, LM1B, and Wiki-4B-en. Archived; not officially supported by Google.
"Deduplicating Training Data Makes Language Models Better" (arXiv:2107.06499) — Peer-reviewed paper providing empirical grounding for the Google repo's methodology, including memorization reduction and perplexity results.
arXiv:2107.06499 PDF — Source for the "up to 10% perplexity reduction" figure cited in the benchmarks section.
Ambrosia Go package — pkg.go.dev — Official package documentation for Ambrosia (github.com/reactorsh/ambrosia), the source for CLI flag details and packaging model.
