At a glance: what you need before running Setu
At a Glance: Time: 2–4 hours to first filtered corpus · Prereqs: Linux (native or WSL), Miniconda, OpenJDK 11, Apache Spark 3.5.1 for Hadoop 3.3, Python 3.10.x+ · Hardware: Any multi-core machine with ≥16 GB RAM for local testing; distributed Spark cluster for TB-scale corpora · Cost: Open-source stack; Google Cloud Vision API billed per page for PDF OCR
Setu is a four-stage Spark-native pipeline from AI4Bharat that turns raw web crawls, PDFs, and speech transcripts into SFT-ready corpora. As its documentation states: "This documentation provides an overview of Setu and its workflow, enabling users to efficiently manage and process Web, PDF, and Speech data with Apache Spark." The four stages run in strict order — document preparation → cleaning and analysis → flagging/filtering → deduplication — and skipping or reordering them degrades output quality in measurable ways.
Hugging Face Datasets, backed by Apache Arrow, handles the final hand-off: once Spark writes out the filtered records, Datasets loads them with zero-copy reads and exposes the split/export API that fine-tuning frameworks expect. The two tools are complementary: Spark owns the heavy distributed cleaning work; Datasets owns the trainer-facing interface.
Apache Spark 3.5.1 is the runtime that makes Setu's distributed cleaning stages economical at scale. Local mode works for corpora under ~50 GB; larger jobs need a proper cluster or cloud executor.
Windows users must use WSL. Setu's documentation is direct: "Note that users who want to run the pipeline on Windows systems are advised to use WSL (Windows Subsystem for Linux) for easier usage." Several shell scripts and native dependencies are Linux-only and will not run under PowerShell or CMD.
Prerequisites and environment setup
Setu's install surface has four hard dependencies: a Linux shell (native or WSL), Miniconda for environment isolation, OpenJDK 11 for the JVM, and Apache Spark 3.5.1 built against Hadoop 3.3. Every version pin matters: Spark 3.5 runs only on specific Java releases (8, 11, and 17), and Setu's PySpark calls expect the Spark 3.5 API surface.
After a working WSL/Linux shell is confirmed, set spark.driver.host to localhost in spark-defaults.conf. Without this binding, a local Spark driver picks up the WSL virtual network interface instead of loopback, which causes confusing connection-refused errors before the first job even launches.
# $SPARK_HOME/conf/spark-defaults.conf
spark.driver.host localhost
spark.executor.memory 4g
spark.driver.memory 2g
Install Python, conda, Java, and WSL-friendly dependencies
Clone the repository, then build the environment directly from the committed environment.yml. This avoids version drift between Setu's tested dependency set and whatever pip install would resolve on a given date.
# Confirm WSL or Linux shell before proceeding
$ uname -a
# Install OpenJDK 11 (required by Spark 3.5.1)
$ sudo apt-get update && sudo apt-get install -y openjdk-11-jdk
# Confirm Java version — must show openjdk 11
$ java -version
# Clone Setu and create the conda environment
$ git clone https://github.com/AI4Bharat/setu.git
$ cd setu
$ conda env create -f environment.yml
$ conda activate setu
# Download Spark 3.5.1 for Hadoop 3.3 (superseded releases are served from the Apache archive)
$ wget https://archive.apache.org/dist/spark/spark-3.5.1/spark-3.5.1-bin-hadoop3.tgz
$ tar -xzf spark-3.5.1-bin-hadoop3.tgz
$ export SPARK_HOME=$PWD/spark-3.5.1-bin-hadoop3
$ export PATH=$SPARK_HOME/bin:$PATH
$ export PYSPARK_PYTHON=$(which python)
Python 3.10.x or above is required. The environment.yml pins this, but verify with python --version after activation: if the environment did not activate cleanly, the base install's interpreter may still be first on PATH.
Verify Spark and PySpark before touching the corpus
Confirm the Spark runtime before processing any data. A failed verification here costs minutes; a silent misconfiguration discovered mid-job costs hours.
# Verify Spark 3.5.1 is on PATH
$ spark-shell --version
# Expected: version 3.5.1, Using Scala version 2.12.x, Java 11
# Verify PySpark picks up the same Spark installation
$ python - <<'EOF'
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.master("local[*]") \
.appName("setu-verify") \
.getOrCreate()
print(spark.version) # must print 3.5.1
print(spark.sparkContext.getConf().get("spark.driver.host")) # must print localhost
spark.stop()
EOF
Expected output: 3.5.1 on the first line, localhost on the second. Any other value indicates a SPARK_HOME mismatch or a stale spark-defaults.conf.
Step 1: Prepare raw web, PDF, and speech inputs
Document preparation is Setu's first stage and the one most practitioners under-invest in. Upstream extraction quality places a hard ceiling on downstream filtering precision — garbage in means your filtering budget gets spent on noise that should never have entered the pipeline. Setu's quickstart references trafilatura for HTML and the Google Cloud Vision SDK for PDF OCR, and the Setu documentation frames the workflow as a Spark-based path for web, PDF, and speech data.
Extract text from HTML with trafilatura
Trafilatura is a Python package and command-line tool that strips navigation bars, cookie banners, ads, and structural HTML from web pages, leaving the main content as plain text. Its core function API also exposes a deduplicate flag that removes duplicate segments before the document even enters Setu's pipeline — a cleaning step that generic Hugging Face Datasets creation workflows skip entirely, relying instead on the user to bring pre-cleaned text.
import json

import trafilatura
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
from trafilatura.settings import use_config

# Load trafilatura's default settings and cap extraction time per document
config = use_config()
config.set("DEFAULT", "EXTRACTION_TIMEOUT", "30")

def extract_web_text(html: str, url: str) -> dict | None:
    """Return extracted text plus metadata, or None if extraction fails."""
    result = trafilatura.extract(
        html,
        url=url,
        include_comments=False,
        include_tables=True,
        deduplicate=True,      # removes duplicate segments within the page
        output_format="json",  # keeps metadata fields alongside the text
        config=config,
    )
    if result is None:
        return None
    # Inspect the keys on a sample first; field names vary by trafilatura version
    return json.loads(result)

def extract_web_json(html: str, url: str) -> str | None:
    """UDF wrapper: serialize the dict so the value matches StringType."""
    record = extract_web_text(html, url)
    return json.dumps(record, ensure_ascii=False) if record is not None else None

# Spark UDF for distributed extraction across a web crawl partition
extract_udf = udf(extract_web_json, StringType())
The output_format="json" flag is important: it preserves the language, date, and url fields that Setu's quality analysis stage uses as filtering signals. Discarding them here forces you to re-derive them later at higher cost.
Turn PDFs into OCR JSON and page-level text
PDF text extraction through Setu routes through the Google Cloud Vision SDK, which outputs structured JSON with per-page bounding boxes and confidence scores. Storing the full OCR JSON — not just the extracted string — preserves layout metadata that can distinguish a table cell from a paragraph during the cleaning stage.
from pathlib import Path

from google.cloud import vision

def ocr_pdf_to_json(pdf_path: str, bucket: str = "your-bucket") -> str:
    """
    Run Cloud Vision OCR on a PDF that is already uploaded to
    gs://<bucket>/<file name> and write the OCR JSON under a per-document prefix.
    Returns the GCS prefix that holds the JSON output.
    """
    client = vision.ImageAnnotatorClient()
    pdf = Path(pdf_path)
    gcs_source_uri = f"gs://{bucket}/{pdf.name}"
    gcs_dest_uri = f"gs://{bucket}/ocr-output/{pdf.stem}/"

    feature = vision.Feature(type_=vision.Feature.Type.DOCUMENT_TEXT_DETECTION)
    gcs_source = vision.GcsSource(uri=gcs_source_uri)
    input_config = vision.InputConfig(gcs_source=gcs_source, mime_type="application/pdf")
    gcs_dest = vision.GcsDestination(uri=gcs_dest_uri)
    # batch_size is the number of pages written to each output JSON file
    output_config = vision.OutputConfig(gcs_destination=gcs_dest, batch_size=10)

    async_request = vision.AsyncAnnotateFileRequest(
        features=[feature],
        input_config=input_config,
        output_config=output_config,
    )
    operation = client.async_batch_annotate_files(requests=[async_request])
    operation.result(timeout=300)  # blocks until the OCR job completes or times out
    return gcs_dest_uri  # downstream stages read JSON directly from GCS
Keep OCR output grouped per document (one output prefix per PDF, as above) rather than flattening it into per-page records. Page-level text can be reconstructed from the JSON, but you cannot reconstruct the document-level structure if you sharded by page first.
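To illustrate that page-level text can be rebuilt from document-level OCR JSON, here is a minimal sketch. It assumes each output payload follows the Cloud Vision file-annotation layout, with a top-level responses list whose items carry fullTextAnnotation.text and context.pageNumber. The helper name pages_from_ocr_json and the sample payload are illustrative and not part of Setu.

```python
import json

def pages_from_ocr_json(payload: dict) -> list[dict]:
    """Flatten one OCR output payload into page-level records (hypothetical helper)."""
    pages = []
    for position, response in enumerate(payload.get("responses", []), start=1):
        annotation = response.get("fullTextAnnotation", {})
        context = response.get("context", {})
        pages.append({
            # Assumption: fall back to list position when pageNumber is absent.
            "page": context.get("pageNumber", position),
            "text": annotation.get("text", ""),
        })
    return pages

# Illustrative payload, not real OCR output.
sample = json.loads("""
{"responses": [
  {"fullTextAnnotation": {"text": "Page one text"}, "context": {"pageNumber": 1}},
  {"context": {"pageNumber": 2}}
]}
""")

pages = pages_from_ocr_json(sample)
# Pages with no recognized text are worth logging before the cleaning stage.
empty_pages = [p["page"] for p in pages if not p["text"].strip()]
print(empty_pages)
```

Listing empty pages at this point gives an early signal of OCR failures before any filtering threshold hides them.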
Step 2: Clean and analyze the documents
Setu's cleaning and analysis stage runs after document preparation and before any filtering threshold is applied, as described in the Setu documentation. This ordering matters: cleaning normalizes the text so quality signals are computed on comparable representations, not raw extraction artifacts. This is also where Setu differs most from a generic Hugging Face Datasets workflow: datasets.load_dataset() loads whatever text you give it without inspecting its quality. Setu's cleaning stage catches artifacts that would otherwise pass into training: encoding mojibake, HTML entity residuals, truncated OCR lines, and script-mixed sentences.
Remove boilerplate, malformed text, and noisy segments
The cleaning pass targets four categories of noise: structural boilerplate (repeated headers/footers that trafilatura missed), malformed UTF-8 or encoding artifacts, OCR noise in PDF-derived text, and excessively short or empty segments.
import re
import unicodedata

# Patterns that indicate malformed or boilerplate content
BOILERPLATE_PATTERNS = [
    re.compile(r"(cookie policy|privacy policy|all rights reserved)", re.IGNORECASE),
    re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]"),  # non-printable control chars
    re.compile(r"(.)\1{6,}"),  # 7+ repeats of the same character (OCR artifacts)
]

def clean_document(record: dict) -> dict:
    """
    Normalize and strip artifacts. Preserves metadata fields for audit trail.
    Returns record with 'text_clean' field added and 'clean_pass' flag set.
    """
    text = record.get("text", "")
    # Normalize unicode to NFC; critical for Indic scripts and mixed corpora
    text = unicodedata.normalize("NFC", text)
    # Replace each boilerplate or artifact match with a single space
    for pattern in BOILERPLATE_PATTERNS:
        text = pattern.sub(" ", text)
    # Collapse excessive whitespace without destroying paragraph breaks
    text = re.sub(r" {2,}", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    text = text.strip()
    record["text_clean"] = text
    record["clean_pass"] = len(text) > 50  # flag documents that survive cleaning
    return record
Retain the original text field alongside text_clean during development. If a cleaning regex is too aggressive, you need the original to diagnose what was lost.
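One way to use the retained original is to measure how much each document shrank during cleaning. The sketch below is a hypothetical diagnostic: the helper name cleaning_loss, the 0.5 alert cutoff, and the sample records are assumptions for illustration, not Setu defaults.

```python
def cleaning_loss(record: dict) -> float:
    """Fraction of characters removed by cleaning (0.0 means nothing was removed)."""
    original = record.get("text", "")
    cleaned = record.get("text_clean", "")
    if not original:
        return 0.0
    return max(0.0, 1.0 - len(cleaned) / len(original))

# Illustrative records; in practice these come from the cleaned dataset.
records = [
    {"doc_id": "a", "text": "x" * 100, "text_clean": "x" * 95},
    {"doc_id": "b", "text": "y" * 100, "text_clean": "y" * 40},
]

LOSS_ALERT = 0.5  # illustrative cutoff to calibrate on your own corpus
suspicious = [r["doc_id"] for r in records if cleaning_loss(r) > LOSS_ALERT]
print(suspicious)
```

Documents that lose most of their characters are the ones to open side by side with the original when tuning a regex.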
Score document quality before any filtering threshold is applied
Quality scoring precedes threshold application so you can inspect the score distribution before committing to cutoffs. Hugging Face Datasets' map API in batched mode makes this efficient even over large corpora, and the Setu documentation makes clear that this stage sits after cleaning and before filtering.
import langdetect
from datasets import Dataset
from langdetect import DetectorFactory

DetectorFactory.seed = 0  # langdetect results vary between runs unless seeded

def score_document(batch: dict) -> dict:
    """
    Compute heuristic quality signals over a batch of cleaned documents.
    Hugging Face Datasets batched map passes lists, not individual records.
    """
    scores = []
    languages = []
    for text in batch["text_clean"]:
        words = text.split()
        word_count = len(words)
        unique_ratio = len(set(words)) / max(word_count, 1)
        # Note: isalpha() is False for combining vowel signs in Indic scripts,
        # so this ratio runs lower for such text than for Latin-script text
        alpha_ratio = sum(c.isalpha() for c in text) / max(len(text), 1)
        # Composite quality score: penalize very short, repetitive, or non-alphabetic text
        quality = (
            min(word_count / 200, 1.0) * 0.4  # length signal, saturates at 200 words
            + unique_ratio * 0.3              # lexical diversity
            + alpha_ratio * 0.3               # alphabetic content ratio
        )
        scores.append(round(quality, 4))
        try:
            lang = langdetect.detect(text)
        except Exception:
            lang = "unknown"
        languages.append(lang)
    batch["quality_score"] = scores
    batch["detected_lang"] = languages
    return batch

# Apply over a Hugging Face Dataset loaded from Setu's extraction outputs
ds = Dataset.from_json("setu_extraction_output.jsonl")
ds = ds.map(clean_document)  # adds text_clean and clean_pass (defined in the cleaning step)
ds = ds.map(score_document, batched=True, batch_size=256)
Plot the quality_score histogram before setting any filter threshold. A bimodal distribution signals two distinct content types (e.g., full articles vs. stub pages); a single cut-off will either under-filter noise or over-prune valid short-form content.
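A quick way to look at the distribution without a plotting library is a text histogram. This is a minimal sketch: the function names, bin count, and sample scores are illustrative assumptions, and in practice the scores would come from the quality_score column.

```python
def score_histogram(scores: list[float], bins: int = 10) -> list[int]:
    """Count scores in equal-width bins over [0, 1]."""
    counts = [0] * bins
    for score in scores:
        clamped = min(max(score, 0.0), 1.0)
        index = min(int(clamped * bins), bins - 1)  # a score of 1.0 goes in the last bin
        counts[index] += 1
    return counts

def render(counts: list[int], width: int = 40) -> str:
    """Draw one bar per bin, labelled with the bin's lower edge."""
    peak = max(counts) or 1
    lines = []
    for i, count in enumerate(counts):
        lower_edge = i / len(counts)
        bar = "#" * round(count / peak * width)
        lines.append(f"{lower_edge:.1f} {bar} {count}")
    return "\n".join(lines)

# Illustrative scores with two clusters (stub pages versus full articles).
scores = [0.12, 0.15, 0.18, 0.22, 0.71, 0.74, 0.78, 0.81, 0.83, 1.0]
counts = score_histogram(scores, bins=5)
print(render(counts))
```

An empty bin between two populated regions, as in this sample, is the bimodal shape to look for before choosing a cutoff.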
Step 3: Flag and filter low-quality samples
Setu's flagging and filtering stage applies hard thresholds to the quality signals computed in stage two. The Setu GitHub repository documents the workflow order as document preparation → cleaning and analysis → flagging/filtering → deduplication. Within this stage, the distinction between flagging and filtering is operational: flagging marks records with rejection reasons (preserving them for audit), while filtering drops them. Running both in the same pass means you can report per-flag rejection rates before committing the filtered corpus to disk.
Apply rule-based flags for length, language, and structural noise
from collections import Counter

TARGET_LANG = "hi"  # target language for the fine-tuning corpus; change as needed
MIN_WORDS = 30
MAX_WORDS = 10_000
MIN_QUALITY_SCORE = 0.35

def flag_document(record: dict) -> dict:
    """
    Apply rule-based flags. Sets 'flags' (list of rejection reasons) and
    'keep' (bool). Records with any flag are excluded from the filtered corpus.
    """
    flags = []
    word_count = len(record["text_clean"].split())
    if word_count < MIN_WORDS:
        flags.append("too_short")
    if word_count > MAX_WORDS:
        flags.append("too_long")
    if record.get("detected_lang") != TARGET_LANG:
        flags.append("wrong_language")
    if record.get("quality_score", 0) < MIN_QUALITY_SCORE:
        flags.append("low_quality_score")
    if not record.get("clean_pass", False):
        flags.append("failed_clean")
    record["flags"] = flags
    record["keep"] = len(flags) == 0
    return record

ds = ds.map(flag_document)

# Report rejection rate per flag before filtering
all_flags = [f for record in ds for f in record["flags"]]
print(Counter(all_flags))  # inspect before discarding flagged records

ds_filtered = ds.filter(lambda r: r["keep"])
Watch Out: Over-filtering is as damaging as under-filtering for minority-language or domain-specific corpora. A MIN_WORDS threshold calibrated on English news text will aggressively reject valid short-form content in agglutinative languages where information density per word is higher. Inspect your too_short rejection count by language before finalizing thresholds.
Calibrate thresholds so you do not over-prune useful data
Set thresholds on a stratified sample — not the full corpus — and measure rejection rate per source before locking them in. A threshold that drops 5% of English web text may drop 40% of speech transcript segments, which have structurally shorter sentences.
# Take up to 1,000 shuffled records per source type for threshold inspection
sample_df = (
    ds.shuffle(seed=42)
    .select_columns(["source", "keep"])
    .to_pandas()
    .groupby("source")
    .head(1000)
)
by_source = sample_df.groupby("source")["keep"].mean()
print(by_source)  # retention rate per source; investigate any source below 0.5
Pro Tip: When your corpus includes synthetic or teacher-generated examples (e.g., instruction pairs from a distillation pipeline), filter thresholds calibrated on natural text will systematically over-retain stylistically uniform synthetic samples while discarding rare natural examples. Apply a separate threshold configuration per source type and track retention rates independently. A high retention rate on synthetic data combined with a low rate on organic data is a signal of bias amplification, not quality improvement.
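A separate threshold configuration per source type could be sketched as follows. The source names, threshold values, and helper names here are placeholders for illustration; the record fields (source, text_clean, quality_score) follow the ones used earlier in this walkthrough.

```python
DEFAULT_THRESHOLDS = {"min_words": 30, "min_quality": 0.35}

# Illustrative per-source overrides; calibrate these on your own samples.
SOURCE_THRESHOLDS = {
    "speech": {"min_words": 8, "min_quality": 0.30},
    "synthetic": {"min_words": 30, "min_quality": 0.50},
}

def thresholds_for(source: str) -> dict:
    """Merge the defaults with any override registered for this source."""
    return {**DEFAULT_THRESHOLDS, **SOURCE_THRESHOLDS.get(source, {})}

def keep_record(record: dict) -> bool:
    limits = thresholds_for(record.get("source", ""))
    words = len(record["text_clean"].split())
    return words >= limits["min_words"] and record["quality_score"] >= limits["min_quality"]

def retention_by_source(records: list[dict]) -> dict[str, float]:
    """Share of records kept, tracked independently for each source."""
    kept, total = {}, {}
    for record in records:
        source = record.get("source", "unknown")
        total[source] = total.get(source, 0) + 1
        kept[source] = kept.get(source, 0) + int(keep_record(record))
    return {source: kept[source] / total[source] for source in total}

ten_words = "one two three four five six seven eight nine ten"
records = [
    {"source": "speech", "text_clean": ten_words, "quality_score": 0.40},
    {"source": "web", "text_clean": ten_words, "quality_score": 0.40},
    {"source": "synthetic", "text_clean": " ".join(["token"] * 40), "quality_score": 0.45},
]
rates = retention_by_source(records)
print(rates)
```

Keeping the overrides in one dictionary makes the per-source decisions reviewable alongside the retention rates they produce.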
Step 4: Deduplicate with MinHashLSH and text-dedup
Deduplication runs after filtering, not before — explained in detail in the subsection below. text-dedup provides the deduplication backend Setu references, with both exact-hash and MinHash/MinHashLSH modes. Its Spark implementation is designed for TB-scale text deduplication, making it the natural companion to Setu's Spark-native cleaning stages.
MinHashLSH answers a specific question: are two documents near-duplicates at the token level even if their byte sequences differ? It estimates Jaccard similarity over shingled token sets and uses locality-sensitive hashing to surface candidate pairs, which scales to billions of document pairs without all-pairs comparison. Because it measures lexical overlap, it does not capture semantic similarity between differently worded texts. As text-dedup's documentation states: "MinHash + MinHashLSH for near-duplicate detection."
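The estimation step can be illustrated with a small standard-library sketch. This is not text-dedup's implementation: the word 3-gram shingling, the seeded SHA-1 hash family, and all function names are assumptions chosen to keep the example self-contained. It covers only the MinHash estimate, not the LSH banding that avoids all-pairs comparison.

```python
import hashlib

def shingles(text: str, n: int = 3) -> set[str]:
    """Word n-gram shingles; n=3 is an illustrative choice."""
    tokens = text.lower().split()
    if len(tokens) < n:
        return {" ".join(tokens)} if tokens else set()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def _seeded_hash(seed: int, shingle: str) -> int:
    digest = hashlib.sha1(f"{seed}:{shingle}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def minhash_signature(items: set[str], num_perm: int = 128) -> list[int]:
    """One minimum per seeded hash function; items must be non-empty."""
    return [min(_seeded_hash(seed, s) for s in items) for seed in range(num_perm)]

def estimate_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of signature positions that agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def exact_jaccard(a: set[str], b: set[str]) -> float:
    return len(a & b) / len(a | b) if (a or b) else 0.0

doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
doc_b = "the quick brown fox jumps over the lazy dog near the old bridge"
doc_c = "completely unrelated sentence about spark cluster configuration and memory"

sh_a, sh_b, sh_c = shingles(doc_a), shingles(doc_b), shingles(doc_c)
sig_a, sig_b, sig_c = (minhash_signature(s) for s in (sh_a, sh_b, sh_c))

print("exact a~b:", round(exact_jaccard(sh_a, sh_b), 3))
print("estimated a~b:", round(estimate_jaccard(sig_a, sig_b), 3))
print("estimated a~c:", estimate_jaccard(sig_a, sig_c))
```

The estimate converges on the exact Jaccard value as the number of hash functions grows, which is the trade-off the num_perm setting controls.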
Choose exact-match versus fuzzy deduplication
Use exact-hash dedup when your corpus contains literal copy-paste duplicates: identical records from multiple crawl passes or mirrored sites. Use MinHashLSH when lightly edited duplicates, lightly reworded copies, or boilerplate-padded variants need to be collapsed. Heavily reworded paraphrases share too few shingles for MinHash to catch.
# Exact-hash deduplication (fast, zero false positives)
# text-dedup CLI: python -m text_dedup.exact_hash --path ./filtered --output ./dedup_exact
# MinHashLSH fuzzy deduplication via text-dedup's Spark implementation
# Launched as a Spark job to scale across large corpora
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.master("local[*]") \
.appName("setu-minhash-dedup") \
.config("spark.driver.host", "localhost") \
.getOrCreate()
# Assumes the filtered records were written to ./filtered_corpus as Parquet,
# with the cleaned text in the 'text_clean' column
df = spark.read.parquet("./filtered_corpus/*.parquet")
print(f"Records before dedup: {df.count()}")
# Run the deduplication itself through the text-dedup CLI, for example:
# python -m text_dedup.minhash \
#   --path ./filtered_corpus \
#   --output ./deduped_corpus \
#   --column text_clean \
#   --threshold 0.85 \
#   --num_perm 128
# --threshold: Jaccard similarity cutoff; 0.85 catches close near-duplicates
# --num_perm: number of hash permutations; more permutations give a lower-variance estimate
A Jaccard threshold of 0.85 is a reasonable starting point for natural-language corpora. Drop it to 0.7 only if your corpus contains paraphrase-heavy domains (e.g., news wire with multiple rewrites of the same story). Raise it to 0.95 for corpora where high similarity reflects legitimate repetition (legal clauses, technical specifications).
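How a threshold interacts with LSH banding can be seen from the standard candidate-probability formula, 1 - (1 - s^r)^b, for b bands of r rows each. The sketch below uses an illustrative 16 x 8 split of a 128-permutation signature; a deduplication library may derive its own band and row counts from the threshold, so treat these values as assumptions.

```python
def candidate_probability(similarity: float, bands: int, rows: int) -> float:
    """Probability that two documents share at least one identical LSH band."""
    return 1.0 - (1.0 - similarity ** rows) ** bands

# Illustrative banding: 16 bands x 8 rows = 128 permutations.
BANDS, ROWS = 16, 8

for s in (0.5, 0.7, 0.85, 0.95):
    p = candidate_probability(s, BANDS, ROWS)
    print(f"Jaccard {s:.2f} -> candidate probability {p:.3f}")
```

The curve is steep around the effective threshold, so pairs well below it are rarely compared at all, while pairs above it are almost always surfaced as candidates.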
Run deduplication after quality gates, not before
Setu's stage ordering (prepare → clean → filter → deduplicate) is deliberate. Running MinHashLSH before cleaning and filtering means the LSH index is built over noisy documents, and each near-duplicate cluster keeps one member without regard to quality. You may keep the noisier version.
Watch Out: Early deduplication locks in noise and structural biases. If two near-duplicate documents exist, one a clean, full article and one a partially OCR-corrupted version, LSH dedup run before cleaning cannot distinguish them. The retained document is chosen by cluster membership order, not quality. Always run quality gates first so the corrupted copy is flagged and dropped before dedup chooses a survivor.
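If your deduplication step exposes cluster membership (an assumption; not every tool does), the survivor can also be chosen by quality score rather than by order. The helper below is hypothetical and expects non-empty clusters of doc_id values plus a doc_id to quality_score mapping.

```python
def pick_survivors(clusters: list[list[str]], quality: dict[str, float]) -> list[str]:
    """Keep the highest-scoring document from each near-duplicate cluster."""
    survivors = []
    for cluster in clusters:
        # Ties are broken by doc_id so the choice is deterministic.
        best = max(cluster, key=lambda doc_id: (quality.get(doc_id, 0.0), doc_id))
        survivors.append(best)
    return survivors

# Illustrative clusters and scores.
clusters = [["doc-1", "doc-2"], ["doc-3"]]
quality = {"doc-1": 0.41, "doc-2": 0.78, "doc-3": 0.55}

survivors = pick_survivors(clusters, quality)
print(survivors)
```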
After deduplication completes, write the output as Parquet to preserve column types and compression ratio before the Hugging Face Datasets conversion step.
Step 5: Convert the cleaned corpus into Hugging Face Datasets
Spark's output partitions are Parquet files. Hugging Face Datasets reads Parquet natively, so the hand-off is a single load_dataset call. The Arrow-backed in-memory format then makes all subsequent split, map, and export operations operate at memory speed without re-parsing the source files, as documented in Hugging Face Datasets.
Load filtered records and preserve metadata
from datasets import load_dataset
# Load all Parquet partitions from the deduped Spark output directory
ds_clean = load_dataset(
"parquet",
data_files={"train": "deduped_corpus/*.parquet"},
split="train",
)
# Confirm required columns survived the pipeline
required_cols = ["text_clean", "source", "doc_id", "quality_score", "detected_lang", "flags"]
missing = [c for c in required_cols if c not in ds_clean.column_names]
assert not missing, f"Missing audit columns: {missing}"
print(ds_clean)
# Dataset({features: ['text_clean', 'source', 'doc_id', ...], num_rows: ...})
Keep doc_id, source, quality_score, and flags columns in the exported dataset. These fields let you trace a problematic training example back to its origin document and the filter decision that retained it — essential for post-training debugging.
Create train, validation, and test splits for SFT
Split after all filtering and deduplication. Splitting beforehand risks contaminating validation and test sets with near-duplicates of training examples that deduplication would have removed.
from datasets import DatasetDict
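# Random 90/5/5 split. To stratify by source instead, first convert the column with
# class_encode_column("source"), then pass stratify_by_column="source" to train_test_split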
ds_split = ds_clean.train_test_split(test_size=0.1, seed=42)
ds_val_test = ds_split["test"].train_test_split(test_size=0.5, seed=42)
final_ds = DatasetDict({
"train": ds_split["train"],
"validation": ds_val_test["train"],
"test": ds_val_test["test"],
})
# Export in a format compatible with SFT frameworks (e.g., TRL, LLaMA-Factory)
final_ds.save_to_disk("sft_corpus_v1")
# Also export JSONL for frameworks that do not read Datasets natively
final_ds["train"].to_json("sft_corpus_v1_train.jsonl", orient="records", lines=True)
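As a cheap sanity check on the splits above, the following sketch counts evaluation texts that also appear in the training split after case and whitespace normalization. It only catches exact matches after normalization, not near-duplicates, and the function names and sample texts are illustrative.

```python
import hashlib

def fingerprint(text: str) -> str:
    """Hash of the text after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def split_overlap(train_texts, eval_texts) -> int:
    """Count eval texts whose fingerprint also appears in the training split."""
    train_prints = {fingerprint(t) for t in train_texts}
    return sum(fingerprint(t) in train_prints for t in eval_texts)

# Illustrative texts; with the splits above, pass the text_clean columns instead.
train = ["Spark writes Parquet.", "Datasets reads Arrow."]
evaluation = ["spark  writes parquet.", "A different sentence."]

leaks = split_overlap(train, evaluation)
print(leaks)
```

A non-zero count after deduplication suggests the split was taken from a pre-dedup copy of the corpus.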
Verify the cleaned corpus before fine-tuning
Verification is where most pipeline implementations stop short — they filter, then immediately fine-tune. Checking corpus metrics before training catches threshold miscalibrations that are cheap to fix now and expensive to discover after a training run.
Check whether filtering improved sample quality
The table below provides a factual verification aid for pipeline review. Use it as a pass/fail checkpoint rather than a benchmark claim.
| Stage | Metric | Observation | Action |
|---|---|---|---|
| Raw extraction | Record count | Log your corpus baseline before cleaning | Establish the denominator for every downstream ratio |
| After cleaning | Quality score distribution | Plot the histogram and inspect for a bimodal split | Separate stub pages from long-form documents before choosing thresholds |
| After filtering | Rejection counts by flag | Review too_short, wrong_language, and low_quality_score counts | Tune thresholds per source if one class dominates |
| After MinHashLSH dedup | Duplicate clusters removed | Count cluster members retained after the LSH pass | Keep the higher-quality copy in each near-duplicate set |
Track lift from each stage independently. If cleaning removes fewer than 5% of records, your extraction is already clean and you may be over-investing in that stage. If filtering removes more than 50%, check whether your thresholds are too aggressive: revisit the MIN_QUALITY_SCORE and language filter settings.
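The per-stage check described above can be sketched as a small report over stage counts. The stage names and counts below are illustrative, not measurements, and the 5% and 50% cutoffs simply mirror the rules of thumb in the preceding paragraph.

```python
def stage_report(counts: dict[str, int]) -> list[tuple[str, float]]:
    """Return (stage, fraction dropped relative to the previous stage)."""
    stages = list(counts.items())
    report = []
    for (_, before), (name, after) in zip(stages, stages[1:]):
        dropped = (before - after) / before if before else 0.0
        report.append((name, dropped))
    return report

# Illustrative counts in pipeline order.
counts = {"raw": 10_000, "cleaned": 9_700, "filtered": 4_200, "deduped": 3_900}

report = stage_report(counts)
warnings = []
for name, dropped in report:
    print(f"{name}: {dropped:.1%} dropped")
    if name == "cleaned" and dropped < 0.05:
        warnings.append("cleaning removed under 5%")
    if name == "filtered" and dropped > 0.50:
        warnings.append("filtering removed over 50%")
print(warnings)
```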
Inspect retained examples by source and failure mode
import pandas as pd

df = ds_clean.to_pandas()

# Record count, median quality, and median length in words by source
summary = df.groupby("source").agg(
    count=("doc_id", "count"),
    median_quality=("quality_score", "median"),
    median_length=("text_clean", lambda x: x.str.split().str.len().median()),
).reset_index()
print(summary)

# Sample 5 retained records per source for manual review
for source, group in df.groupby("source"):
    print(f"\n=== {source} ===")
    print(group["text_clean"].sample(min(5, len(group)), random_state=0).to_string())
Manual inspection of 20–30 retained records per source class (web, PDF, speech) consistently surfaces systematic issues that metrics miss: OCR tables that survived as garbled token sequences, navigation text that trafilatura partially failed to strip, or speech transcripts with speaker labels embedded in the main text.
Common pitfalls when building the pipeline
Most Setu pipeline failures fall into three categories: environment setup errors that prevent startup, extraction mismatches that silently drop or corrupt documents, and filtering miscalibration that collapses the corpus before deduplication.
Watch Out: Windows-native execution fails silently in multiple places: shell scripts reference /bin/bash paths, some conda packages install Linux-native binaries, and Spark's process management assumes POSIX process groups. Use WSL2 rather than WSL1; WSL1's filesystem translation layer introduces enough overhead to make Spark local mode unreliable on large partitions. The spark.driver.host localhost setting is also WSL-specific: without it, the Spark driver binds to the WSL virtual NIC IP, which changes across reboots.
Fix WSL, Java, and Spark startup issues
# Check WSL version from a Windows terminal (PowerShell or CMD); the VERSION column should be 2
$ wsl --list --verbose
# Confirm JAVA_HOME points to OpenJDK 11, not a conflicting JDK
$ echo $JAVA_HOME
$ java -version # must show openjdk 11
# Confirm SPARK_HOME is set and on PATH
$ echo $SPARK_HOME
$ spark-submit --version # must show 3.5.1
# Run spark-shell as the quickstart recommends to confirm the runtime
$ spark-shell --master "local[2]"
# At the scala> prompt: spark.version // should return "3.5.1"
# :quit to exit
# If the driver fails to bind, check the WSL IP and compare to spark-defaults.conf
$ hostname -I
$ cat $SPARK_HOME/conf/spark-defaults.conf | grep driver.host
If hostname -I returns a non-loopback address and spark-defaults.conf has spark.driver.host localhost, PySpark will use loopback correctly. If the file is missing, copy the template: cp $SPARK_HOME/conf/spark-defaults.conf.template $SPARK_HOME/conf/spark-defaults.conf and add the driver host line manually.
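The same check can be scripted. The sketch below is a simplified parser for spark-defaults.conf style text that handles "key value" and "key=value" lines only; the function name and sample content are illustrative, and in practice the text would be read from $SPARK_HOME/conf/spark-defaults.conf.

```python
def read_spark_conf(text: str) -> dict[str, str]:
    """Parse 'key value' or 'key=value' lines, skipping blanks and comments."""
    conf = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        if "=" in parts[0]:
            # 'key=value' form with no whitespace before the separator
            key, _, value = line.partition("=")
        else:
            key = parts[0]
            value = parts[1] if len(parts) == 2 else ""
        conf[key.strip()] = value.strip()
    return conf

# Illustrative file content.
sample = """
# local WSL settings
spark.driver.host localhost
spark.executor.memory 4g
spark.driver.memory=2g
"""

conf = read_spark_conf(sample)
driver_host_ok = conf.get("spark.driver.host") in {"localhost", "127.0.0.1"}
print(conf, driver_host_ok)
```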
Avoid silent data loss during extraction or filtering
Silent data loss is the hardest pipeline failure to detect. A regex with a character class bug can match 30% of documents; an OCR call whose credential or quota error is swallowed by a broad except block can leave every PDF in a batch with empty text.
Pro Tip: Log rejection counts at every stage boundary (raw input count, post-extraction count, post-cleaning count, post-filter count, post-dedup count) and write these counts to a metrics file alongside the corpus version. A single-stage drop exceeding 20% beyond the expected range is a signal, not a feature. With Hugging Face Datasets, len(ds) before and after every .filter() or .map() call costs microseconds and should be standard practice throughout the pipeline.
# Wrap every filtering step with count logging
def filtered_with_audit(ds, filter_fn, stage_name: str):
    before = len(ds)
    ds_out = ds.filter(filter_fn)
    after = len(ds_out)
    # Guard against an empty input so the percentage never divides by zero
    rejected_pct = (before - after) / before * 100 if before else 0.0
    print(f"[{stage_name}] {before:,} → {after:,} records ({rejected_pct:.1f}% rejected)")
    return ds_out

ds_filtered = filtered_with_audit(ds, lambda r: r["keep"], "rule_based_filter")
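Persisting the counts next to the corpus version could look like the sketch below. The file name, key names, and counts are illustrative assumptions, and a temporary directory stands in for the real output location.

```python
import json
import tempfile
from pathlib import Path

def write_stage_metrics(path: Path, corpus_version: str, counts: dict[str, int]) -> dict:
    """Write stage counts and the corpus version to a JSON metrics file."""
    payload = {"corpus_version": corpus_version, "counts": counts}
    path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return payload

# Illustrative counts collected at each stage boundary.
stage_counts = {
    "raw_input": 10_000,
    "post_extraction": 9_800,
    "post_cleaning": 9_700,
    "post_filter": 4_200,
    "post_dedup": 3_900,
}

with tempfile.TemporaryDirectory() as tmp:
    metrics_path = Path(tmp) / "metrics.json"
    write_stage_metrics(metrics_path, "sft_corpus_v1", stage_counts)
    loaded = json.loads(metrics_path.read_text(encoding="utf-8"))

print(loaded["corpus_version"], loaded["counts"]["post_dedup"])
```

Keeping one metrics file per corpus version makes it possible to compare stage drops between runs when a threshold changes.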
FAQ
Do I need Linux or WSL to run Setu?
Yes. Setu's documentation states explicitly: "users who want to run the pipeline on Windows systems are advised to use WSL (Windows Subsystem for Linux) for easier usage." Multiple pipeline scripts use Linux-only paths and utilities. Native Windows execution is unsupported; WSL2 is the recommended workaround.
What is MinHashLSH used for in deduplication?
MinHashLSH identifies near-duplicate documents (text pairs that are lightly edited or boilerplate-padded copies of each other) without requiring all-pairs comparison. It estimates Jaccard similarity over token shingles using locality-sensitive hashing, making it practical for corpora with millions to billions of documents. text-dedup's implementation, used in Setu's deduplication stage, includes a Spark backend suitable for TB-scale datasets.
What is dataset filtering in machine learning?
Dataset filtering removes examples from a training corpus that are too short, off-domain, duplicated, or otherwise low-quality before fine-tuning. In Setu's pipeline, this is a staged process: cleaning normalizes text, quality scoring computes signals, rule-based flagging applies thresholds, and MinHashLSH deduplication removes near-duplicates.
How do you clean a dataset before fine-tuning?
Start with source-specific extraction (trafilatura for HTML, OCR for PDFs), normalize unicode, strip boilerplate and encoding artifacts, compute quality signals (word count, unique token ratio, alphabetic ratio, language), apply rule-based flags with calibrated thresholds, then deduplicate. Do not skip straight to loading raw text into Hugging Face Datasets — it does not perform source-level cleaning.
Can Hugging Face Datasets handle large datasets?
Yes. Hugging Face Datasets is backed by Apache Arrow and supports streaming and zero-copy reads, which means memory usage scales with batch size rather than dataset size. For TB-scale corpora, use streaming=True in load_dataset() to process without materializing the full dataset in RAM.
Sources & References
- Setu GitHub Repository — Primary source; pipeline architecture, quickstart, and environment requirements
- AI4Bharat Setu Documentation — Official pipeline overview and stage descriptions
- Hugging Face Datasets Documentation — API reference for loading, mapping, splitting, and exporting datasets
- Hugging Face Datasets GitHub — Source and README with batched map examples
- Apache Spark 3.5.1 Release — Runtime used by Setu's distributed processing stages
- trafilatura Documentation — Core Functions — HTML extraction API including deduplicate parameter
- trafilatura GitHub — Package overview and feature list
- text-dedup GitHub — MinHashLSH and exact-hash deduplication implementations including Spark backend
- text-dedup — socket.dev package listing — Package feature summary including Spark scale reference

