AI & ML

All about AI and machine learning: the latest articles and advances in the field.

AI & ML · Featured

FineWeb-Edu and the case for educational data in pre-training: what changed in MMLU and ARC

FineWeb-Edu is a 1.3T-token educational subset of FineWeb whose paper reports large gains on knowledge- and reasoning-heavy evaluations, including higher MMLU and ARC scores than unfiltered FineWeb. The lift comes from a carefully filtered educational slice, not from adding more generic web text.

Jul 11, 2026 · 20 min read

All articles

AI & ML

Mamba-2 and "Transformers are SSMs": what Structured State Space Duality changes in practice

Structured State Space Duality shows Mamba-2 and masked attention are two contraction orders over the same semiseparable structure — yielding a core layer that is 2–8× faster than Mamba’s fused scan and competitive with Transformers, but the gains are most compelling for long sequences and the paper still shows better quality when a few attention layers are mixed in.

23 min read

AI & ML

How to use PyTorch Context Parallel for long-context transformer training

PyTorch Context Parallel shards long sequences across devices so that each rank holds only its own slice of the context for attention and KV handling. This makes 1M-token training feasible in the PyTorch/Torchtitan workflow, but it remains a distributed training feature that depends on correct process-group setup, NCCL communication, and long-context-aware model partitioning.

20 min read

AI & ML

What RULER and LongBench v2 reveal about long-context benchmark failures

RULER demonstrates that needle-in-a-haystack is a superficial long-context test: models can score near-perfectly on it and still collapse on multi-hop tracing and aggregation as sequence length grows. LongBench v2 shows that realistic long-context tasks still defeat most models; the best direct-answer system reaches only 50.1%, and even human experts sit at 53.7% under time pressure.

18 min read

AI & ML

Should you extend context or retrain for long-context workloads? Lessons from RULER and LongBench v2

RULER shows that many models look near-perfect on vanilla needle-in-a-haystack yet suffer large drops as context length and task complexity rise, while LongBench v2 shows the best direct-answer model still reaches only 50.1% accuracy and o1-preview reaches 57.7% — but that gap does not automatically justify retraining, because the right choice depends on whether your workload needs deeper reasoning, not just longer windows.

21 min read

AI & ML

Should you use long context or retrieval-augmented generation for 100K-token workloads?

For 100K-token workloads, long context can be the right tool for global document understanding or implicit queries, but production economics are often brutal: the cited 2026 decision framework says 1M-token requests can be 30–60× slower and cost roughly 1,250× more per query than RAG. The main caveat is that long context still wins when the answer depends on relationships across the whole corpus.

17 min read

AI & ML

What RULER reveals about the real context size of long-context language models

RULER shows that near-perfect needle-in-a-haystack scores can mask steep degradation on harder long-context tasks. The paper evaluates 17 models across 13 tasks and finds that almost all drop sharply as context length increases, with only half of the models maintaining satisfactory performance at 32K. Even so, success on a synthetic benchmark does not guarantee real-world long-context reliability.

17 min read

AI & ML

Ambrosia vs Google's deduplicate-text-datasets: choosing a text-dedup pipeline for LLM training data

Google’s deduplicate-text-datasets provides exact substring deduplication in Rust plus near-duplicate clustering for large corpora, while Ambrosia is a lightweight package aimed at ergonomics — but the deciding constraint is scale and rigor, because Google’s repo is built for research-grade dataset deduplication with very large-memory jobs, whereas simpler tools trade accuracy and reproducibility for convenience.

19 min read

AI & ML

How to fine-tune Mixtral models with Megatron-Core MoE settings in 2026

Megatron-Core’s MoE stack is production-ready for large-scale MoE training and exposes routing, expert-parallel, and capacity controls that matter when fine-tuning Mixtral — but the official docs emphasize that the exact behavior depends on parallelism layout, router configuration, and capacity settings rather than a one-size-fits-all recipe.

15 min read

AI & ML

DeepSeek-V3 and the case for auxiliary-loss-free MoE benchmarks

DeepSeek-V3 is benchmark-relevant not just because it is large, but because it combines auxiliary-loss-free load balancing, multi-token prediction, and FP8 training at scale. MLCommons is now using it as a pretraining benchmark with a reference MoE setup of 671B total and 37B activated parameters, making it a meaningful test of modern sparse-training infrastructure rather than just another model card.

22 min read

AI & ML

LongRoPE internals: how non-uniform RoPE rescaling reaches 2M tokens without retraining from scratch

LongRoPE exploits two non-uniformities in RoPE interpolation, across RoPE dimensions and across token positions, and uses an evolutionary search to find per-dimension, per-position rescaling factors. This enables an 8× extension without fine-tuning, followed by a progressive 256k→2048k extension path, but the extended model still needs readjustment at short context lengths to recover original-window performance.

20 min read

AI & ML

DeepSeek-V3 auxiliary-loss-free load balancing: how the router changes MoE training

DeepSeek-V3 replaces the usual router auxiliary loss with a dynamically adjusted per-expert bias term for load balancing — preserving the load-balancing goal while avoiding the performance degradation the paper attributes to heavy auxiliary losses, but the benefit is tied to sequence-wise balance and node-limited routing rather than eliminating imbalance entirely.

22 min read
