DeepSeek-V3 and the case for auxiliary-loss-free MoE benchmarks

DeepSeek-V3 is benchmark-relevant not just because it is large, but because it combines auxiliary-loss-free load balancing, multi-token prediction, and FP8 training at scale — and MLCommons is now using it as a pretraining benchmark with a 671B/37B MoE reference setup, making it a meaningful test of modern sparse-training infrastructure rather than just another model card.

By AxiomLogica Editorial

May 26, 2026 · 22 min read

Reviewed by Editorial

What DeepSeek-V3 adds to the MoE benchmarking conversation

The addition of DeepSeek-V3 to MLCommons' MLPerf Training v6.0 benchmark suite shifts the conversation about MoE benchmarks from capability claims to infrastructure measurement. MLPerf Training has historically benchmarked dense models at scales where compute bottlenecks are well-understood; sparse MoE at frontier scale introduces qualitatively different challenges — expert parallelism, routing stability, load distribution — that those prior benchmarks never stress-tested. DeepSeek-V3 gives the field a concrete reference point with published methodology rather than another closed leaderboard entry.

Bottom Line: DeepSeek-V3 matters for MoE benchmarks not primarily because of its downstream scores, but because it is the first auxiliary-loss-free, large-scale sparse model — 671B total parameters, 37B active per token, trained on 14.8T tokens — to enter the MLCommons standardization process. MLPerf Training v6.0 uses it to measure how fast systems train a MoE model to a 3.6 log perplexity target, as defined on the MLCommons benchmark page. That framing makes the benchmark a test of sparse-training infrastructure, not a proof of general model superiority. Researchers reading the paper for design signals must separate the architectural contributions (auxiliary-loss-free routing, multi-token prediction, FP8 training) from the benchmark outcome, which measures convergence speed on one reference task.

As MLCommons states: "MLPerf™ Training v6.0 adds a large-scale pretraining benchmark built on DeepSeek-V3, a Mixture-of-Experts (MoE) architecture with 671B total parameters, of which 37B are activated per token." The MLPerf Training benchmark suite measures how fast systems can train models to a target quality metric — which means the DeepSeek-V3 benchmark is explicitly a training-efficiency test, not an evaluation of chat quality, safety, or downstream task accuracy.

Why auxiliary-loss-free load balancing is the methodological hinge

The single most consequential design decision in DeepSeek-V3 for the benchmarking community is its auxiliary-loss-free load balancing strategy. Classic MoE training adds an auxiliary loss term to the primary language modeling objective to keep tokens from concentrating on a few overloaded experts while the rest go undertrained. DeepSeek-V3 drops that term as the primary balancing mechanism and replaces it with a bias-based correction applied during expert selection; the report keeps only a complementary sequence-wise balance loss with a very small weight as a guard against extreme imbalance within a single sequence. The paper explicitly frames this as a pioneering approach: "DeepSeek-V3 pioneers an auxiliary-loss-free strategy for load balancing."

Why this matters for the benchmark: auxiliary loss terms create an optimization tension. The model is simultaneously minimizing next-token prediction loss and an auxiliary routing regularizer whose scale must be tuned carefully. Too large and the model sacrifices prediction quality for balance; too small and balance degrades. Any benchmark that aggregates multiple training runs or compares across hardware stacks must contend with this sensitivity. Moving balance control out of the loss removes that loss-weight hyperparameter from the benchmark surface (at the cost of a bias update speed, which does not enter the gradient), making comparisons cleaner.
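To make the tuning tension concrete, here is a minimal sketch of one widely used auxiliary balancing penalty, in the style of the Switch Transformer formulation (alpha times the number of experts times the sum over experts of token fraction times mean router probability). It is an illustration of the classic approach, not DeepSeek-V3's code; the function name, the top-1 routing simplification, and the `alpha` value are assumptions made for the example.

```python
def auxiliary_balance_loss(assignments, router_probs, num_experts, alpha=0.01):
    """Switch-style balance penalty: alpha * N * sum_i(f_i * P_i).

    assignments: expert index chosen for each token (top-1 routing, for simplicity).
    router_probs: per-token list of routing probabilities over all experts.
    alpha: loss weight that has to be tuned; 0.01 is only an illustrative value.
    """
    num_tokens = len(assignments)
    # f_i: fraction of tokens dispatched to expert i.
    frac_tokens = [assignments.count(i) / num_tokens for i in range(num_experts)]
    # P_i: mean router probability assigned to expert i.
    mean_prob = [
        sum(p[i] for p in router_probs) / num_tokens for i in range(num_experts)
    ]
    return alpha * num_experts * sum(f * p for f, p in zip(frac_tokens, mean_prob))


# Perfectly balanced routing over two experts vs. fully collapsed routing.
balanced = auxiliary_balance_loss([0, 1], [[0.5, 0.5], [0.5, 0.5]], num_experts=2)
collapsed = auxiliary_balance_loss([0, 0], [[1.0, 0.0], [1.0, 0.0]], num_experts=2)
```

The penalty is smallest when load is uniform and grows as routing collapses, and because it is added to the training loss, its weight `alpha` directly trades off against next-token prediction.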

Pro Tip: The key distinction is not just "no auxiliary loss" but how balance is maintained without it. Classic auxiliary-loss approaches impose a gradient signal that distorts the primary objective; the bias-term approach in DeepSeek-V3 applies corrections at the routing decision boundary without altering gradients through the loss. For benchmark readers, this means that training curves from auxiliary-loss-free runs are cleaner baselines — they don't carry the artifact of a competing objective — but they require validating that routing actually stayed balanced throughout, which is a separate empirical check the paper provides.

What load balancing is trying to prevent in sparse experts

Expert imbalance in MoE models produces two failure modes with distinct consequences. The first is routing collapse, described in the DeepSeek-V3 report as the scenario where tokens concentrate on a small subset of experts, reducing the effective capacity of the model to the compute footprint of those few experts. The second is throughput degradation under expert parallelism: when experts are sharded across devices, uneven load creates stragglers, and the remaining accelerators sit idle while the overloaded device finishes its batch. At the scale of DeepSeek-V3's training (thousands of H800 GPUs), even a hypothetical 10–20% straggler penalty sustained across the 2.788M GPU-hour budget would amount to hundreds of thousands of GPU hours.

The report is direct about the mechanism: "An unbalanced expert load will lead to routing collapse (Shazeer et al., 2017) and diminish computational efficiency in scenarios with expert parallelism."
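The straggler effect can be estimated with simple arithmetic. The sketch below assumes that step time on each device is proportional to the number of tokens routed to its experts and that all devices wait for the slowest one; both are simplifications, and the token counts are made-up inputs rather than measurements from any real run.

```python
def straggler_idle_fraction(tokens_per_device):
    """Fraction of total device-time spent idle when every device waits for
    the most heavily loaded one (step time assumed proportional to tokens)."""
    slowest = max(tokens_per_device)
    busy = sum(tokens_per_device)
    return 1.0 - busy / (slowest * len(tokens_per_device))


# Hypothetical per-device token counts for one expert-parallel step.
balanced_idle = straggler_idle_fraction([100, 100, 100, 100])
skewed_idle = straggler_idle_fraction([160, 80, 80, 80])
```

With the skewed counts, the step takes as long as the 160-token device, so the four devices together spend 37.5% of their available time waiting.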

Watch Out: Expert imbalance is a training-process problem, not a benchmark score. A model can achieve strong perplexity on a benchmark while having experienced significant routing imbalance during training — the perplexity number alone does not tell you whether routing stayed healthy. Benchmark readers who see a good perplexity result and conclude "the routing worked well" are skipping an empirical step. DeepSeek-V3's paper provides routing-balance evidence separately; that evidence should be read alongside the benchmark numbers, not treated as implied by them.

How the bias-term approach changes expert selection

The auxiliary-loss-free method replaces the optimization-time penalty with a post-step correction to the scores used for expert selection. During each forward pass, experts are selected according to a biased score: the raw learned affinity shifted by a per-expert bias term. The bias influences only which experts are chosen; the gating weights applied to expert outputs are still derived from the unbiased affinities. After each training step, the bias terms are updated based on observed load: overloaded experts receive a downward adjustment, underloaded experts receive an upward adjustment. The update is independent of the gradient computation through the primary loss.
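The selection and update steps described above can be sketched in a few lines of plain Python. This is a single-token, single-layer illustration under stated assumptions: affinities are positive, "overloaded" means above the mean load for the step, and the update is a fixed step of size `gamma` in either direction. The function names are hypothetical and this is not the DeepSeek-V3 implementation.

```python
def route_top_k(affinities, biases, k):
    """Pick the k experts with the highest (affinity + bias) for one token.

    Gate weights come from the raw affinities of the chosen experts, so the
    bias changes which experts are selected but not how outputs are weighted.
    """
    biased = [s + b for s, b in zip(affinities, biases)]
    chosen = sorted(range(len(biased)), key=lambda i: biased[i], reverse=True)[:k]
    total = sum(affinities[i] for i in chosen)
    gates = {i: affinities[i] / total for i in chosen}
    return chosen, gates


def update_biases(biases, expert_loads, gamma):
    """Post-step correction: lower the bias of experts above the mean load and
    raise the bias of experts below it. No gradients are involved."""
    mean_load = sum(expert_loads) / len(expert_loads)
    new_biases = []
    for bias, load in zip(biases, expert_loads):
        if load > mean_load:
            new_biases.append(bias - gamma)
        elif load < mean_load:
            new_biases.append(bias + gamma)
        else:
            new_biases.append(bias)
    return new_biases


# Illustrative values only.
affinities = [0.9, 0.8, 0.1, 0.2]
chosen, gates = route_top_k(affinities, [0.0, 0.0, 0.0, 0.0], k=2)
new_biases = update_biases([0.0, 0.0, 0.0, 0.0], expert_loads=[10, 6, 1, 3], gamma=0.001)
# A large negative bias on expert 0 pushes the token to other experts.
shifted, shifted_gates = route_top_k(affinities, [-1.0, 0.0, 0.0, 0.0], k=2)
```

Because `update_biases` runs outside the loss, the balancing signal never competes with the language modeling gradient, which is the property the benchmark discussion relies on.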

DeepSeek-V3 sets the bias update speed $\gamma = 0.001$ for the first 14.3T tokens of training and then sets $\gamma = 0.0$ for the final 500B tokens, effectively freezing the biases at the end of training. The two-phase schedule is a concrete ablation point. One plausible reading is that continued bias updates add little, or may even interfere with final-stage convergence, once routing patterns have stabilized, but the schedule as reported does not establish that on its own. That schedule detail, the phase boundary at 14.3T tokens, is the kind of implementation-specific signal that follow-up researchers need to replicate or vary in their own experiments.
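Expressed as code, the two-phase schedule is a step function of tokens seen. The sketch below uses the values quoted above; the function and constant names are hypothetical, and the `boundary` parameter is exposed only so that follow-up experiments can vary the freeze point.

```python
PHASE_BOUNDARY_TOKENS = 14_300_000_000_000  # 14.3T tokens with a nonzero update speed
TOTAL_TOKENS = 14_800_000_000_000           # 14.8T tokens in the full run


def bias_update_speed(tokens_seen, active_gamma=0.001, boundary=PHASE_BOUNDARY_TOKENS):
    """Return gamma for the current point in training: active before the
    boundary, zero (biases frozen) from the boundary onward."""
    return active_gamma if tokens_seen < boundary else 0.0


frozen_tokens = TOTAL_TOKENS - PHASE_BOUNDARY_TOKENS  # the final 500B tokens
```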

Benchmark setup: what MLCommons is actually standardizing

MLCommons' MLPerf Training benchmarks define a standardized task as: train a reference model from scratch until it reaches a specified target quality metric, and measure the wall-clock time to do so across different hardware configurations. Participating organizations submit results using their hardware stacks (typically clusters of NVIDIA or AMD accelerators, or custom silicon) training the same reference model from the same starting point to the same target. The benchmark does not fix the hardware or the software stack — it fixes the model, the data, and the quality threshold.

For the DeepSeek-V3 benchmark added in MLPerf Training v6.0, the reference setup is the 671B-total / 37B-active MoE configuration, and the target metric is a log perplexity of 3.6. This is a pretraining-convergence benchmark: reaching 3.6 log perplexity on a fixed evaluation set signals that the model has absorbed enough of the training distribution to meet the standardized threshold. The benchmark does not evaluate downstream task performance, instruction-following quality, or any post-training capability.
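The measurement itself is simple to sketch. Assuming a run emits periodic (elapsed seconds, log perplexity) evaluation records, time-to-target is the timestamp of the first evaluation at or below the threshold. The function name and the log values are hypothetical illustrations, not MLPerf's reference harness or real submission data.

```python
def time_to_target(eval_log, target_log_ppl):
    """Return the elapsed seconds at the first evaluation that meets the target.

    eval_log: iterable of (elapsed_seconds, log_perplexity) pairs in time order.
    Returns None if the run never reaches the target.
    """
    for elapsed_seconds, log_ppl in eval_log:
        if log_ppl <= target_log_ppl:
            return elapsed_seconds
    return None


# Made-up evaluation logs for two systems training the same reference model.
system_a = [(3600, 4.2), (7200, 3.8), (10800, 3.59)]
system_b = [(3600, 4.0), (7200, 3.6), (10800, 3.5)]

time_a = time_to_target(system_a, target_log_ppl=3.6)
time_b = time_to_target(system_b, target_log_ppl=3.6)
```

In this toy comparison system B wins because it crosses the threshold one evaluation earlier, even though both systems end up below 3.6.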

Item	Specification	Value	Context
Benchmark suite	Task definition	MLPerf Training v6.0 large-scale pretraining	Train to a fixed quality target and measure wall-clock time
Reference model	Model scale	DeepSeek-V3 MoE	671B total parameters / 37B active per token
Target quality metric	Standardized threshold	3.6 log perplexity	Pretraining convergence on a fixed evaluation set
Training data scale	Reference corpus	14.8T tokens	Reference budget used in the DeepSeek-V3 report
Reference compute	Paper context	2.788M H800 GPU hours	Context for the scale of the training run

The metric choice matters: log perplexity on a fixed held-out set is a training-convergence proxy, not a measure of downstream utility. Two systems that both reach 3.6 log perplexity in different wall-clock times may produce models with indistinguishable downstream quality — or meaningfully different quality if the faster system reached the threshold via a different training trajectory. The benchmark measures infrastructure, not final model capability.
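For readers less familiar with the metric, log perplexity is the mean negative log-likelihood per token on the evaluation set. The sketch below assumes natural logarithms; whether the benchmark's 3.6 target uses that base is an assumption here and should be checked against the MLCommons benchmark definition. The function name is hypothetical.

```python
import math


def log_perplexity(token_log_probs):
    """Mean negative log-likelihood per token (natural log assumed)."""
    return -sum(token_log_probs) / len(token_log_probs)


# A model that assigns probability 0.5 to every evaluation token.
example = log_perplexity([math.log(0.5)] * 4)

# Under the natural-log assumption, a log perplexity of 3.6 corresponds to a
# perplexity of exp(3.6), a little under 37.
equivalent_perplexity = math.exp(3.6)
```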

Reference model scale, token budget, and compute context

The DeepSeek-V3 technical report provides the training-scale numbers that MLCommons adopts as the reference context. The full training run consumed 14.8T tokens on a corpus covering code, mathematics, and multilingual text, completed in 2.788M H800 GPU hours. The paper frames this as efficient for the model's scale: "Despite its excellent performance, DeepSeek-V3 requires only 2.788M H800 GPU hours for its full training."

For the MLPerf benchmark, the 14.8T-token corpus is the reference data budget and the 2.788M GPU-hour figure contextualizes what reaching the perplexity target required at DeepSeek's infrastructure. Participating organizations in MLPerf do not need to reproduce the full 14.8T-token run — they train to the specified quality threshold, which may be reached at a shorter token budget depending on system efficiency.
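A few derived quantities help calibrate these figures. The sketch below divides the reported totals to get coarse averages; the reported GPU-hour total may cover stages beyond the main pretraining run, so the per-token average is only a rough guide. The cluster size passed to `wall_clock_days` is an input to the calculation, and 2,048 GPUs is used purely as an illustrative value.

```python
GPU_HOURS = 2.788e6   # reported H800 GPU hours for the full training
TOKENS = 14.8e12      # reported training tokens

# Coarse averages over the whole run.
gpu_hours_per_trillion_tokens = GPU_HOURS / (TOKENS / 1e12)
tokens_per_gpu_hour = TOKENS / GPU_HOURS


def wall_clock_days(gpu_hours, num_gpus):
    """Ideal wall-clock duration if the GPU hours are spread evenly."""
    return gpu_hours / num_gpus / 24


days_on_2048_gpus = wall_clock_days(GPU_HOURS, num_gpus=2048)
```

On these inputs the average is roughly 188K GPU hours per trillion tokens, and the full budget corresponds to just under two months of wall-clock time on a 2,048-GPU cluster.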

Pro Tip: The 2.788M H800 GPU-hour figure is a training-scale reference point from one lab's implementation, not a deployment cost or a universal budget estimate. It reflects DeepSeek's specific cluster topology, parallelism strategy, and training recipe. A different lab running the same model with different expert-parallel sharding or a different optimizer could require substantially more or fewer GPU hours to reach the same perplexity target. Treat it as an order-of-magnitude calibration for frontier MoE pretraining, not a reproducibility specification.

Why a pretraining benchmark is different from a post-training eval

Pretraining benchmarks and post-training evaluations measure fundamentally different properties. MLPerf Training measures training system throughput: given a fixed model, fixed data, and a fixed quality threshold, how fast does your hardware converge? Post-training evaluations — MMLU, HumanEval, MT-Bench, and their variants — measure the downstream capability of a trained model checkpoint.

"The MLPerf Training benchmark suite measures how fast systems can train models to a target quality metric," MLCommons says. These two measurements are related but not equivalent. A system that reaches 3.6 log perplexity faster does not necessarily produce a model with higher MMLU or better code generation than a slower system that reaches the same perplexity. Perplexity is a proxy for distributional fit, not a direct measure of task performance.

Watch Out: Do not import MLPerf Training benchmark results into claims about downstream model quality. A hardware vendor or lab that achieves the fastest time-to-3.6-perplexity result in MLPerf Training v6.0 has demonstrated efficient sparse-model training infrastructure. They have not demonstrated that their resulting model is a better assistant, a more capable coder, or a safer deployment target than a model trained more slowly on the same architecture. These are separate empirical questions requiring separate evaluations.

Results from the DeepSeek-V3 report that matter for benchmark readers

The DeepSeek-V3 technical report's key numbers for benchmark-oriented readers cluster around three axes: scale, stability, and efficiency. The model architecture is 671B total parameters with 37B activated per token, trained on 14.8T tokens using 2.788M H800 GPU hours. The paper reports that training was stable throughout: "Throughout the entire training process, we did not experience any irrecoverable loss spikes or perform any rollbacks." It also summarizes the run in its own words: "In addition, its training process is remarkably stable."

Category	Metric	Reported value	Source
Scale	Total parameters	671B	DeepSeek-V3 Technical Report
Scale	Active parameters per token	37B	DeepSeek-V3 Technical Report
Efficiency	Training token budget	14.8T	DeepSeek-V3 Technical Report
Efficiency	Training compute	2.788M H800 GPU hours	DeepSeek-V3 Technical Report
Benchmark target	MLPerf perplexity target	3.6 log perplexity	MLCommons Benchmark Page

The stability claim is significant for benchmark interpretation: the absence of irrecoverable loss spikes across a 14.8T-token run is evidence that the auxiliary-loss-free routing mechanism did not destabilize training at this scale. Prior MoE training runs at comparable scale have reported routing instability, particularly when load-balancing hyperparameters were mistuned. DeepSeek-V3's clean training curve is a positive data point for the method, though it is one data point on one hardware stack.

671B total parameters, 37B active: what that means in practice

The 671B / 37B framing describes the sparse-activation property of MoE models: the full parameter count resides in the model's expert weights, but each token's forward pass only activates a fraction of them. The activation ratio is approximately 5.5%:

$\text{activation ratio} = \frac{37\text{B}}{671\text{B}} \approx 0.055$

In practice this means the per-token compute cost is calibrated to a 37B-parameter dense equivalent, while the model's total representational capacity is that of a 671B-parameter model. The MoE premise is that a 37B active-parameter forward pass through a 671B-parameter routing space should produce better predictions than a 37B dense model, because different tokens activate different experts specialized to their content.
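As a rough worked estimate of what "calibrated to a 37B-parameter dense equivalent" implies, the common rule of thumb for dense transformers, about 6 FLOPs per parameter per training token, can be applied to the active parameter count. This is an approximation layered on top of the reported figures, not a number from the report; it ignores attention cost, routing overhead, and multi-token prediction, and the symbols $C$, $N_{\text{active}}$, and $D$ are introduced here for the example.

$C \approx 6 \cdot N_{\text{active}} \cdot D = 6 \times (37 \times 10^{9}) \times (14.8 \times 10^{12}) \approx 3.3 \times 10^{24} \text{ FLOPs}$

The same rule applied to all 671B parameters would give a figure about 18 times larger, which is the gap that sparse activation is meant to exploit.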

For benchmark readers, the 37B active-parameter figure is the operationally relevant compute cost per token. The 671B total-parameter figure governs memory requirements for the weight tensors: every expert's weights must be resident somewhere in the cluster's GPU memory even though only about 5.5% of the parameters execute for any given token. At H800-class memory capacity (~80 GB HBM3 per GPU), holding the full 671B-parameter weight set for serving or training requires sharding across many GPUs, typically with multi-node expert parallelism, regardless of the per-token FLOPs.
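A back-of-the-envelope calculation shows why. The sketch below counts only the bytes needed to hold the weights, using the ~80 GB figure above; it ignores optimizer state, gradients, activations, and KV caches, all of which raise the real requirement substantially during training. The function name and the choice of 1 and 2 bytes per parameter (FP8 and BF16 storage) are assumptions for the example.

```python
import math

TOTAL_PARAMS = 671e9
HBM_BYTES_PER_GPU = 80e9  # ~80 GB per GPU, as cited above


def min_gpus_for_weights(total_params, bytes_per_param, hbm_bytes=HBM_BYTES_PER_GPU):
    """Lower bound on GPU count needed just to store the weights."""
    return math.ceil(total_params * bytes_per_param / hbm_bytes)


fp8_gpus = min_gpus_for_weights(TOTAL_PARAMS, bytes_per_param=1)
bf16_gpus = min_gpus_for_weights(TOTAL_PARAMS, bytes_per_param=2)
```

Even this optimistic lower bound exceeds a typical 8-GPU node at 1 byte per parameter, so the weights alone force a multi-node layout.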

What the paper claims about training stability and cost

DeepSeek-V3's two most cited training claims are the stability record (no irrecoverable spikes, no rollbacks across 14.8T tokens) and the compute efficiency figure (2.788M H800 GPU hours). Together, these characterize the training run as both stable and unusually efficient for its scale. The report presents these results alongside its auxiliary-loss-free routing design and FP8 mixed-precision training, which reduces memory and bandwidth pressure; it does not isolate how much each choice contributes at full scale.

The scale figures belong alongside the stability claim: DeepSeek-V3 trained on 14.8T tokens and completed the full run in 2.788M H800 GPU hours. MLCommons builds its benchmark on the same 671B/37B model, so a benchmark-facing reading depends on both the convergence target and the cost to reach it.

Pro Tip: The 2.788M H800 GPU-hour figure implies efficiency relative to the model's scale, but these numbers are evidence of what one lab achieved under one set of infrastructure choices — not a forecast for other organizations. Variables that would shift the GPU-hour total include cluster interconnect topology (NVLink vs. InfiniBand bandwidth affects expert-parallel all-to-all communication directly), batch size and gradient accumulation strategy, the FP8 implementation quality, and dataset preprocessing overhead. The number is useful for calibrating magnitude; treat any reproduction that claims to match it without publishing comparable infrastructure details with skepticism.

What the benchmark results do not prove about future MoE models

DeepSeek-V3's entry into MLPerf Training v6.0 establishes a concrete large-scale MoE benchmark, but the results it will generate — and the results the paper itself reports — are bounded claims about one reference implementation, one hardware generation, and one training recipe. The MoE benchmarking field has a pattern of over-reading single strong results as architectural verdicts. The DeepSeek-V3 case requires more care.

MLCommons adds DeepSeek-V3 as a benchmark, not as an architectural endorsement. The benchmark announcement is explicit: it is testing how fast systems train this specific model to this specific quality threshold. Whether auxiliary-loss-free routing generalizes to different expert counts, different routing fanout (top-$K$ values other than those DeepSeek-V3 uses), different data distributions, or different hardware interconnect topologies is not answered by the benchmark result.

Watch Out: A single benchmark result from one auxiliary-loss-free MoE implementation cannot establish a universal superiority claim for sparse over dense models, or for bias-based routing over auxiliary-loss routing. DeepSeek-V3 demonstrates that this design is stable and efficient at this scale on this hardware. Researchers designing new MoE architectures should treat it as a strong design point, not as a validated general theory of sparse-model training.

Why strong benchmark numbers are not the same as a general routing theory

The auxiliary-loss-free method's reported success in DeepSeek-V3 arises from a specific combination of choices: the bias-update rule, the two-phase $\gamma$ schedule (active for 14.3T tokens, frozen for the final 500B), the expert count and routing fanout in DeepSeek's MoE configuration, and the training data mixture. The bias-based routing mechanism, which selects experts by biased gating scores and adjusts biases post-step, is a design-specific procedure, not a theorem.

Strong benchmark numbers confirm that the method works in its tested regime. They do not confirm that removing auxiliary losses improves MoE training in general, because the comparison class matters: auxiliary-loss-free routing vs. carefully tuned auxiliary-loss routing vs. token-choice routing with capacity buffers are three distinct points in routing-design space. The DeepSeek-V3 paper does not conduct exhaustive head-to-head ablations against all alternatives at full scale.

Pro Tip: Before adopting auxiliary-loss-free routing in a new MoE experiment, compare against at least one well-tuned auxiliary-loss baseline at the same model scale and data budget. The bias-update mechanism's advantage may be sensitive to expert count, routing fanout, or data distribution in ways the DeepSeek-V3 paper doesn't fully characterize. The original auxiliary-loss-free MoE paper provides a starting point for baseline comparisons.

Where the evidence is still thin for practitioners

The DeepSeek-V3 benchmark success does not settle the question of whether MoE models are preferable to dense models across practical workloads. MLCommons' MLPerf Training benchmarks measure training convergence efficiency under a fixed quality target — they do not compare MoE and dense models on downstream accuracy, inference latency, serving cost, or fine-tuning behavior.

A practitioner choosing between a 37B dense model and a 671B/37B-active MoE model for a production workload faces tradeoffs the benchmark doesn't address: the MoE requires more total GPU memory for weight storage, introduces routing overhead per forward pass, and may exhibit different fine-tuning stability than a dense model. The benchmark result confirms that the MoE architecture trains efficiently to a pretraining quality target; it does not confirm that it is the better choice for a specific downstream application.

Watch Out: Benchmark success on a pretraining convergence target does not resolve architecture choice across production workloads. The MoE vs. dense decision depends on serving infrastructure constraints, fine-tuning data volume, inference latency targets, and memory budget — none of which the MLPerf Training v6.0 benchmark directly measures.

How researchers should read the MLCommons angle

MLCommons positions itself as providing open, industry-standard benchmarks for measuring AI quality, performance, and risk. DeepSeek-V3's inclusion in MLPerf Training v6.0 means the model joins a standardized benchmark corpus rather than remaining solely a paper result. That transition matters along four dimensions.

MLCommons makes that mission explicit: "At MLCommons, we democratize AI through open, state-of-the art industry-standard benchmarks and data tooling to measure quality, performance, and risk." The benchmark framing matters because it turns DeepSeek-V3 from a one-off report into a shared reference point for sparse-training infrastructure.

Dimension	Paper-level claims	Benchmark standardization	What MLPerf-style reuse adds
Measurement	Self-reported results from the lab's own evaluation setup	3.6 log perplexity on standardized target	Cross-lab comparability under fixed quality bar
Hardware scope	One cluster (H800, DeepSeek's topology)	Any MLPerf-submitting hardware stack	Multi-vendor, multi-accelerator training efficiency data
Reproducibility	Training recipe partially disclosed	Reference model and target fixed for all submitters	Structural incentive for open methodology disclosure
Generalization	Single training run	Multiple independent implementations	Identifies which results are infrastructure-dependent vs. robust

The paper-level result tells you what DeepSeek achieved on their cluster. The MLPerf benchmark result, once submissions arrive, tells you what different hardware stacks achieve on the same model and target — which is the comparison that reveals whether DeepSeek's efficiency numbers are infrastructure-specific or architecture-driven.

Why standardization matters for frontier sparse training

Frontier MoE training at 671B-parameter scale is not yet a commodity operation. Expert parallelism, all-to-all communication patterns, FP8 numerical formats, and dynamic load balancing are implementation details that vary substantially across labs and hardware vendors. Without a standardized benchmark, efficiency claims are impossible to compare: one lab's "2.788M GPU hours" uses NVLink-dense clusters; another's might use InfiniBand-connected islands with different all-to-all bandwidth profiles. The resulting numbers are not comparable without the benchmark's controlled methodology.

Pro Tip: MLCommons' open benchmark methodology creates a structural incentive for labs to disclose enough of their implementation to submit reproducible results. For sparse-model training specifically — where expert-parallel communication patterns are a dominant performance variable — this means the benchmark generates cross-hardware data that the field cannot produce through individual paper publications alone. Researchers studying frontier MoE infrastructure should track MLPerf Training v6.0 submission results, not just the DeepSeek-V3 paper, as the submission round closes.

Implications for frontier MoE experiments and survey writing

DeepSeek-V3 provides the MoE field with its most complete large-scale design point as of mid-2026: auxiliary-loss-free routing, multi-token prediction objective, FP8 training, 671B total / 37B active parameters, 14.8T training tokens. Survey writers and experiment designers need to handle these contributions as separable variables rather than a single package.

The architectural contributions — auxiliary-loss-free routing and multi-token prediction — are independently interesting design choices. The benchmark outcome — stable training to a perplexity target at 2.788M H800 GPU hours — is a consequence of the full recipe. Attributing the benchmark result to any single design choice without ablations overstates the evidence.

Routing method	Choose this when	Use this approach	Tradeoff
Auxiliary-loss-free routing	Training runs are long enough for the two-phase $\gamma$ schedule to operate (DeepSeek-V3's ran 14.8T tokens)	Bias updates keep expert load balanced without an auxiliary objective	Requires load monitoring and a freeze point that may be model-specific
Classic auxiliary-loss routing	You need tighter control over routing entropy or shorter runs	Add a balancing penalty to the loss and tune its scale directly	Introduces an extra hyperparameter that can compete with prediction loss
Auxiliary-loss-free routing	Expert-parallel communication is the primary bottleneck	Use post-step bias corrections to reduce routing collapse risk	Advantage may weaken when expert count or data mixture shifts
Classic auxiliary-loss routing	You lack per-expert load monitoring	Keep the auxiliary loss as a training-time safety net	Less clean optimization surface during comparison runs

Routing dynamics, ablation reading, and what to compare next

The most informative follow-up experiments from the DeepSeek-V3 result are not replications; they are ablations. The $\gamma$ schedule (0.001 for 14.3T tokens, 0.0 for the final 500B) is the most actionable single variable. Does freezing $\gamma$ at 14.3T tokens improve final perplexity? What happens at 10T or 12T? Does the optimal freeze point correlate with when routing entropy stabilizes?

The paper reports no irrecoverable loss spikes and no rollbacks across the full 14.8T-token run, a stability claim that is the strongest positive result for auxiliary-loss-free routing at this scale. But a single stable run on one hardware configuration is not sufficient evidence to characterize the mechanism's stability margin. The failure boundary (which $\gamma$ values, expert counts, or data distributions produce instability) remains uncharacterized in the published work.

Pro Tip: The highest-value ablations for this work compare: (1) auxiliary-loss-free routing vs. well-tuned auxiliary-loss routing at the same scale on the same data; (2) the two-phase $\gamma$ schedule vs. a constant $\gamma$ throughout training; (3) the bias-update rule under different expert counts and routing fanout values. Without at least comparison (1), the efficiency gain attributed to auxiliary-loss-free routing cannot be separated from other recipe choices. If you are designing an MoE experiment to evaluate routing methods, structure it to isolate these variables, and report routing load statistics as first-class outputs alongside perplexity.
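As one possible shape for those routing load statistics, the sketch below summarizes per-expert token counts for a layer into three numbers: the maximum load relative to the mean, the normalized entropy of the load distribution, and the fraction of experts that received no tokens. The choice of metrics and the function name are assumptions for illustration; the DeepSeek-V3 report does not prescribe this particular set.

```python
import math


def routing_load_stats(tokens_per_expert):
    """Summarize how evenly tokens were spread across experts (needs >= 2 experts)."""
    total = sum(tokens_per_expert)
    num_experts = len(tokens_per_expert)
    fractions = [t / total for t in tokens_per_expert]
    # 1.0 means perfectly even; num_experts means everything went to one expert.
    max_over_mean = max(fractions) * num_experts
    # Shannon entropy of the load distribution, scaled to the range [0, 1].
    entropy = -sum(f * math.log(f) for f in fractions if f > 0)
    normalized_entropy = entropy / math.log(num_experts)
    unused_fraction = sum(1 for t in tokens_per_expert if t == 0) / num_experts
    return {
        "max_over_mean": max_over_mean,
        "normalized_entropy": normalized_entropy,
        "unused_fraction": unused_fraction,
    }


# Hypothetical counts: an even layer and a fully collapsed one.
even = routing_load_stats([25, 25, 25, 25])
collapsed = routing_load_stats([100, 0, 0, 0])
```

Logging these per layer at each evaluation point, next to perplexity, makes it possible to tell a run that stayed balanced from one that reached the same perplexity with degraded routing.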

Questions researchers keep asking about DeepSeek-V3

What is auxiliary-loss-free load balancing in DeepSeek-V3?

It is a routing mechanism that maintains expert load balance by adjusting per-expert bias terms in the gating score after each training step, rather than by adding a balancing penalty to the training loss. Expert selection uses the bias-adjusted score; biases update based on observed load deviation from target, with update speed $\gamma = 0.001$ for the first 14.3T tokens and $\gamma = 0.0$ for the final 500B tokens. This removes the optimization tension between prediction quality and routing balance that auxiliary-loss methods introduce.

Why is DeepSeek-V3 being used as an MLCommons benchmark?

MLCommons added DeepSeek-V3 to MLPerf Training v6.0 because it is a publicly documented, large-scale MoE model with a reproducible reference configuration (671B/37B, 14.8T tokens, target log perplexity of 3.6). It is the first frontier-scale auxiliary-loss-free MoE in the benchmark suite, enabling cross-hardware comparison of sparse-model training infrastructure at a scale previously absent from standardized benchmarks.

How many parameters does DeepSeek-V3 have and how many are active per token?

DeepSeek-V3 has 671B total parameters and 37B activated per token — an activation ratio of approximately 5.5%. The 37B active-parameter figure determines per-token compute cost; the 671B total-parameter figure determines memory requirements for storing all expert weights.

Does DeepSeek-V3 prove MoE models are better than dense models?

No. The MLPerf Training benchmark measures time to reach a 3.6 log perplexity target for one reference setup. It does not compare MoE and dense architectures on downstream tasks, inference cost, fine-tuning behavior, or serving latency. DeepSeek-V3 demonstrates that a well-designed auxiliary-loss-free MoE can train stably and efficiently at frontier scale; it does not establish universal MoE superiority across workloads.

What does the DeepSeek-V3 benchmark measure exactly?

MLPerf Training v6.0 measures the wall-clock time for different hardware systems to train DeepSeek-V3 from initialization to a 3.6 log perplexity threshold on a standardized evaluation set. It is a training-infrastructure benchmark — it reveals which hardware stacks train sparse models most efficiently, not which model or architecture performs best on downstream tasks.

Sources and reference material

DeepSeek-V3 Technical Report (arXiv:2412.19437) — Primary paper describing the 671B/37B MoE architecture, auxiliary-loss-free routing, multi-token prediction objective, FP8 training, and the 14.8T-token training run
DeepSeek-V3 Technical Report — full HTML version — Extended HTML rendering with routing load-balancing schedule details and training stability discussion
MLCommons MLPerf Training v6.0 — DeepSeek-V3 Announcement — Official MLCommons announcement adding DeepSeek-V3 as a large-scale pretraining benchmark
MLCommons MLPerf Training Benchmark Page — Benchmark definition, task list, and quality targets including the 3.6 log perplexity target for the DeepSeek-V3 benchmark in v6.0
MLCommons Organization Page — Canonical source for MLCommons mission, benchmark suite scope, and open-standard methodology
Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts (arXiv:2408.15664v1) — The foundational paper describing the biased gating score mechanism adopted in DeepSeek-V3
