DeepSeek-V3 auxiliary-loss-free load balancing: how the router changes MoE training

DeepSeek-V3 replaces the usual router auxiliary loss with a dynamically adjusted per-expert bias term for load balancing. The change keeps the load-balancing goal while avoiding the performance degradation the paper attributes to heavy auxiliary losses, but the benefit depends on a residual sequence-wise balance term and node-limited routing, and it does not eliminate imbalance entirely.

By AxiomLogica Editorial

May 20, 2026 · 22 min read

Reviewed by Editorial

What problem DeepSeek-V3's router is trying to solve

Standard MoE training has a load-balancing problem with no clean solution. If you do nothing, the router collapses onto a small subset of experts — popular experts get richer signal, grow more capable, attract more tokens, and the remaining experts atrophy. The standard fix is an auxiliary loss term appended to the training objective that penalizes uneven expert utilization. The problem with that fix is that the auxiliary loss adds a gradient whose direction may conflict with the primary language-modeling objective. Tune the auxiliary weight too high and you damage model quality; tune it too low and balance degrades. Practitioners treat this coefficient as an awkward knob they can never quite leave alone.

DeepSeek-V3 (arXiv:2412.19437) is a 671B-parameter MoE model with 37B parameters activated per token, pre-trained on 14.8 trillion tokens. The paper explicitly frames the router design as a departure from this auxiliary-loss pattern: "Furthermore, DeepSeek-V3 pioneers an auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training objective for stronger performance." — DeepSeek-V3 Technical Report

The key move is replacing the auxiliary loss with a per-expert bias term that is adjusted dynamically during training. Load pressure is redirected into the routing scores themselves rather than into a competing gradient. This is the mechanism that most high-level explainers describe at a sentence or two and then abandon — and the section below is where this article goes further.

Bottom Line: Auxiliary-loss-free load balancing is a design trade-off, not a free win. It removes one source of gradient conflict but substitutes a control loop — the bias update — that introduces its own schedule sensitivity and requires residual sequence-level constraints to prevent collapse. The benefit is real, but it is contingent on the full system: node-limited routing, sequence-wise balance enforcement, and careful tuning of the bias update speed γ. DeepSeek-V3 Technical Report

How the auxiliary-loss-free routing scheme works at a high level

The DeepSeek-V3 router assigns each token-expert pair an affinity score, then adds a per-expert bias term to those scores before performing top-k selection. The bias term is not fixed: it is updated each training step in response to observed expert load. Overloaded experts receive a lower bias, making them less likely to be selected; underloaded experts receive a higher bias, drawing more tokens toward them. For the bulk of training, this feedback loop takes over the balancing role that a heavy auxiliary gradient would otherwise play; only a small sequence-level balance loss remains in the objective. DeepSeek-V3 Technical Report, arXiv HTML version

flowchart LR
    subgraph Router ["Router (per token, per MoE layer)"]
        T["Token hidden state h"] --> S["Affinity scores s_i = sigmoid(W_r · h)"]
        B["Per-expert bias b_i\n(updated each step)"] --> AS["Biased scores: s_i + b_i"]
        S --> AS
        AS --> TK["Top-K selection\n(K experts selected)"]
        TK --> D["Token dispatched to selected experts"]
    end

    subgraph LoadFeedback ["Load feedback loop"]
        D --> LC["Measure load: tokens routed to each expert this step"]
        LC --> BU["Bias update:\nb_i ← b_i − γ if overloaded\nb_i ← b_i + γ if underloaded"]
        BU --> B
    end

The restricted routing constraint — analogous to the device-limited routing in DeepSeek-V2 — bounds which experts each token can reach, deliberately limiting inter-node communication. As the DeepSeek-V3 Technical Report states: "DeepSeek-V3 also uses a restricted routing mechanism to limit communication costs during training."

The bias update speed γ is not static over the full run. The DeepSeek-V3 Technical Report (arXiv HTML version) states: "For auxiliary-loss-free load balancing, we set the bias update speed γ to 0.001 for the first 14.3T tokens, and to 0.0 for the remaining 500B tokens." Active balancing pressure is therefore switched off, not merely reduced, for the final stretch of training.

From router logits to biased expert scores

Before top-k gating, each expert $i$ has a raw affinity score $s_i$ computed from the token's hidden state via a learned projection. The routing decision does not use $s_i$ directly. Instead, the router selects the top-K experts based on $s_i + b_i$, where $b_i$ is the current bias for expert $i$.

Once the top-K set is determined, the actual gating weights used to combine expert outputs are computed from the unbiased scores $s_i$ alone. The bias influences which experts are selected but not how much each selected expert contributes to the output. This asymmetry is critical: the bias shifts selection probabilities without distorting the magnitude of expert contributions. VitaLab DeepSeek-V3 Technical Summary
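The select-with-bias, gate-without-bias asymmetry can be illustrated with a minimal sketch. The function name `route_token` and the toy logits are hypothetical. The sketch assumes sigmoid affinities and gates normalized over the selected experts; treat the scoring function as a parameter of the illustration, not as a reference implementation.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def route_token(logits, bias, k):
    """Pick top-k experts by biased score; weight them by unbiased score."""
    scores = [sigmoid(z) for z in logits]            # affinity s_i
    biased = [s + b for s, b in zip(scores, bias)]   # s_i + b_i, selection only
    chosen = sorted(range(len(scores)), key=lambda i: biased[i], reverse=True)[:k]
    total = sum(scores[i] for i in chosen)
    gates = {i: scores[i] / total for i in chosen}   # bias never enters the gates
    return chosen, gates

logits = [2.0, 1.0, 0.5, -1.0]                       # hypothetical router logits
no_bias = route_token(logits, [0.0, 0.0, 0.0, 0.0], k=2)
small_bias = route_token(logits, [0.0, 0.1, 0.0, 0.0], k=2)
big_penalty = route_token(logits, [-0.5, 0.0, 0.0, 0.0], k=2)
```

With the small bias the selected set and the gate weights are unchanged; with the larger penalty on expert 0 the selection moves to experts 1 and 2, and their gates are still computed from the raw scores.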

Pro Tip: The bias term preserves the load-balancing objective without injecting any auxiliary gradient into the training computation graph. The primary loss — cross-entropy over token predictions — sees only the standard router scores and gating weights. Balancing pressure operates entirely through the data-selection path, not through gradient flow. This is why removing the auxiliary loss does not mean removing load balancing.

How the per-expert bias update speed γ is used during training

The bias update rule operates at the step level. After each forward pass, the router counts how many tokens were dispatched to each expert. Experts that received more tokens than their fair share are penalized; experts that received fewer are rewarded.

The update rule takes the form:

$$ b_i \leftarrow b_i - \gamma \cdot \mathbf{1}[\text{expert } i \text{ is overloaded}] + \gamma \cdot \mathbf{1}[\text{expert } i \text{ is underloaded}] $$

where $\gamma$ is the bias update speed and overloaded/underloaded is determined by comparing each expert's token count against the expected per-expert average for that step. The update is signed and additive, so small adjustments accumulated over thousands of steps produce meaningful shifts in routing probability.
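A minimal sketch of the update rule above, in plain Python. The function name `update_bias` is hypothetical, and leaving the bias untouched when an expert sits exactly at the mean is an assumption of this sketch, since the rule as stated only covers the overloaded and underloaded cases.

```python
def update_bias(bias, load, gamma):
    """Apply one signed step of size gamma to each expert's bias."""
    mean_load = sum(load) / len(load)   # expected per-expert share for this step
    new_bias = []
    for b, n in zip(bias, load):
        if n > mean_load:               # overloaded: make selection less likely
            new_bias.append(b - gamma)
        elif n < mean_load:             # underloaded: make selection more likely
            new_bias.append(b + gamma)
        else:                           # exactly at the mean: no change (assumption)
            new_bias.append(b)
    return new_bias

# Four experts; token counts observed in one step (toy numbers).
bias = update_bias([0.0, 0.0, 0.0, 0.0], load=[10, 2, 4, 0], gamma=0.001)
```

Note that the step size does not depend on how far an expert is from the mean; only the sign of the deviation matters.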

As quoted earlier from the DeepSeek-V3 Technical Report (arXiv HTML version), γ is held at 0.001 for the first 14.3T tokens and set to 0.0 for the remaining 500B tokens.

Setting γ to 0.0 for the final 500B tokens effectively freezes the bias values at whatever equilibrium the feedback loop reached. The router's selection behavior in the final stage reflects learned, stable affinities rather than active correction. This staged schedule has practical implications: the bias mechanism must converge to a good configuration before the update is disabled, and any instability in the first phase will be locked in rather than corrected.
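The staged schedule can be sketched as a simple lookup on tokens seen. The names `gamma_for` and `step_bias` are hypothetical, and treating the switch as a hard cutoff at exactly 14.3T tokens is an assumption; the report gives the two values and the token ranges but this sketch does not claim to reproduce how the switch is implemented.

```python
def gamma_for(tokens_seen, freeze_after=14.3e12, gamma_active=0.001):
    """Return the bias update speed for the current point in training."""
    return gamma_active if tokens_seen < freeze_after else 0.0

def step_bias(bias, overloaded, gamma):
    """Move each bias down if its expert is overloaded, up otherwise."""
    return [b - gamma if over else b + gamma for b, over in zip(bias, overloaded)]

# Same load signal, applied early and late in the run.
early = step_bias([0.02, -0.01], [True, False], gamma_for(1.0e12))
late = step_bias([0.02, -0.01], [True, False], gamma_for(14.5e12))
```

Early in training the biases move by γ each step; after the freeze point the same load signal leaves them untouched, which is why any miscalibration present at the switch stays in place.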

Why this is not the same as deleting load balancing

Removing the auxiliary loss term from the training objective does not remove load balancing — it relocates it. The bias feedback loop still enforces distributional pressure on expert selection; the router still cannot ignore load signals. What changes is where that pressure lives in the computation graph: in the score space before dispatch rather than in the gradient of a secondary objective. DeepSeek-V3 Technical Report

DeepSeek-V3 also retains a small residual balance loss with coefficient α = 0.0001. The paper's purpose for this residual term is targeted: "For the balance loss, we set α to 0.0001, just to avoid extreme imbalance within any single sequence." This detail matters — it confirms that the bias mechanism alone is not sufficient to guarantee balance at all granularities.

Watch Out: Auxiliary-loss-free does not mean balance-free. DeepSeek-V3's router still applies active load correction via the bias update loop and still uses a small auxiliary loss to guard against sequence-level concentration. The claim is that heavy auxiliary losses (with large α) interfere with the language-modeling gradient; it is not that load balancing is unnecessary. Conflating the two leads to misreading the design as simply removing a constraint.

Sequence-wise balance versus batch-wise balance

Most MoE implementations measure and enforce expert load as an aggregate statistic over a batch. DeepSeek-V3 explicitly distinguishes this from balance at the sequence level — and the distinction is not cosmetic.

Dimension	Batch-wise balance	Sequence-wise balance
What is measured	Token counts per expert averaged over all sequences in a batch	Token counts per expert within a single sequence
Where it is computed	Across the full mini-batch or step	Within each individual sequence's token set
What it can miss	Within-sequence concentration: one expert dominating all tokens of a specific input	Batch-level divergence across different input types
How DeepSeek-V3 addresses it	Bias update loop operating on step-level load counts	Residual balance loss (α = 0.0001) targeting per-sequence extremes
Why it matters for routing stability	Sufficient to detect systematic overload at scale	Required to detect pathological routing on individual inputs

The bias feedback loop primarily targets batch-level statistics — it observes per-step load and adjusts accordingly. But a token sequence with unusual content distribution can cause extreme within-sequence routing skew that averages out at the batch level and passes undetected by the bias loop. This is precisely why the residual α term persists.

What gets balanced within a sequence

Within a single sequence, a token's routing decision is made by the same biased score mechanism, but the bias values reflect historical load averaged over many prior steps — not the content of the current sequence. If a sequence contains a pathological distribution of tokens that all strongly prefer the same expert, the bias term provides no within-sequence correction; it only corrects prospectively, in future steps.

The small balance loss at α = 0.0001 provides intra-sequence correction directly. Its coefficient is low enough to contribute negligible gradient magnitude to the primary learning signal, but it prevents the most extreme cases, where, say, all 4,096 tokens in a long context route to the same handful of experts in a layer.
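One common way to write such a per-sequence term is as the product, summed over experts, of how often an expert is selected and its mean normalized affinity. The sketch below follows that form as an assumption; the function name and toy scores are hypothetical, and the exact definition should be checked against the technical report before reuse.

```python
def sequence_balance_loss(scores, k, alpha):
    """scores[t][i]: affinity of token t for expert i, within ONE sequence."""
    T = len(scores)
    n = len(scores[0])
    f = [0.0] * n   # scaled selection frequency; equals 1.0 under uniform routing
    p = [0.0] * n   # mean normalized affinity per expert
    for row in scores:
        top = sorted(range(n), key=lambda i: row[i], reverse=True)[:k]
        for i in top:
            f[i] += n / (k * T)
        row_sum = sum(row)
        for i in range(n):
            p[i] += row[i] / row_sum / T
    return alpha * sum(fi * pi for fi, pi in zip(f, p))

# Toy sequences of 4 tokens over 4 experts, top-1 routing.
spread = [[0.7, 0.1, 0.1, 0.1], [0.1, 0.7, 0.1, 0.1],
          [0.1, 0.1, 0.7, 0.1], [0.1, 0.1, 0.1, 0.7]]
clumped = [[0.7, 0.1, 0.1, 0.1]] * 4   # every token prefers expert 0
spread_loss = sequence_balance_loss(spread, k=1, alpha=0.0001)
clumped_loss = sequence_balance_loss(clumped, k=1, alpha=0.0001)
```

The concentrated sequence scores 2.8 times higher than the evenly spread one in this toy case, which is the direction of pressure the residual term is meant to apply.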

Pro Tip: In sparse Transformer routing, sequence-level concentration is distinct from layer-level or model-level collapse. A router that achieves excellent batch-wise balance can still produce degenerate routing on specific input types. Monitoring expert load at the per-sequence level during ablations — not just globally — surfaces pathologies that aggregate metrics conceal. DeepSeek-V3's retention of the α term implicitly acknowledges this measurement gap.

Why batch-level statistics can hide router pathologies

Batch-wise averaging of expert utilization is the standard diagnostic metric reported in MoE training logs. An expert that handles 1/N of total tokens across a batch appears perfectly balanced. But this aggregate can mask a scenario where that expert handles zero tokens on half the inputs and twice its share on the other half — a routing regime that may signal instability or poor generalization without appearing in headline metrics.
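The scenario above can be reproduced with a few lines of Python. The helper `load_fractions` and the two toy sequences are hypothetical; the point is only that the same routing assignments look balanced at batch level and fully collapsed at sequence level.

```python
def load_fractions(assignments, num_experts):
    """Fraction of tokens routed to each expert (top-1 routing)."""
    counts = [0] * num_experts
    for expert in assignments:
        counts[expert] += 1
    return [c / len(assignments) for c in counts]

# Two sequences of 4 tokens each, routed over 2 experts.
seq_a = [0, 0, 0, 0]   # every token goes to expert 0
seq_b = [1, 1, 1, 1]   # every token goes to expert 1

batch = load_fractions(seq_a + seq_b, num_experts=2)
per_seq = [load_fractions(s, num_experts=2) for s in (seq_a, seq_b)]
worst_share = max(max(fractions) for fractions in per_seq)
```

Here `batch` reports an even split, while `worst_share` shows that within each sequence a single expert takes every token.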

The implication for interpreting auxiliary-loss-based balancing claims is direct: a model trained with a large auxiliary loss coefficient can report near-uniform batch-level expert utilization while still exhibiting within-sequence concentration or cross-sequence routing pathology.

Watch Out: When evaluating MoE routing quality — whether under auxiliary-loss or bias-based balancing — aggregate batch statistics are necessary but not sufficient evidence. A model reporting 2% coefficient of variation on expert load over a training step may still exhibit severe within-sequence expert collapse on long documents or adversarial prompts. The auxiliary-loss literature has historically reported batch-level numbers; interpreting those numbers as evidence of routing health at finer granularities is unsupported.

Node-limited routing and why locality matters

Node-limited routing (NLR) is DeepSeek-V3's mechanism for bounding the communication cost of expert parallelism. Each token is constrained to route only to experts resident on at most M = 4 nodes. According to MLCommons, DeepSeek-V3 training note (2026-05): "Node-limited routing is enforced where each token is sent to a maximum of M=4 nodes."

This constraint is architectural, not incidental. In large-scale expert parallelism, every token that routes to an expert on a remote node requires an all-to-all communication round to dispatch the hidden state and retrieve the expert output. Without locality constraints, the router is free to scatter tokens across all nodes, and communication cost scales with the number of active nodes per token. Node-limited routing caps that cost deterministically.

The structure looks like this for the expert-parallel training deployment:

flowchart TD
    subgraph "Expert parallel deployment (64 GPUs, 8 nodes)"
        N1["Node 1\nExperts 1–32"]
        N2["Node 2\nExperts 33–64"]
        N3["Node 3\nExperts 65–96"]
        N4["Node 4\nExperts 97–128"]
        N5["Node 5\nExperts 129–160"]
        N6["Node 6\nExperts 161–192"]
        N7["Node 7\nExperts 193–224"]
        N8["Node 8\nExperts 225–256"]
    end

    T["Token"] -->|"Top-K selection\nrestricted to ≤ 4 nodes"| N1
    T --> N3
    T --> N5
    T --> N7
    N2:::disabled
    N4:::disabled
    N6:::disabled
    N8:::disabled

    classDef disabled opacity:0.3

Routed experts in DeepSeek-V3's expert parallel setup are deployed uniformly across 64 GPUs spanning 8 nodes, per MLCommons. A token selecting K experts from at most 4 of those 8 nodes touches at most half the cluster per routing decision.

How node-limited routing interacts with expert parallelism

Expert parallelism places different experts on different devices. Token dispatch requires sending a hidden state to whatever device hosts the chosen expert and receiving the output back before the residual stream continues. When experts are distributed across nodes, this dispatch crosses NVLink (intra-node, fast) or InfiniBand/Ethernet (inter-node, slower). Node-limited routing ensures that at most M nodes are involved per token per layer, bounding the worst-case inter-node traffic.

The MLCommons, DeepSeek-V3 training note (2026-05) confirms: "Expert Parallelism (EP): For a single layer, routed experts are uniformly deployed across 64 GPUs (8 nodes)." The node-limit of M = 4 means the router can never select experts that span more than half the deployment.
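A minimal sketch of node-limited selection follows. It assumes experts are laid out contiguously by node and that nodes are ranked by the sum of their best K/M scores before the usual top-K runs inside the allowed nodes; that ranking rule is this sketch's assumption, not something the passage above states, and the function name and toy scores are hypothetical.

```python
def node_limited_topk(scores, experts_per_node, k, max_nodes):
    """Pick top-k experts, restricted to the max_nodes best-scoring nodes."""
    n_nodes = len(scores) // experts_per_node
    per_node = max(1, k // max_nodes)   # scores used to rank each node (assumption)
    node_score = []
    for node in range(n_nodes):
        local = sorted(scores[node * experts_per_node:(node + 1) * experts_per_node],
                       reverse=True)
        node_score.append(sum(local[:per_node]))
    allowed = sorted(range(n_nodes), key=lambda nd: node_score[nd], reverse=True)[:max_nodes]
    candidates = [i for i in range(len(scores)) if i // experts_per_node in allowed]
    chosen = sorted(candidates, key=lambda i: scores[i], reverse=True)[:k]
    return sorted(allowed), chosen

# Toy layout: 4 nodes x 2 experts, selecting k=2 experts.
scores = [0.9, 0.1,  0.6, 0.6,  0.8, 0.2,  0.1, 0.1]
free_nodes, free_pick = node_limited_topk(scores, 2, k=2, max_nodes=4)
one_node, limited_pick = node_limited_topk(scores, 2, k=2, max_nodes=1)
```

In this toy case the unconstrained pick is experts 0 and 4 on two different nodes, while the one-node limit selects experts 2 and 3 on node 1: the communication bound is met at the price of skipping the two highest-scoring experts.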

Production Note: The architecture section here explains the routing locality concept as it shapes training design. Specific bandwidth requirements, all-to-all schedules, and NCCL configuration for reproducing DeepSeek-V3's training setup depend on your exact hardware topology and are outside this article's scope. The locality constraint is a design principle; its implementation details are infrastructure-dependent.

What locality buys you and what it can cost

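Node-limited routing caps per-token fan-out: a token's hidden state is dispatched to at most M nodes per layer, however its selected experts are spread. In the DeepSeek-V3 setup with 8 nodes and M = 4, the worst case drops from as many as 8 destination nodes to 4, so inter-node dispatch volume per token is cut by up to a factor of (total nodes / M) = 2. The realized saving depends on how often an unconstrained router would actually have spread a token across more than M nodes.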

The cost is routing flexibility. An unconstrained top-K router can select the globally best-matching K experts for any given token regardless of their physical location. Node-limited routing forces the router to select the best K experts from a restricted candidate pool. If the globally optimal experts for a particular token all happen to reside on nodes outside the allowed M, those experts are unreachable.

Watch Out: Node-limited routing introduces an implicit curriculum effect during training. Tokens never see certain expert combinations simply because those combinations span too many nodes. This is a form of structural inductive bias in the routing distribution. It may help by reducing variance in communication patterns, but it reduces the theoretical expressivity of the routing function. Whether this constraint degrades model quality in practice depends on whether globally optimal routings would ever require more than M nodes — which the paper does not directly ablate.

Why DeepSeek-V3 treats auxiliary loss as a design trade-off

The standard MoE auxiliary load-balancing loss adds a term $\mathcal{L}_\text{aux}$ to the total training objective:

$$ \mathcal{L}_\text{total} = \mathcal{L}_\text{LM} + \alpha \cdot \mathcal{L}_\text{aux} $$

where $\mathcal{L}_\text{aux}$ is typically a function of the fraction of tokens routed to each expert, penalizing deviation from uniform utilization. The gradient of $\mathcal{L}_\text{aux}$ with respect to the router's weight matrix $W_r$ points toward more uniform routing. The gradient of $\mathcal{L}_\text{LM}$ with respect to $W_r$ points toward whichever routing decisions minimize prediction loss. These two gradients are not aligned by construction, and at high α they actively compete.

The interference manifests in two ways. First, updates that improve load balance may shift routing decisions away from expert assignments that minimize perplexity, degrading language-modeling performance. Second, when the language-modeling gradient drives routing toward expert specialization (assigning semantically related tokens to the same expert), the auxiliary gradient resists that specialization in proportion to α. The net result is a training signal with conflicting objectives that neither optimizes cleanly.
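To first order, the conflict can be written down directly. The following is a sketch assuming plain gradient descent on $W_r$ with learning rate $\eta$; the symbols $\eta$, $g_\text{LM}$, and $g_\text{aux}$ are introduced here for illustration and are not taken from the report.

$$ \nabla_{W_r} \mathcal{L}_\text{total} = g_\text{LM} + \alpha \, g_\text{aux}, \qquad g_\text{LM} = \nabla_{W_r} \mathcal{L}_\text{LM}, \quad g_\text{aux} = \nabla_{W_r} \mathcal{L}_\text{aux} $$

$$ \Delta \mathcal{L}_\text{LM} \approx -\eta \left( \lVert g_\text{LM} \rVert^2 + \alpha \, \langle g_\text{LM}, g_\text{aux} \rangle \right) $$

Whenever the inner product $\langle g_\text{LM}, g_\text{aux} \rangle$ is negative, the second term offsets part of the progress on the language-modeling loss, and the offset grows linearly with α.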

DeepSeek-V3 pre-trains on 14.8 trillion tokens using 2,048 NVIDIA H800 GPUs with FP8 mixed-precision training. At that scale, even modest interference from an auxiliary loss coefficient accumulates across every optimizer step of a multi-trillion-token run. The bias-based alternative avoids this by removing the heavy $\mathcal{L}_\text{aux}$ term from the gradient computation:

$$ \mathcal{L}_\text{total}^{\text{DS-V3}} \approx \mathcal{L}_\text{LM} + \alpha_\text{small} \cdot \mathcal{L}_\text{seq-balance} $$

The router weight $W_r$ receives gradients only from $\mathcal{L}_\text{LM}$ (plus the negligible sequence-balance term). Load correction operates entirely through the bias update loop, which modifies routing decisions without touching the weight gradients.

The training signal problem auxiliary losses can create

The core issue is not that auxiliary losses are wrong in principle; at low α they provide useful regularization. The problem scales with model size and training length. A 671B-parameter model trained for 14.8T tokens amplifies every gradient interaction. An auxiliary loss weight that is tolerable for a 7B model may produce persistent quality degradation at 671B because the cumulative pull on $W_r$ toward uniformity competes with the cumulative pull toward expert specialization over more steps, more layers, and more tokens.

Pro Tip: The gradient-interference argument does not mean auxiliary losses are categorically harmful — it means their cost-to-benefit ratio changes at frontier scale. For smaller MoE models with shorter training runs, auxiliary losses with moderate α are often the simpler and more predictable choice. The bias-based approach becomes relatively more attractive as training compute increases and as expert specialization becomes more valuable to the final model quality.

When the bias term is the better knob

The bias term is the preferable control when three conditions hold simultaneously: the training run is long enough that accumulated gradient interference from $\mathcal{L}_\text{aux}$ would be material, the model has enough experts that fine-grained routing specialization carries quality value, and you have the infrastructure to monitor per-expert load at step granularity to tune γ.

DeepSeek-V3's reported configuration uses γ = 0.001 as the primary balancing lever for 96.6% of training (14.3T of 14.8T tokens), with α = 0.0001 as a guardrail for sequence-level extremes. The bias term is the primary control; the loss term is a safety net.

Bottom Line: For researchers designing new MoE routers: if your training budget exceeds roughly 100B tokens and your expert count is in the hundreds or more (a rule of thumb, not a threshold the paper reports), moving load-balancing pressure from a loss term into the routing score space is likely worth the engineering cost of the bias update loop. Keep a small residual sequence-level balance loss (α = 0.0001 is the reported value) to prevent pathological within-sequence concentration that the step-level feedback loop cannot correct proactively.

What the paper's training results imply for frontier MoE design

Training property	Reported value
Total parameters	671B
Activated parameters per token	37B
Training tokens	14.8 trillion
Training hardware	2,048 NVIDIA H800 GPUs
Training precision	FP8 mixed-precision
Loss spikes requiring rollback	Zero (none reported)
Irrecoverable loss spikes	Zero (none reported)
Bias update speed γ (first 14.3T tokens)	0.001
Bias update speed γ (final 500B tokens)	0.0 (frozen)
Residual sequence-balance loss α	0.0001

The Hugging Face model card states: "Throughout the entire training process, we did not experience any irrecoverable loss spikes or perform any rollbacks." Training stability at this token count is significant evidence that the router design does not introduce the kind of optimization instability associated with conflicting gradient signals — but it is correlational, not a controlled ablation demonstrating that the router change alone produced the stability.

Why stable pretraining matters more than a single headline metric

Routing-related training instability in MoE models typically manifests as loss spikes followed by partial recovery, or as router collapse where certain experts receive near-zero tokens. These events are costly at scale: recovery requires rolling back to a checkpoint and re-running thousands of GPU-hours of compute. A clean training run with no such events is consistent with a routing mechanism that does not create numerical or optimization pathologies, although on its own it cannot isolate the router as the cause.

Watch Out: The no-rollback claim is training evidence, not a causal proof. DeepSeek-V3's stable run reflects the combined effect of its full design: the bias-based router, node-limited routing, FP8 training, MLA attention, and multi-token prediction objective. Attributing the stability specifically to the auxiliary-loss-free router requires ablations that the public technical report does not include. The routing design is a plausible contributor; it is not the demonstrated sole cause.

Which parts of the design are likely to generalize to other MoE systems

The bias-update mechanism, which replaces $\mathcal{L}_\text{aux}$ with a step-level feedback loop on routing scores, is the most transferable element. It is architecture-agnostic in principle: any top-k router can add a per-expert bias term and update it based on observed load without modifying the underlying attention or FFN design. Implementations targeting Mixtral-style routers or custom sparse transformers can adopt this approach with modest engineering overhead.
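To see the feedback loop working end to end on a generic top-k router, the toy simulation below gives one expert an artificially high affinity and lets the bias update correct it. Everything here is hypothetical: the function name, the skew, the batch size, and γ = 0.01 are chosen so the effect shows up within a few hundred steps, and they are not DeepSeek-V3 settings.

```python
import math
import random

def simulate(steps=300, tokens=512, experts=8, k=2, gamma=0.01, seed=0):
    """Run a toy router with a sign-based bias update; return load skew per step."""
    rng = random.Random(seed)
    skew = [2.0] + [0.0] * (experts - 1)   # expert 0 is artificially popular
    bias = [0.0] * experts
    history = []                           # max load / mean load, per step
    for _ in range(steps):
        load = [0] * experts
        for _ in range(tokens):
            s = [1.0 / (1.0 + math.exp(-(rng.gauss(0.0, 1.0) + skew[i])))
                 for i in range(experts)]
            top = sorted(range(experts), key=lambda i: s[i] + bias[i], reverse=True)[:k]
            for i in top:
                load[i] += 1
        mean = sum(load) / experts
        history.append(max(load) / mean)
        # Signed update: overloaded experts go down, underloaded experts go up.
        bias = [b - gamma if n > mean else b + gamma if n < mean else b
                for b, n in zip(bias, load)]
    return history, bias

history, final_bias = simulate()
before, after = history[0], history[-1]
```

In this setup the ratio of the busiest expert's load to the mean falls from its initial skewed value toward 1, and the popular expert ends with the lowest bias. No gradient is involved at any point, which is what makes the mechanism easy to bolt onto an existing router.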

The node-limited routing constraint is more system-specific. M = 4 over 8 nodes reflects DeepSeek-V3's specific hardware topology and expert count. A system with different node count, different expert parallelism degree, or different inter-node bandwidth characteristics will need to re-derive the appropriate locality constraint — the concept transfers but the parameters do not.

The γ schedule (active early, frozen late) reflects an assumption about convergence timing that depends on training length and data distribution. Shorter runs may need a different annealing profile; different data curricula may shift when the routing stabilizes.

Pro Tip: When adapting auxiliary-loss-free routing to a different MoE system, adopt the mechanism (bias update loop) but treat every numerical parameter — γ, α, M, and the γ-freeze schedule — as hyperparameters requiring re-tuning for your specific scale, topology, and token budget. The DeepSeek-V3 values are not transferable constants; they are the result of that team's specific experimental configuration.

Open questions and limitations in auxiliary-loss-free routing

The DeepSeek-V3 design leaves several questions unresolved that matter for researchers building on this approach.

Watch Out: Four specific limitations remain poorly characterized by the published results. First, whether sequence-wise balance enforcement at α = 0.0001 is sufficient across all input distributions (particularly very long documents, code, or multilingual content) is not ablated. The threshold is empirically chosen for the DeepSeek-V3 data mixture and may not generalize. Second, the γ-freeze schedule (setting γ = 0.0 for the final 500B tokens) assumes the bias values have converged to a stable configuration by 14.3T tokens; whether this holds for models trained on different corpora or with different expert counts is unknown. Third, routing collapse avoidance under auxiliary-loss-free routing at scales beyond 671B is untested, so the mechanism's behavior in the extrapolative regime is not characterized. Fourth, the interaction between node-limited routing and auxiliary-loss-free balancing has not been disentangled: the two mechanisms together produce the reported stability, but their individual contributions to routing quality are not published separately. Researchers adopting one without the other should expect different behavior than the paper reports. DeepSeek-V3 Technical Report

Questions researchers keep asking about DeepSeek-V3 routing

What is auxiliary-loss-free load balancing in DeepSeek-V3? It is a router design that replaces the standard auxiliary load-balancing loss term with a per-expert bias added to routing scores before top-k selection. The bias values are updated each training step in response to observed expert utilization — overloaded experts receive lower bias, underloaded experts receive higher bias — creating a feedback loop that enforces load balance without any gradient contribution from a secondary objective.

How does the DeepSeek-V3 router work? For each token at each MoE layer, the router computes affinity scores $s_i$ between the token's hidden state and each expert. A per-expert bias $b_i$ is added to produce selection scores $s_i + b_i$, and the top-K experts by biased score are selected. Gating weights applied to expert outputs use the unbiased $s_i$ values, so the bias influences selection but not contribution magnitude. Node-limited routing restricts the candidate set for that selection to experts on a bounded number of nodes, based on their physical placement.

Why remove the auxiliary loss in MoE training? At frontier scale and training length, the auxiliary loss gradient competes with the language-modeling gradient for influence over the router's weight matrix. Accumulation of this conflict over trillions of training tokens can degrade model quality. Moving load-balancing pressure into the routing score space, via the bias update loop, decouples balance enforcement from gradient flow. That removes most of the interference (a small sequence-level balance loss remains) while preserving the balancing effect.

What is node-limited routing in DeepSeek-V3? Node-limited routing constrains each token to route only to experts resident on at most M = 4 nodes out of the 8-node, 64-GPU deployment. This bounds inter-node all-to-all dispatch to at most 4 destination nodes per token per layer, capping the communication cost that would otherwise grow with unconstrained expert selection across the full cluster.

Does auxiliary-loss-free load balancing eliminate expert imbalance completely? No. The bias update loop enforces batch-level balance over training steps, but within-sequence concentration remains possible. DeepSeek-V3 retains a small residual balance loss at α = 0.0001 specifically to prevent extreme within-sequence expert concentration that the bias loop cannot correct proactively. The design reduces imbalance and removes gradient interference; it does not guarantee perfect balance at all granularities.

What is the bias update speed γ and how is it scheduled? γ controls the magnitude of each per-expert bias adjustment per step. DeepSeek-V3 uses γ = 0.001 for the first 14.3T training tokens, then sets γ = 0.0 for the final 500B tokens, freezing the bias values at their converged state for the end of training.

Sources and primary references

DeepSeek-V3 Technical Report (arXiv:2412.19437) — Primary source for all routing math, model scale, and training configuration
DeepSeek-V3 Technical Report, arXiv HTML version — Source for detailed bias update schedule, α and γ values, and node-limited routing description
DeepSeek-V3-Base Hugging Face Model Card — Source for training stability claim (no irrecoverable loss spikes, no rollbacks)
DeepSeek-V3 GitHub Repository — Official implementation reference
MLCommons — DeepSeek-V3 training note (May 2026) — Source for expert parallelism topology (64 GPUs, 8 nodes) and node-limited routing M = 4
VitaLab DeepSeek-V3 Technical Summary — Secondary summary source for bias-term framing and routing architecture description

