Why MoE++ exists for sparse expert routing
Bottom Line: Standard Mixture-of-Experts models force every token through an FFN expert, creating two compounding problems — expert load imbalance (a few experts absorb most tokens while others sit idle) and the All-to-All communication cost of dispatching tokens across GPUs in an expert-parallel setup. MoE++ addresses both by introducing zero-computation experts that can short-circuit the FFN dispatch entirely, keeping tokens local and reducing cross-GPU traffic. The reported payoff is 1.1–2.1× expert forward throughput over a vanilla MoE baseline of the same size. That gain is contingent on FFN compute being the actual bottleneck — if All-to-All communication already dominates your profile, MoE routing refinements alone won't rescue you.
MoE++ emerged from a straightforward observation: in a standard sparse MoE layer, every selected token must travel to an FFN expert, complete the full feed-forward computation, and return — even for tokens where a simpler operation (zeroing the output, copying the input unchanged, or substituting a fixed vector) would be sufficient. The paper's stated goal is to "simultaneously enhance the effectiveness and efficiency of Mixture-of-Experts methods" by integrating standard FFN experts with zero-computation alternatives inside a single, heterogeneous routing framework.
Expert-parallel MoE systems compound this inefficiency at scale. An arXiv paper on shortcut-connected expert parallelism characterizes the All-to-All communication pattern as "a significant bottleneck limiting MoE efficiency." When tokens must be routed across nodes — through NCCL collectives over NVLink or RDMA fabric — the per-expert-dispatch latency accumulates in proportion to the number of active FFN experts. MoE++'s architectural answer is to give the router an additional category of outcome: routing to a zero-computation expert that requires no dispatch and no FFN forward pass. MoE++ is presented as a general and heterogeneous MoE framework, and the paper received an oral presentation at ICLR 2025.
How the MoE++ routing stack is organized
The MoE++ architecture places the router at the center of the token path: each hidden state is sent either to FFN experts on the compute branch or to zero-computation experts on the zero-cost branch. This is not a sequential pipeline where zero-cost experts follow FFN experts — it is a single router decision that determines which branch a token enters.
```mermaid
flowchart LR
    T["Token hidden state h_l"] --> R["Router\n(gating scores + gating residual from l-1)"]
    R -->|"top-k FFN score wins"| FFN["FFN Expert(s)\n(distributed, compute-heavy)"]
    R -->|"zero-cost score wins"| ZC["Zero-Computation Expert\n(zero / copy / constant)"]
    FFN --> AGG["Weighted aggregation"]
    ZC --> AGG
    AGG --> OUT["Output hidden state h_l+1"]
    OUT -->|"routing pathway signal"| GR["Gating Residual\n(carried to layer l+1 router)"]
```
The diagram above shows the token path from router to FFN experts and the zero-cost branch in one view. The router simultaneously scores all candidate experts — both FFN and zero-computation types. A token's top-k selection can include any mix of these types, including zero-computation experts exclusively if their scores dominate. The key architectural guarantee is that zero-computation experts are "deployed on every GPU" because their parameter count is negligible, which means a token routed to a zero-cost expert never triggers a cross-GPU dispatch.
Gating residuals thread through this diagram at the bottom: the routing pathway signal from layer $l$ flows forward into the router at layer $l+1$, stabilizing expert selection across depth. This is the second major structural difference from vanilla MoE, and the two mechanisms — zero-computation experts and gating residuals — are designed to work together.
Router output, top-k selection, and where the zero-cost branch enters
Standard MoE routing computes a softmax or sigmoid over $N$ expert logits and selects the top-$k$ candidates, sending each token to those FFN experts with a gating weight proportional to the score. MoE++ extends the candidate set: the router scores both the $N$ FFN experts and the zero-computation experts in a single pass. The zero-cost branch enters at exactly the same decision point as the FFN branch — it is not a fallback or overflow mechanism applied after FFN dispatch.
This is the distinction that most existing summaries flatten. The SkyworkAI/MoE-plus-plus repository states it directly: "We introduce three types of zero-computation experts: the zero expert, copy expert, and constant expert, which correspond to discard, skip, and replace operations, respectively." All three are valid top-k outcomes from the same router.
Pro Tip: Zero, copy, and constant experts are routing outcomes, not additional FFN layers appended to the architecture. Adding them to the candidate set does not change the number of active FFN experts per token — it changes the probability that a given token reaches an FFN at all. A token that routes entirely to zero-computation experts performs zero FFN forward passes for that layer.
The routing score function itself follows the same learned linear projection the baseline MoE uses; what changes is that the learned weight matrix now has columns corresponding to zero-cost expert slots in addition to FFN expert slots. Depending on the token content and the model's learned routing preferences, any combination of FFN and zero-cost experts can appear in the top-k result.
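To make the single-pass scoring concrete, the sketch below is a minimal PyTorch-style router whose candidate set spans both FFN and zero-computation slots. The class name, dimensions, and the way zero-cost slots are identified are illustrative assumptions, not the SkyworkAI implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HeterogeneousRouter(nn.Module):
    """Illustrative sketch: one linear projection scores FFN experts and
    zero-computation experts together, then top-k selects across both."""

    def __init__(self, d_model: int, n_ffn_experts: int, n_zero_experts: int, k: int = 2):
        super().__init__()
        self.k = k
        self.n_ffn = n_ffn_experts
        # One weight matrix with columns for FFN slots and zero-cost slots.
        self.gate = nn.Linear(d_model, n_ffn_experts + n_zero_experts, bias=False)

    def forward(self, h: torch.Tensor):
        logits = self.gate(h)                        # [tokens, n_ffn + n_zero]
        scores = F.softmax(logits, dim=-1)
        topk_scores, topk_idx = scores.topk(self.k, dim=-1)
        # Indices >= n_ffn point at zero-computation experts: no FFN dispatch.
        is_zero_cost = topk_idx >= self.n_ffn
        return topk_scores, topk_idx, is_zero_cost

router = HeterogeneousRouter(d_model=512, n_ffn_experts=8, n_zero_experts=3)
scores, idx, zero_mask = router(torch.randn(4, 512))  # 4 tokens
print(idx, zero_mask)  # any mix of FFN and zero-cost slots can appear
```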
Zero expert, copy expert, and constant expert
The three zero-computation expert types have distinct semantics and must not be collapsed into a generic "skip the FFN" label. Each represents a different intervention on the token's hidden state.
Zero expert (discard): The token's contribution through this expert is zeroed out. If a token selects only zero experts, its output from the MoE layer is zero. In practice, the weighted aggregation across selected experts means a partial zero-expert selection attenuates the token's MoE output proportionally to the gating weight assigned to the zero expert. This is effectively a learned dropout over the MoE computation path.
Copy expert (skip): The token's input hidden state $h$ is passed through unchanged as the expert's output. This is the identity operation — the MoE layer neither transforms nor suppresses the token's representation. Tokens that route primarily to the copy expert are essentially bypassing the MoE transformation while still participating in the routing decision, preserving their pre-layer state rather than introducing new representational content.
Constant expert (replace): The expert outputs a fixed, learned vector regardless of the input token. This is conceptually a bias injection at the MoE layer — the token's hidden state is replaced (or partially mixed, weighted by the gating score) with a constant representation. The constant vector is a trained parameter, so it captures a dataset-level or layer-level signal rather than token-specific content.
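A minimal sketch of the three operations, assuming a PyTorch-style module interface; the class names and the toy weighted aggregation at the end are illustrative, not the repository's code.

```python
import torch
import torch.nn as nn

class ZeroExpert(nn.Module):
    """Discard: the token's contribution through this expert becomes zeros."""
    def forward(self, h):
        return torch.zeros_like(h)

class CopyExpert(nn.Module):
    """Skip: identity, the input hidden state passes through unchanged."""
    def forward(self, h):
        return h

class ConstantExpert(nn.Module):
    """Replace: a fixed learned vector is emitted regardless of the input."""
    def __init__(self, d_model: int):
        super().__init__()
        self.v = nn.Parameter(torch.zeros(d_model))
    def forward(self, h):
        return self.v.expand_as(h)

# Weighted aggregation over one token's selected experts (illustrative):
d_model = 8
h = torch.randn(1, d_model)
experts = [CopyExpert(), ConstantExpert(d_model)]
gates = torch.tensor([0.7, 0.3])
out = sum(g * e(h) for g, e in zip(gates, experts))
```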
Watch Out: Discard, skip, and replace are three distinct operations with materially different effects on the token's representational trajectory. Discard suppresses the MoE output (useful when the MoE layer should contribute nothing for this token). Skip preserves the pre-layer state through the MoE path (useful when the token's representation is already well-formed). Replace injects a fixed signal (useful when a layer-level constant correction improves downstream task routing). Flattening all three into "zero-computation expert" loses the ability to reason about which token types benefit from which operation and will misrepresent how the model uses each branch during inference.
Why local zero-computation experts matter in deployment
Zero-computation experts eliminate cross-GPU dispatch for any token that routes to them. In expert-parallel MoE serving, the standard flow requires the dispatcher to send each token to the GPU holding its selected FFN expert, execute the forward pass remotely, and return the result — an All-to-All pattern that grows expensive as the number of expert-parallel ranks increases.
Production Note: The MoE++ paper explicitly frames zero-computation experts as "deployment friendly" because their negligible parameter count makes it practical to replicate them on every GPU in the expert-parallel group. When a token routes to a zero-cost expert, no dispatch occurs — the operation executes locally. This directly reduces the volume of expert-dispatch traffic handled by NCCL collectives over NVLink or RDMA links, targeting the "significant bottleneck" that All-to-All communication represents in distributed MoE. The communication reduction is proportional to the fraction of tokens routed to zero-cost experts in each layer; it is not a blanket elimination of All-to-All traffic, since FFN-routed tokens still require dispatch.
The practical implication for expert-parallelism topology: zero-cost expert locality allows the serving system to treat a fraction of tokens as "local compute" in every layer, decoupling their latency from the inter-node communication path entirely.
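A back-of-envelope estimate makes the dispatch-volume effect concrete. All numbers below are illustrative assumptions, not figures from the paper.

```python
# Per-layer All-to-All dispatch payload with and without locally executed
# zero-computation experts (dispatch direction only; the combine/return
# traffic roughly doubles both figures).
tokens_per_batch = 8192
hidden_size = 4096
bytes_per_elem = 2            # bf16 activations (assumed)
top_k = 2
zero_cost_fraction = 0.30     # assumed share of top-k slots hitting zero-cost experts

full_dispatch = tokens_per_batch * top_k * hidden_size * bytes_per_elem
reduced_dispatch = full_dispatch * (1 - zero_cost_fraction)
print(f"{full_dispatch / 2**20:.0f} MiB -> {reduced_dispatch / 2**20:.0f} MiB per layer")
```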
Expert load imbalance also benefits. Because zero-cost experts can absorb tokens that would otherwise overload hot FFN experts, the effective load on the FFN expert pool is redistributed. The router learns which tokens are best served by FFN experts and which can be handled locally, reducing the token-concentration problem that auxiliary load-balancing losses were previously the only tool to address.
Gating residuals and routing stability across layers
Gating residuals feed the routing decision at layer $l+1$ with information about which pathway the token followed at layer $l$. The mechanism targets routing-score variance: the SkyworkAI/MoE-plus-plus README states that gating residuals "effectively establish connections between different MoE++ layers and reduce the variance of routing scores" while they "do not change the mean and range of values of the routing scores." Reducing variance without shifting the mean or range is the stability property this design relies on.
The conceptual formulation: let $g_l$ denote the gating score vector for a token at layer $l$ (over all expert candidates). The gating residual carries a transformed version of $g_l$ forward, mixing it into the raw router logits at layer $l+1$ before the top-k selection:
$$\hat{g}_{l+1} = \text{Router}(h_{l+1}) + \alpha \cdot \phi(g_l)$$
where $\text{Router}(h_{l+1})$ is the standard learned routing score from the current hidden state, $\phi(g_l)$ is a projection of the previous layer's gating scores (the pathway signal), and $\alpha$ is a mixing coefficient. The final gating scores at layer $l+1$ are then normalized and passed to top-k selection. This formulation is a conceptual summary reflecting the paper's described mechanism; the ICLR 2025 oral abstract confirms the design as "enabling each token to consider the pathway taken in the previous layer when selecting the appropriate experts."
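Expressed as code, the mixing step looks roughly like the sketch below. The function name, the default for `alpha`, and treating $\phi$ as an optional projection are assumptions made for illustration; the paper's exact parameterization may differ.

```python
import torch
import torch.nn.functional as F

def gating_with_residual(router_logits, prev_gates, alpha=1.0, phi=None):
    """Mix the previous layer's gating scores (the pathway signal) into the
    current layer's raw router logits, then normalize before top-k selection."""
    pathway = prev_gates if phi is None else phi(prev_gates)
    mixed_logits = router_logits + alpha * pathway
    return F.softmax(mixed_logits, dim=-1)

# Toy usage: 4 tokens, 11 candidates (8 FFN + 3 zero-cost experts, assumed)
logits_l1 = torch.randn(4, 11)     # Router(h_{l+1})
gates_l0 = torch.rand(4, 11)       # previous layer's gating scores g_l
gates_l1 = gating_with_residual(logits_l1, gates_l0, alpha=0.5)
```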
The key property that makes this useful: the routing decision at each layer is not made in isolation. A token that followed the copy-expert path at layer $l$ carries that signal into layer $l+1$'s routing, increasing the probability of routing continuity where continuity is beneficial. Similarly, a token that passed through a specific FFN expert at layer $l$ carries evidence of its computational history into the next routing decision.
What signal is carried forward from the previous layer
The signal propagated by gating residuals is the token's routing pathway — specifically, which expert branch it followed and with what gating weight — not the token's hidden-state value. This is the critical distinction from ordinary transformer residual connections.
Standard residual connections add the pre-layer hidden state $h$ to the post-layer output, preserving gradient flow through depth. Gating residuals operate on the router's output space, not the token's representation space. The previous layer's routing scores (or a function of them) inform the current layer's router logits without altering the hidden state that feeds into the FFN or attention computation.
Pro Tip: Gating residuals are routing-state carryover, not token-representation carryover. A vanilla residual connection says "preserve what the token looked like before this layer." A gating residual says "remember where this token went in the previous layer when deciding where to send it now." The two mechanisms operate on different spaces and serve different functions — conflating them leads to incorrect intuitions about what happens when gating residuals are ablated.
The practical effect is that routing decisions become temporally correlated across layers. A token establishes a routing identity over depth rather than making an independent expert selection at each layer from scratch. This is especially relevant for zero-computation experts: a token that routed to the copy expert at layer $l$ signals to the layer $l+1$ router that its representation may still be in a state suited to bypass, without forcing that decision through a hard rule.
Why gating residuals reduce score variance and routing jitter
Routing jitter — layer-to-layer oscillation in expert selection for the same token — is a known failure mode in deep sparse models. When each layer's router operates independently on a hidden state that changes with each FFN or attention transformation, a token can rapidly alternate between expert assignments, degrading the model's ability to develop coherent layer-wise specialization.
Gating residuals dampen this oscillation by injecting prior-layer routing signal into the current logits, biasing the score distribution toward continuity. The authoritatively verified effect, per the SkyworkAI README, is reduced routing-score variance with unchanged mean and range — the distribution of scores stays in the same numeric regime (no scale shift), but becomes tighter around its center. Practically, this means top-k selection is more stable across layers without collapsing to deterministic routing.
Watch Out: Lower routing-score variance is not unconditionally beneficial. If the mixing coefficient $\alpha$ is too large or the pathway signal dominates the current-layer router logits, the system can develop routing inertia — tokens become locked into the same expert assignment across many layers regardless of how their hidden state evolves. This limits the model's ability to adapt routing to changing token semantics through depth. The MoE++ design preserves mean and range precisely to avoid this failure mode, but the risk scales with $\alpha$ and the depth of the model. In very deep stacks, monitor per-layer expert utilization distributions independently to detect inertia before it degrades perplexity on semantically diverse inputs.
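One way to implement that monitoring is to track per-layer expert utilization and its entropy from the router's top-k indices. The helpers below are an illustrative sketch, not part of MoE++ itself.

```python
import torch

def expert_utilization(topk_idx: torch.Tensor, n_experts: int) -> torch.Tensor:
    """Fraction of routed top-k slots landing on each expert in one layer."""
    counts = torch.bincount(topk_idx.flatten(), minlength=n_experts).float()
    return counts / counts.sum()

def routing_entropy(util: torch.Tensor) -> float:
    """Shannon entropy of the utilization distribution. Low values flag
    concentration; inertia shows up as low entropy that also stops changing
    from layer to layer."""
    p = util.clamp_min(1e-12)
    return float(-(p * p.log()).sum())

# Synthetic example: per-layer top-k index tensors from a routing trace.
layers_topk = [torch.randint(0, 11, (8192, 2)) for _ in range(4)]
for layer, idx in enumerate(layers_topk):
    util = expert_utilization(idx, n_experts=11)
    print(f"layer {layer}: entropy={routing_entropy(util):.3f}, max_load={util.max():.3f}")
```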
A secondary benefit flows from the interaction between gating residuals and zero-computation experts. Because zero-cost expert selections are themselves part of the pathway signal carried forward, a token that has been consistently routed to copy or constant experts at earlier layers will carry that signal as evidence that bypass behavior is appropriate. This makes the zero-computation expert branch self-reinforcing where appropriate, rather than requiring each layer's router to rediscover the bypass independently.
Where MoE++ changes the compute and communication profile
MoE++ changes two distinct parts of the cost profile simultaneously: it reduces the average FFN compute per token (by routing some tokens to zero-cost experts), and it reduces the All-to-All dispatch volume (by keeping zero-cost-routed tokens local). The net effect on expert forward throughput is benchmark-dependent and varies with how much of the baseline cost comes from FFN compute versus dispatch overhead.
| Configuration | Expert Compute Cost | All-to-All Dispatch Volume | Load Balance | Throughput Implication |
|---|---|---|---|---|
| Vanilla MoE (top-k FFN only) | Full FFN per token per layer | All tokens dispatched | Expert imbalance common | Baseline (1×) |
| MoE++ with distributed FFN + zero-cost local | Reduced (fraction bypasses FFN) | Reduced (zero-cost tokens local) | Improved (zero-cost absorbs overflow) | 1.1–2.1× (paper reported) |
| MoE++ with communication-dominant topology | Reduced FFN compute | Dispatch overhead still dominates | Marginally improved | Gains smaller than reported |
The reported expert forward throughput improvement of 1.1–2.1× versus a vanilla MoE model of the same size captures the regime where FFN computation and dispatch overhead both contribute meaningfully to the bottleneck, and where zero-cost experts are replicated locally. The upper end of this range corresponds to configurations where a substantial fraction of tokens route to zero-computation experts, eliminating both their FFN cost and their dispatch cost in a single routing decision.
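A toy cost model makes the dependence on the zero-cost routing fraction visible. The time shares below are assumptions chosen purely for illustration; the 1.1–2.1× figure is the paper's measured range, not an output of this formula.

```python
# Rough model: baseline expert-forward time split into FFN compute, All-to-All
# dispatch, and overhead that routing cannot remove. Zero-cost routing removes
# both the compute and the dispatch for the affected fraction of tokens.
def expert_forward_speedup(zero_frac, ffn_share=0.55, dispatch_share=0.35, fixed_share=0.10):
    remaining = fixed_share + (ffn_share + dispatch_share) * (1.0 - zero_frac)
    return 1.0 / remaining

for frac in (0.1, 0.3, 0.5):
    print(f"zero-cost fraction {frac:.0%}: ~{expert_forward_speedup(frac):.2f}x")
```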
When zero-computation experts improve throughput
Zero-computation experts improve expert forward throughput under two joint conditions: FFN computation is a meaningful fraction of the per-layer cost (not drowned out by other overheads), and a sufficient fraction of tokens route to zero-cost experts to materially reduce the dispatch volume.
Production Note: The 1.1–2.1× expert forward throughput figure is paper-reported benchmark evidence from the MoE++ authors, not a guarantee across Mixtral- or DeepSeek-V3-style production stacks. Expert forward throughput measures the FFN compute and dispatch path specifically — it is not end-to-end application latency or tokens-per-second at the serving API level. Treat this number as a directional signal for configurations where FFN experts are the bottleneck and zero-cost experts are deployed locally on every GPU. If your system's dominant cost is attention computation, KV cache management, or inter-node communication for reasons unrelated to expert dispatch, the throughput gain will be smaller.
When the design stops helping
MoE++ stops providing meaningful gains when the system's dominant cost lies outside the FFN expert compute and dispatch path. Two specific failure modes apply.
First, if All-to-All communication already dominates the latency profile — because the expert-parallel group spans slow inter-node links, or because the FFN experts themselves are small relative to the dispatch overhead — reducing FFN compute per token does not shift the bottleneck. The communication bottleneck identified in the expert-parallelism literature requires hardware-level interventions (NVLink topology, IBGDA, topology-aware scheduling) to address; routing changes that reduce dispatch volume are necessary but not sufficient when the remaining dispatch volume still saturates the interconnect.
Watch Out: Do not apply MoE++'s throughput claims to every sparse model. Uniform token behavior — where all tokens tend to route to the same FFN experts regardless of content — means the zero-cost branch is rarely activated, eliminating its efficiency benefit. Architectures with naturally high routing entropy benefit most from the additional routing diversity MoE++ introduces. If your routing entropy is already low (visible as high expert concentration in router utilization logs), MoE++'s additional routing choices may not engage meaningfully, and the complexity cost of the heterogeneous routing stack yields no return.
Second, the zero-cost expert benefit vanishes if zero-cost experts are not replicated locally. Any deployment that distributes zero-cost experts across GPUs like standard FFN experts reintroduces dispatch traffic for zero-cost-routed tokens, negating the communication savings entirely.
What MoE++ means for Mixtral- and DeepSeek-style production stacks
Infrastructure engineers running Mixtral or DeepSeek-V3 MoE serving pipelines face the same structural constraints MoE++ targets: expert imbalance under dynamic load, All-to-All dispatch overhead at scale, and the difficulty of debugging sparse routing pathways. MoE++ is not a drop-in optimization for existing checkpoints — it requires training with the heterogeneous routing framework from the start. But its architectural choices map directly to the pain points production MoE stacks encounter.
Decision Matrix: MoE++ adoption fit
| Scenario | Criteria | Decision |
|---|---|---|
| New sparse MoE training | FFN expert compute is a confirmed bottleneck; zero-cost experts can be replicated locally; routing observability is available | Choose MoE++ |
| Existing Mixtral or DeepSeek-V3 deployment | You want to reuse pretrained weights without retraining the router and expert stack | Consider with caution |
| Communication-dominant serving stack | All-to-All is already the limiting factor and topology-aware optimizations are not exhausted | Defer |
| Narrow homogeneous token distribution | Routing entropy is low and zero-cost branches are unlikely to activate often | Avoid |
Expert parallelism remains the scaling axis for large MoE deployments, and the 2025–2026 systems literature — including NVIDIA's Hybrid-EP work targeting "near hardware-limit bandwidth" — treats All-to-All traffic as the central scaling constraint. MoE++ addresses one component of that constraint at the routing level; the hardware-level component requires NVLink, Quantum InfiniBand, Spectrum-X Ethernet, and RDMA-aware transport separately.
Which pain points MoE++ addresses first
MoE++ targets three MoE-specific pain points in order of directness: expert load imbalance, cross-GPU dispatch volume, and routing instability across depth.
Load imbalance is the first-order target. By giving the router a zero-cost alternative for tokens that would otherwise overload hot FFN experts, the model can learn to redirect overflow tokens to local zero-cost experts rather than forcing them through congested FFN slots. This reduces the auxiliary load-balancing loss's burden — the zero-cost branch provides a natural pressure valve that auxiliary losses alone cannot create.
Dispatch volume is the second-order target, addressed by the local-replication property of zero-cost experts described above.
Routing instability is addressed third, through gating residuals — reducing layer-to-layer routing jitter for tokens that have established a consistent expert pathway.
Pro Tip: All three of these pain points are consequences of MoE routing mechanics, not of model size or parameter count. Adding dense parameters to an MoE model does not resolve expert imbalance or dispatch overhead; it relocates them. MoE++ targets the routing layer directly, which is why its gains are reported as expert forward throughput rather than overall parameter efficiency. When profiling whether MoE++ is relevant to your system, measure routing entropy, per-expert load distribution, and dispatch-to-compute ratios — not aggregate FLOPs or model size.
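A hedged sketch of a dispatch-to-compute ratio estimate built from routing statistics is below; every constant (link bandwidth, GPU throughput, per-expert FLOP count) is a placeholder to be replaced with values profiled on your own stack.

```python
# Back-of-envelope dispatch-to-compute ratio for the expert-forward path.
def dispatch_to_compute_ratio(tokens, top_k, d_model, d_ff,
                              link_gbps=400.0, gpu_tflops=300.0,
                              bytes_per_elem=2, ffn_fraction=1.0):
    """ffn_fraction: share of top-k slots that hit remote FFN experts
    (1.0 for vanilla MoE; lower when zero-cost experts keep tokens local)."""
    dispatched = tokens * top_k * ffn_fraction
    comm_seconds = dispatched * d_model * bytes_per_elem / (link_gbps * 1e9 / 8)
    flops = dispatched * 2 * 2 * d_model * d_ff        # two matmuls, multiply-add
    compute_seconds = flops / (gpu_tflops * 1e12)
    return comm_seconds / compute_seconds

# Ratios well above 1 suggest dispatch, not FFN compute, dominates the path.
print(dispatch_to_compute_ratio(tokens=8192, top_k=2, d_model=4096, d_ff=14336))
```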
What to verify before treating the paper as a deployment blueprint
Four assumptions underlie MoE++'s deployment story, and all four must hold simultaneously for the paper's efficiency claims to translate to production.
Production Note: First, verify that zero-computation experts are replicated on every GPU in your expert-parallel group — distributed placement reintroduces the dispatch traffic the design eliminates. Second, confirm that FFN expert compute is a measured bottleneck in your profiling, not assumed. Third, ensure your network topology supports the communication savings: local zero-cost expert execution only avoids inter-GPU traffic if the topology places GPUs in groups that share NVLink or equivalent high-bandwidth local fabric. Fourth, validate that the gating-residual mixing does not cause routing inertia in your model depth and token distribution — monitor per-layer routing entropy during training and check that expert utilization distributions evolve across layers rather than collapsing to a fixed pattern. The ICLR 2025 oral paper establishes these conditions implicitly; production deployment makes them explicit requirements.
The paper's description of zero-cost experts as having "negligible parameters" means local replication does not impose significant memory pressure — a constant vector per expert per layer is trivial relative to FFN parameter weight. But synchronization semantics for the constant expert's learned vector must be verified in your training and serving framework; it is a trained parameter and must be consistently updated and served across replicas.
FAQ on MoE++ routing and gating residuals
What are zero-computation experts?
Zero-computation experts are routing outcomes in MoE++ that require no FFN forward pass. There are three types: the zero expert (outputs zeros, discarding the token's MoE contribution), the copy expert (passes the token's input hidden state through unchanged, skipping transformation), and the constant expert (replaces the token's MoE output with a fixed learned vector). All three are scored by the same router that scores FFN experts, and any can appear in a token's top-k selection. Because they require no FFN computation and carry negligible parameters, they can be replicated on every GPU in the expert-parallel group, eliminating cross-GPU dispatch for tokens that route to them.
How do gating residuals work in MoE++?
Gating residuals carry the routing pathway signal from one MoE layer to the next. At each layer, the token's routing scores from the previous layer are mixed into the current layer's router logits before top-k selection. This allows the router to account for which expert branch the token followed previously — not just what the token's current hidden state looks like. The verified effect is reduced routing-score variance with unchanged mean and range, producing more stable expert selection across depth without forcing deterministic routing.
Why does MoE++ reduce communication overhead?
Standard expert-parallel MoE dispatches every token to its selected FFN expert via All-to-All communication, regardless of whether that token actually needs complex transformation. MoE++ reduces dispatch volume by routing some tokens to zero-computation experts that live locally on every GPU. Tokens handled by zero-cost experts never enter the All-to-All dispatch path — their "computation" completes locally in constant time. This reduces the total volume of tokens crossing inter-GPU links, directly addressing the communication bottleneck that expert parallelism creates at scale.
What is the difference between MoE and MoE++?
Vanilla MoE routes every token to one or more FFN experts, with no option for the router to select a zero-computation outcome. MoE++ extends the candidate set to include zero, copy, and constant experts alongside standard FFN experts, making it a heterogeneous routing framework. Additionally, MoE++ introduces gating residuals — cross-layer routing state carryover — which vanilla MoE lacks. The net effect is that tokens can dynamically engage different numbers of FFN experts per layer (including zero), and routing decisions are informed by layer history rather than each layer operating in isolation.
Sources and references
Production Note: The claims in this article are sourced from the MoE++ arXiv paper (canonical paper claims and reported benchmarks), the ICLR 2025 oral page (abstract-level mechanism summary and deployment-friendly framing), and the SkyworkAI GitHub repository (implementation-oriented wording including the three expert types and gating-residual behavior descriptions). Distinguish paper-claimed behavior — particularly the 1.1–2.1× throughput figure — from what may appear in downstream reimplementations; the repository code may diverge from the paper's exact configuration over time.
- MoE++: Accelerating Mixture-of-Experts Methods with Zero-Computation Experts (arXiv:2410.07348v1) — Primary source; all mechanism descriptions and throughput claims originate here
- MoE++ ICLR 2025 Oral Presentation — Abstract-level summary, deployment-friendly framing, and gating-residual description
- ICLR 2025 Proceedings: MoE++ paper — Conference proceedings page confirming "general and heterogeneous MoE framework" framing
- SkyworkAI/MoE-plus-plus GitHub Repository — Implementation-level wording for zero expert types and gating-residual variance properties
- Shortcut-Connected Expert Parallelism for MoE (arXiv:2404.05019) — Source for All-to-All communication bottleneck characterization in expert-parallel systems
- NVIDIA: Optimizing Communication for MoE Training with Hybrid Expert Parallelism — Hardware-level context for NVLink, IBGDA, and bandwidth-aware expert parallelism
- NVIDIA: Scaling Large MoE Models with Wide Expert Parallelism on NVL72 — Expert parallelism definition and deployment context
Keywords: MoE++, Mixture-of-Experts (MoE), Switch Transformer, Mixtral, DeepSeek-V3, SkyworkAI/MoE-plus-plus, Zero expert, Copy expert, Constant expert, gating residuals, expert parallelism, NVLink, NCCL, H100, OpenReview