Expert parallelism + load balancing
A 671B MoE doesn’t fit on a single GPU. The experts have to be distributed across the cluster, with each GPU holding a SUBSET of the experts. At inference time, each token’s top-k expert assignments may land on any of those GPUs, so the system has to GATHER tokens by their routing decision and SCATTER outputs back. This pattern is expert parallelism, and the gather/scatter is implemented as an all-to-all collective, one of the most expensive inter-GPU communication patterns because every GPU exchanges data with every other GPU. Worse, the router can degenerate during training: positive feedback (popular experts get more tokens → train better → become more popular) leads to a few experts dominating while most go unused. This section covers expert parallelism, the load-balancing auxiliary loss that prevents collapse, and DeepSeek V3’s auxiliary-loss-free balancing.
Expert parallelism — sharding the experts across GPUs
For a 256-expert MoE on 8 GPUs, the natural shard is 32 experts per GPU. Each token in a batch:
- Runs forward through attention (which is either replicated on every GPU or sharded with tensor parallelism).
- Computes router gate logits (which is fast and replicated).
- Picks top-k experts (could be on any GPU).
- Sends its activation to the GPUs holding its assigned experts. (All-to-all communication.)
- Each GPU runs its local experts on the tokens routed to them.
- Each GPU sends the expert outputs back to the token’s original GPU. (Second all-to-all.)
- The token combines the k outputs and continues to the next layer.
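The dispatch step in the list above can be sketched in C. This is a minimal illustration, not the chapter's kernel: it assumes experts are sharded contiguously (GPU g owns experts 32g to 32g + 31 in the 256-on-8 example), and the names expert_owner and plan_dispatch are hypothetical. It only computes how many activations the local GPU would send to each peer, which is the size information an all-to-all needs.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed layout: experts are sharded contiguously, so with
 * n_experts = 256 and n_gpus = 8, GPU g owns experts [32g, 32g + 31]. */
static int expert_owner(int expert, int n_experts, int n_gpus) {
    assert(n_gpus > 0 && n_experts % n_gpus == 0);
    assert(expert >= 0 && expert < n_experts);
    return expert / (n_experts / n_gpus);
}

/* Count how many activations the local GPU must send to each peer for one
 * MoE layer. topk holds n_tokens * k expert ids (row-major, one row per
 * token). send_counts must have room for n_gpus entries. One entry is
 * counted per (token, expert) pair; a real system might send a token only
 * once to a GPU that holds several of its experts. */
static void plan_dispatch(const int *topk, size_t n_tokens, int k,
                          int n_experts, int n_gpus, int *send_counts) {
    for (int g = 0; g < n_gpus; g++) send_counts[g] = 0;
    for (size_t t = 0; t < n_tokens; t++)
        for (int j = 0; j < k; j++) {
            int e = topk[t * (size_t)k + j];
            send_counts[expert_owner(e, n_experts, n_gpus)]++;
        }
}
```

For three tokens with top-2 picks (0, 33), (255, 32), (31, 64), the counts for GPUs 0, 1, 2 and 7 come out as 2, 2, 1 and 1. The return trip (the second all-to-all) uses the same counts in the opposite direction.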
Expert parallelism is what makes MoE at 100B+ total params tractable. The all-to-all overhead is real but manageable: at roughly 1-10 GB/s of aggregate traffic it is well within modern interconnect bandwidth (NVLink5: 1.8 TB/s per GPU; InfiniBand HDR: 25 GB/s per link, NDR: 50 GB/s per link).
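A rough way to sanity-check the traffic figure is to count bytes directly. The sketch below assumes every (token, expert) pair crosses the interconnect and that each MoE layer performs two all-to-alls. The function names and the example parameters (hidden size 4096, 2-byte activations, 32 MoE layers, 1000 tokens/s) are illustrative assumptions, not measurements of any particular model.

```c
#include <assert.h>

/* Bytes moved by ONE all-to-all (dispatch or combine) for one MoE layer,
 * assuming every (token, expert) pair crosses the interconnect. */
static double all_to_all_bytes(double n_tokens, int k, int hidden_dim,
                               int bytes_per_elem) {
    assert(n_tokens >= 0 && k > 0 && hidden_dim > 0 && bytes_per_elem > 0);
    return n_tokens * k * (double)hidden_dim * bytes_per_elem;
}

/* Aggregate traffic per second for a whole model: two all-to-alls
 * (dispatch + combine) per MoE layer, at a given token throughput. */
static double moe_traffic_bytes_per_sec(double tokens_per_sec, int k,
                                        int hidden_dim, int bytes_per_elem,
                                        int n_moe_layers) {
    assert(n_moe_layers > 0);
    return 2.0 * n_moe_layers *
           all_to_all_bytes(tokens_per_sec, k, hidden_dim, bytes_per_elem);
}
```

With the assumed parameters and k = 8, moe_traffic_bytes_per_sec(1000, 8, 4096, 2, 32) evaluates to 4,194,304,000 bytes/s, about 4.2 GB/s, which sits inside the 1-10 GB/s range quoted above.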
The downside: communication-vs-compute overlap is harder for MoE than for dense models. You can’t start computing the expert until tokens arrive; you can’t return results until compute finishes. This puts MoE inference latency at a slight disadvantage vs same-active-params dense models for small batch sizes.
Expert collapse — the canonical failure
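The MoE router is a small network whose only training signal is which experts produce good outputs. Without intervention, this signal develops a positive feedback loop: an expert that starts slightly ahead receives more tokens, gets more gradient updates, produces better outputs, and therefore attracts still more tokens, while the remaining experts starve.
Expert collapse is the most common MoE failure mode. It is easy to miss from the loss curve alone: training looks fine and loss goes down, but a probe of expert utilization shows an unmistakable imbalance, such as ~95% of tokens going to 2-3 experts out of 8.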
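A utilisation probe of the kind described above can be very small. The following sketch assumes you have already counted how many tokens each expert received over some window; the function name top_m_share and the 1024-expert limit are arbitrary choices for illustration.

```c
#include <assert.h>

/* Fraction of routed tokens handled by the m busiest experts.
 * counts[i] = number of tokens routed to expert i in the probe window. */
static double top_m_share(const long *counts, int n_experts, int m) {
    assert(n_experts > 0 && n_experts <= 1024 && m >= 0 && m <= n_experts);
    char used[1024] = {0};
    long total = 0, top = 0;
    for (int i = 0; i < n_experts; i++) total += counts[i];
    if (total == 0) return 0.0;
    for (int r = 0; r < m; r++) {
        /* Pick the busiest expert that has not been counted yet. */
        int best = -1;
        for (int i = 0; i < n_experts; i++)
            if (!used[i] && (best < 0 || counts[i] > counts[best])) best = i;
        used[best] = 1;
        top += counts[best];
    }
    return (double)top / (double)total;
}
```

For counts {500, 10, 430, 5, 20, 15, 10, 10} the top-3 share is 0.95, the collapse pattern described above; for a perfectly even split over 8 experts it is 0.375.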
The auxiliary load-balancing loss
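Shazeer et al. (2017) introduced the standard fix: add an auxiliary loss that penalises imbalanced expert utilization. The f · p form analysed in this section is the variant later popularised by the Switch Transformer (Fedus et al., 2021).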
The kernel for this chapter implements top-k routing with the aux loss in action:
            /* ... similar tokens here). This is what reinforcement does. */
            float lr = 0.001f;
            for (int j = 0; j < D; j++)
                W_no_aux[j * N_EXP + e] += lr * tokens_X[t][j] * w[kk];
        }
    }
}
/* Train the aux router: same thing PLUS the load-balancing aux gradient. */
for (int epoch = 0; epoch < 3; epoch++) {
    /* Per-epoch frac_tokens (how often each expert was top-1), a proxy. */
    float frac[N_EXP] = {0};
    float mean_prob[N_EXP] = {0};
    for (int t = 0; t < N_TOKENS; t++) {
        float gate[N_EXP];
        for (int e = 0; e < N_EXP; e++) {
            float s = 0;
            for (int j = 0; j < D; j++) s += tokens_X[t][j] * W_aux[j * N_EXP + e];
            gate[e] = s;
        }
        float probs[N_EXP]; memcpy(probs, gate, sizeof(probs));
        /* ... */
The aux loss is small (multiplied by α ≈ 0.01) but enough to keep the router balanced. Without it, the demo router shows the typical “10/16 experts effectively used” pattern of expert collapse.
Setup:
f_i = fraction of tokens routed to expert i (sums to k if top-k routing).
p_i = mean gate probability for expert i (sums to 1 across all i).
L_aux = N · Σ_i f_i · p_i, where N is the number of experts.
Perfectly uniform load (all experts equally utilised):
f_i = k / N for all i.
p_i = 1 / N for all i.
L_aux = N · Σ (k/N) · (1/N) = N · N · (k/N²) = k.
For top-1 (k=1): L_aux = 1 (minimum).
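Perfectly concentrated (top-1, all tokens to expert 0):
f_0 = 1, f_i = 0 for i > 0.
p_0 ≈ 1, p_i ≈ 0 for i > 0.
L_aux ≈ N · 1 · 1 = N.
For top-1, N=8: L_aux = 8 (maximum, 8× the minimum).
For k > 1 each token picks k distinct experts, so f_i ≤ 1. The most concentrated case then has f_i = 1 on the same k experts, which together hold nearly all the probability mass, and L_aux ≈ N again (N/k times the minimum of k).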
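The two extremes can be checked numerically. The sketch below computes L_aux = N · Σ_i f_i · p_i from a batch of hard top-k assignments and soft gate probabilities. It is an illustration under stated assumptions, not the chapter's kernel: the name aux_loss, the row-major array layouts, and the 1024-expert limit are all hypothetical, and probabilities are taken as inputs rather than computed from logits.

```c
#include <assert.h>

/* L_aux = N * sum_i f_i * p_i, with
 *   f_i = (number of top-k slots assigned to expert i) / n_tokens
 *   p_i = mean gate probability of expert i over the batch.
 * topk:  n_tokens * k expert ids, row-major.
 * probs: n_tokens * n_experts softmax outputs, row-major. */
static double aux_loss(const int *topk, const float *probs,
                       int n_tokens, int k, int n_experts) {
    assert(n_tokens > 0 && k > 0 && n_experts > 0 && n_experts <= 1024);
    double f[1024] = {0}, p[1024] = {0};
    for (int t = 0; t < n_tokens; t++) {
        for (int j = 0; j < k; j++) {
            int e = topk[t * k + j];
            assert(e >= 0 && e < n_experts);
            f[e] += 1.0 / n_tokens;          /* hard load */
        }
        for (int e = 0; e < n_experts; e++)  /* soft preference */
            p[e] += probs[t * n_experts + e] / (double)n_tokens;
    }
    double loss = 0.0;
    for (int e = 0; e < n_experts; e++) loss += f[e] * p[e];
    return n_experts * loss;
}
```

With N = 4 experts and top-1 routing, a batch that spreads four tokens evenly with uniform probabilities gives L_aux = 1, and a batch that sends every token to expert 0 with probability 1 gives L_aux = 4, matching the minimum and maximum worked out above for k = 1.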
Why this works:
L_aux multiplies the “physically-assigned” load (f, hard top-k counts) by the “router’s preference” (p, soft probabilities). When both align (expert i gets many tokens AND has high probability), the product is large — penalised. When they diverge or both are uniform, the product is small.
Why both f AND p:
If we only used p (e.g., L_aux = Σ p_i²), we’d penalise the router’s softmax output but not its actual routing decisions. The top-k is non-differentiable; using f directly as a loss can’t backprop. The product f · p creates a differentiable signal: f provides the load measurement; p provides the differentiable handle the router can adjust.
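To make the "differentiable handle" concrete, here is the gradient written out, as a sketch under two assumptions that are not spelled out in the text: f_i is treated as a constant (no gradient flows through the top-k), and p_i is the batch mean of a per-token softmax p_{t,i} over router logits z_{t,j}, with T tokens in the batch. The symbols T, z and p_{t,i} are introduced here for illustration.

```latex
\begin{align*}
\frac{\partial L_{\text{aux}}}{\partial p_i} &= N f_i
  \qquad \text{($f_i$ held constant)} \\
\frac{\partial L_{\text{aux}}}{\partial z_{t,j}}
  &= \frac{N}{T} \sum_i f_i \, p_{t,i} \left( \delta_{ij} - p_{t,j} \right)
   = \frac{N}{T} \, p_{t,j} \Bigl( f_j - \sum_i f_i \, p_{t,i} \Bigr)
\end{align*}
```

Read this way, the logit of expert j is pushed down for token t exactly when its measured load f_j exceeds the probability-weighted average load that token currently sees, and pushed up otherwise. The load f decides the direction; the probabilities p carry the gradient.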
Practical α:
The total loss is L = L_LM + α · L_aux. α ≈ 0.001-0.01 in most papers. Too low: collapse. Too high: router becomes uniformly random, destroying specialisation. The sweet spot is usually empirically tuned per architecture.
The trade-off — aux loss hurts specialisation
The aux loss does what it says, but at a cost: it forces the router to spread tokens “fairly” even when fairness conflicts with specialisation. A token that genuinely best fits expert 5 might get routed to expert 9 because expert 5 is “over-utilised.”
This trade-off is the source of every MoE training paper’s tuning struggle. Too low α: collapse. Too high α: balanced but bland routing, no specialisation, MoE behaves like a noisy dense FFN with k/N of the capacity.
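Suppose a utilisation probe shows 4 of 16 experts carrying 80% of the tokens. That observation is ambiguous: it can mean either collapse or healthy specialisation, and telling the two apart takes more evidence.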
Signs it’s COLLAPSE (bad):
- The 4 dominant experts are the SAME for every layer (e.g., always experts 0, 3, 7, 11). This suggests a routing artifact, not specialisation.
- The 12 unused experts have near-zero gate probabilities for all tokens — they’re not just “rarely picked”, they’re “completely ignored.”
- The 4 dominant experts’ parameter norms are growing while the unused experts’ norms stay near initialisation.
- Token attribution: when you mask out the unused 12 experts entirely, perplexity barely changes. The model effectively isn’t using them.
- The aux loss is decreasing slowly or not at all, even though it should be the strongest “balance” signal.
Signs it’s SPECIALISATION (good):
- Different layers’ dominant experts are DIFFERENT (layer 1 uses experts 2, 5, 8; layer 2 uses experts 0, 3, 11; etc.). This suggests each layer found a different specialisation pattern.
- The 12 less-used experts STILL have meaningful gate probabilities (5-10% on some tokens) — they’re handling rare patterns, not abandoned.
- Masking out a less-used expert HURTS perplexity slightly, which shows it is contributing to specific token types.
- Token-type analysis: the dominant experts handle “common” patterns (function words, frequent subjects); the rare experts handle specific patterns (rare topics, code, math).
Additional signals:
- Routing entropy. Compute H(routing) = E_token [H(p_token)]. Healthy MoE has medium entropy (the router makes specific but not extreme choices). Collapse has very low entropy (always picks the same).
- Expert coverage per token type. Group tokens by type (for example, by vocabulary id or domain) and ask what fraction of (token type, expert) pairs occur at least once, versus the theoretical maximum (N_types × N_experts). Counting raw (token, expert) pairs is uninformative, because top-k routing always produces exactly N_tokens × k of them. Healthy MoE has high coverage (most pairs occur at least once). Collapse has low coverage.
- Expert ablation. Drop each expert one at a time and measure perplexity. Healthy MoE shows roughly uniform perplexity increases. Collapse shows large increases for the dominant experts and essentially none for the unused ones.
The fix if it’s collapse:
- Increase aux loss α (e.g., 0.001 → 0.01). Re-train.
- Add capacity-factor capping during training: if expert i is over-utilised in a batch, hard-stop adding tokens to it (forcing the next-best expert).
- Re-initialise the under-used experts and continue training.
- Switch to a different balancing approach: DeepSeek V3’s loss-free bias updates (next section) or sequence-level balancing.
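Capacity-factor capping from the list above can be sketched as a greedy pass over the batch. The version below is a simplified illustration for top-1 routing, not a production dispatcher: it assumes each token's experts are already sorted best-first, and the names route_with_capacity, ranked and assigned are hypothetical.

```c
#include <assert.h>

/* Assign each token to its best-ranked expert that still has room.
 * ranked:   n_tokens * n_experts expert ids, each row sorted best-first.
 * capacity: max tokens per expert, e.g. ceil(cf * n_tokens / n_experts)
 *           for a capacity factor cf.
 * assigned: output, one expert id per token, or -1 if every expert is full.
 * Returns the number of tokens that did not get their first choice. */
static int route_with_capacity(const int *ranked, int n_tokens, int n_experts,
                               int capacity, int *assigned) {
    assert(n_tokens >= 0 && n_experts > 0 && n_experts <= 1024 && capacity >= 0);
    int load[1024] = {0};
    int diverted = 0;
    for (int t = 0; t < n_tokens; t++) {
        assigned[t] = -1;
        for (int r = 0; r < n_experts; r++) {
            int e = ranked[t * n_experts + r];
            if (load[e] < capacity) {   /* first expert with room wins */
                load[e]++;
                assigned[t] = e;
                break;
            }
        }
        if (assigned[t] != ranked[t * n_experts]) diverted++;
    }
    return diverted;
}
```

With four tokens that all prefer expert 0, two experts, and a capacity of 2, the first two tokens keep expert 0 and the last two are forced onto expert 1. Processing order decides who gets diverted, which is one reason real systems often shuffle or prioritise tokens before applying the cap.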
Real frontier MoE training pipelines instrument all these signals; “collapse vs specialisation” diagnostics are a daily part of MoE development.
DeepSeek V3 — auxiliary-loss-free load balancing
DeepSeek V3 (2024) introduced a notable refinement: balance the load without an explicit auxiliary loss. The motivation was empirical — aux loss forces “fairness” that conflicts with the router’s quality signal, often hurting downstream evaluation by 0.1-0.3 perplexity. Eliminating it while preserving balance saves quality.
The DeepSeek approach decouples “what the router prefers” (learned via the LM loss) from “how to prevent collapse” (a heuristic outside the gradient). The router can learn to specialise without being penalised for the resulting imbalance; the bias term separately prevents pathological concentration.
The conflict the standard aux loss creates:
The LM loss L_LM wants the router to pick experts that produce good token predictions — possibly heavily favoring a few experts.
The aux loss L_aux wants the router to balance load — possibly forcing it to pick experts that are worse for the specific token.
The router’s gradient is ∂(L_LM + α · L_aux) / ∂ router_params. The two terms PULL IN OPPOSITE DIRECTIONS for over-utilised experts, so the router must compromise between quality and balance throughout training.
Result: router specialisation is partially suppressed. The MoE benefit is reduced (the experts are less specialised than they could be).
The DeepSeek fix:
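Add a per-expert bias b_i to the routing scores. The bias is used only when selecting the top-k experts; the gating weights that scale each expert’s output still come from the unbiased scores. Update b heuristically (no gradient) by a fixed step ε after each batch: bump b_i down if expert i was over-utilised, up if under-utilised. The bias provides the balancing signal OUTSIDE the gradient.
Now ∂L_LM / ∂ router_params is essentially the only gradient the router sees, so the router optimises for quality. Balance is enforced by the (non-trainable) bias correction. DeepSeek V3 does keep a sequence-level balance loss with a very small weight as a safeguard against extreme imbalance within a single sequence.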
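The bias mechanism can be sketched in a few lines. This is a minimal top-1 illustration of the idea as described above, not DeepSeek's implementation: select_expert, update_bias and the step size eps (the ε of the text) are names chosen here, and "over-utilised" is taken to mean "above the mean load for the batch", which is an assumption.

```c
#include <assert.h>

/* Pick the top-1 expert by (score + bias). The bias only influences which
 * expert is selected; it is not a trained parameter. */
static int select_expert(const float *scores, const float *bias, int n_experts) {
    assert(n_experts > 0);
    int best = 0;
    for (int e = 1; e < n_experts; e++)
        if (scores[e] + bias[e] > scores[best] + bias[best]) best = e;
    return best;
}

/* After a batch: nudge each bias by a fixed step eps, down for experts above
 * the mean load and up for experts below it. No gradient is involved. */
static void update_bias(float *bias, const int *load, int n_experts, float eps) {
    assert(n_experts > 0 && eps >= 0.0f);
    long total = 0;
    for (int e = 0; e < n_experts; e++) total += load[e];
    for (int e = 0; e < n_experts; e++) {
        /* Compare load[e] with total / n_experts without dividing. */
        long scaled = (long)load[e] * n_experts;
        if (scaled > total) bias[e] -= eps;
        else if (scaled < total) bias[e] += eps;
    }
}
```

With scores {1.0, 0.9} and zero bias, expert 0 wins. After one batch in which expert 0 took all 10 tokens and eps = 0.06, the biases become {-0.06, +0.06} and the same scores now select expert 1. The router's weights are untouched by this correction, which is the point of the design.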
Why this works in principle:
The router learns the “true” quality preference (which experts are best for which tokens). The bias acts as a “force field” that pushes the router away from collapse without distorting its quality signal. At convergence, the bias stops growing (the system has reached a balanced fixed point where preference and balance agree).
What could go wrong:
- Bias update too slow: if ε is too small, the bias can’t catch up to growing imbalances during training. Collapse happens before balance kicks in.
- Bias update too aggressive: if ε is too large, the bias oscillates between over and under correction, never settling. Routing becomes unstable.
- Bias provides no signal to the router: the router can’t learn that “expert 5 is over-utilised” because the bias intercepts that signal. The router might keep trying to route to expert 5 because its weights are biased toward it, only to be redirected by the bias term. This can hurt training dynamics.
- Inference distribution shift: at inference, the bias is frozen. If inference data has a different routing distribution than training (different domain, different language mix), the frozen bias might be miscalibrated, leading to balance issues.
- Hyperparameter sensitivity: the bias-update rule has tunable knobs (ε, update frequency, smoothing) that need empirical tuning per architecture. Less robust than aux loss for a given hyperparameter budget.
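Empirically: DeepSeek V3 reports better downstream quality and similar load-balance vs the aux-loss baseline. Adoption elsewhere is less clear-cut from public reports: Qwen 3, for example, documents a global-batch load-balancing loss rather than a bias update, and Llama 4’s balancing recipe has not been described in detail.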
The deeper insight: in MoE, the right design philosophy is “let the router learn quality; enforce constraints via non-learnable mechanisms.” This is a recurring pattern in ML — separating “what the network learns” from “what the system constrains” often yields better empirical results than rolling both into the gradient.
END OF CH.17 — Mixture of Experts.
§1 (routing + sparse activation: Mixtral 8x7B = 47B total / 13B active) ·
§2 (capacity vs compute asymmetry: DeepSeek V3 671B / A37B at 18× sparsity, bandwidth bottleneck) ·
§3 (expert parallelism + load balancing: aux loss, DeepSeek’s loss-free balancing).
Next: Ch.18 — Alignment. SFT, RLHF, the closed-form DPO derivation, and modern simplifications (GRPO, RLAIF). Turning a next-token predictor into a usable assistant.