Expert parallelism + load balancing

Section 17.3

A 671B MoE doesn’t fit on a single GPU. The experts have to be distributed across the cluster, with each GPU holding a SUBSET of the experts. At inference time, each token’s top-k expert assignments may land on any of those GPUs, so the system has to GATHER tokens by their routing decision and SCATTER outputs back. This pattern is expert parallelism, and the gather/scatter is implemented as an all-to-all collective, one of the most expensive types of inter-GPU communication. Worse, the router can degenerate during training: positive feedback (popular experts get more tokens → train better → become more popular) leads to a few experts dominating while most go unused. This section covers expert parallelism, the load-balancing auxiliary loss that prevents collapse, and DeepSeek V3’s auxiliary-loss-free balancing.

Expert parallelism — sharding the experts across GPUs

For a 256-expert MoE on 8 GPUs, the natural shard is 32 experts per GPU. Each token in a batch:

Runs forward through attention (which is either replicated on every GPU, i.e. data parallel, or sharded with tensor parallelism).
Computes router gate logits (which is fast and replicated).
Picks top-k experts (could be on any GPU).
Sends its activation to the GPUs holding its assigned experts. (All-to-all communication.)
Each GPU runs its local experts on the tokens routed to them.
Each GPU sends the expert outputs back to the token’s original GPU. (Second all-to-all.)
The token combines the k outputs and continues to the next layer.
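The dispatch step in the list above can be sketched as a counting pass that works out how many activations go to each GPU in the first all-to-all. This is a minimal sketch, not the chapter's kernel: the function names `expert_to_gpu` and `count_dispatch` are hypothetical, and the contiguous sharding (GPU g holds experts g·E to (g+1)·E − 1) is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Map a global expert id to the GPU that hosts it, ASSUMING experts are
   sharded contiguously: GPU g holds experts [g*E, (g+1)*E). */
int expert_to_gpu(int expert, int experts_per_gpu) {
    assert(expert >= 0 && experts_per_gpu > 0);
    return expert / experts_per_gpu;
}

/* Count how many (token, expert) activations must be sent to each
   destination GPU in the first all-to-all.
   topk_ids:    flat [n_tokens * k] array of chosen expert ids
   send_counts: output, one entry per GPU */
void count_dispatch(const int *topk_ids, size_t n_tokens, int k,
                    int experts_per_gpu, int n_gpus, int *send_counts) {
    for (int g = 0; g < n_gpus; g++) send_counts[g] = 0;
    for (size_t t = 0; t < n_tokens; t++) {
        for (int j = 0; j < k; j++) {
            int g = expert_to_gpu(topk_ids[t * (size_t)k + j], experts_per_gpu);
            assert(g < n_gpus);
            send_counts[g]++;   /* one activation copy per selected expert */
        }
    }
}
```

With 256 experts on 8 GPUs (32 per GPU), a token that picks experts 0 and 33 contributes one send to GPU 0 and one to GPU 1. The counts are what an all-to-all needs up front to size its send and receive buffers.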

Expert parallelism communication per layer per token:

  1st all-to-all: send activation (d elements) to each of k expert hosts
  2nd all-to-all: receive output (d elements) from each of k expert hosts

  Total bytes transferred per token per layer: 2 · k · d · sizeof(dtype)

For DeepSeek V3 (k=8, d=7168, bf16 = 2 bytes):

  2 · 8 · 7168 · 2 = 229,376 bytes ≈ 230 KB per token per layer.
  × 58 MoE layers (61 layers, the first 3 dense) ≈ 13 MB per token total.

At a throughput of 1000 tokens per second: ~13 GB/s of all-to-all traffic. Across the network, this is heavy but tractable with NVLink/IB.
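The arithmetic above is easy to parameterise. The helpers below are a small sketch (the names `ep_bytes_per_token_layer` and `ep_bytes_per_token` are hypothetical) that reproduces the 2 · k · d · sizeof(dtype) formula so other model shapes can be plugged in.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes moved per token per MoE layer: one activation of d elements sent to
   each of k expert hosts, and one returned from each (two all-to-alls). */
uint64_t ep_bytes_per_token_layer(uint64_t k, uint64_t d, uint64_t dtype_bytes) {
    assert(k > 0 && d > 0 && dtype_bytes > 0);
    return 2u * k * d * dtype_bytes;
}

/* Same quantity summed over all MoE layers of the model. */
uint64_t ep_bytes_per_token(uint64_t k, uint64_t d, uint64_t dtype_bytes,
                            uint64_t n_moe_layers) {
    return ep_bytes_per_token_layer(k, d, dtype_bytes) * n_moe_layers;
}
```

For the DeepSeek V3 numbers in the text, `ep_bytes_per_token_layer(8, 7168, 2)` gives 229,376 bytes. This is an upper bound: it assumes every selected expert lives on a remote GPU, whereas assignments that land on the token's own GPU need no network transfer.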

Expert parallelism is what makes MoE at 100B+ total params tractable. It is a parallelism strategy specific to MoE models: experts are sharded across GPUs (at large scale, typically one or a few experts per GPU). At inference and training time, tokens are routed via all-to-all communication to the GPU holding their assigned experts; expert outputs are returned via a second all-to-all. The communication pattern is fundamentally different from tensor parallelism (all-reduces) or pipeline parallelism (peer-to-peer): all-to-all is bandwidth-symmetric and latency-sensitive. The all-to-all overhead is real but manageable; at roughly 10 GB/s aggregate it is well within modern interconnect bandwidth (NVLink5: 1.8 TB/s per GPU; InfiniBand NDR: 50 GB/s per link).

The downside: communication-vs-compute overlap is harder for MoE than for dense models. You can’t start computing the expert until tokens arrive; you can’t return results until compute finishes. This puts MoE inference latency at a slight disadvantage vs same-active-params dense models for small batch sizes.

Expert collapse — the canonical failure

The MoE router is a small network whose only training signal comes from “which experts produce good outputs?” Without intervention, this signal develops a positive feedback loop:

The collapse positive-feedback loop:

Initialise: router slightly favors expert A (random init artifact).

Training step 1:
  - Tokens get routed to A more often than to others.
  - A's parameters get more gradient updates.
  - A becomes "better" at its routed tokens (higher gate prob for them).

Training step 2:
  - Even more tokens routed to A (now genuinely best for many tokens).
  - A gets even more updates, becomes even better.

...

Training step T:
  - Almost ALL tokens routed to A.
  - Experts B, C, ..., N receive ZERO gradient and never improve.
  - Effective capacity = 1 expert's worth, not N's worth.
  - Model fails to leverage MoE at all.

Expert collapse is the most common MoE failure mode: a training failure where the router converges to using only a small subset of experts (often 1-2 out of N), starving the others of gradient signal. It is caused by positive feedback: experts that get more tokens train better, making them better choices for more tokens. Without explicit intervention (a load-balancing auxiliary loss or bias updates), this is the default outcome of training an MoE, and the capacity benefit disappears entirely. It is not subtle once measured: training looks fine and loss goes down, but a probe of expert utilization shows ~95% of tokens going to 2-3 experts out of 8.

The auxiliary load-balancing loss

Shazeer et al. (2017) introduced the standard fix: add an auxiliary loss that penalises imbalanced expert utilization. The f · p form used here follows the later GShard and Switch Transformer formulation.

Standard MoE auxiliary loss (Switch / GShard style):

Let:
  f_i = fraction of tokens routed to expert i (in this batch)
  p_i = mean gate probability assigned to expert i (in this batch)

  L_aux = N_experts · Σ_i (f_i · p_i)

Properties:
  - Σ_i f_i = k (each token picks k experts, so assignments per token = k)
  - Σ_i p_i = 1 (softmax probabilities sum to 1)
  - L_aux = k when both f and p are uniform across experts (= 1 for top-1 routing). This is the balanced target.
  - L_aux ≈ N_experts · k when both concentrate on one expert.

Total training loss:
  L = L_LM + α · L_aux, where α is typically 0.001-0.01

The aux loss gradient pushes the router AWAY from concentration: when expert i is over-utilised (high f_i, high p_i), L_aux pushes its gate logit DOWN; when under-utilised, it pushes it UP. One interpretation: α · L_aux acts like a Lagrangian penalty term for "expert load ≤ uniform" constraints.
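As a concrete reference for the formula above, here is a minimal sketch that computes L_aux for one batch from the router probabilities and the hard top-k choices. The function name `moe_aux_loss` and the flat array layouts are assumptions made for this example; it is not taken from moe_router.c.

```c
#include <assert.h>
#include <stddef.h>

/* Switch/GShard-style auxiliary loss for one batch.
   probs:    [n_tokens * n_experts] router softmax outputs (rows sum to 1)
   topk_ids: [n_tokens * k] expert ids chosen for each token
   Returns   N * sum_i f_i * p_i, where
             f_i = (assignments to expert i) / n_tokens   (sums to k)
             p_i = mean over tokens of probs[t][i]        (sums to 1) */
float moe_aux_loss(const float *probs, const int *topk_ids,
                   size_t n_tokens, int n_experts, int k) {
    float loss = 0.0f;
    assert(n_tokens > 0 && n_experts > 0 && k > 0);
    for (int i = 0; i < n_experts; i++) {
        float count = 0.0f, prob_sum = 0.0f;
        for (size_t t = 0; t < n_tokens; t++) {
            prob_sum += probs[t * (size_t)n_experts + i];
            for (int j = 0; j < k; j++)
                if (topk_ids[t * (size_t)k + j] == i) count += 1.0f;
        }
        loss += (count / (float)n_tokens) * (prob_sum / (float)n_tokens);
    }
    return (float)n_experts * loss;
}
```

With 4 experts and top-1 routing, a batch that spreads tokens and probabilities evenly evaluates to 1, and a batch that sends every token to expert 0 with probability 1 evaluates to 4, which matches the k and N · k values derived from the formula.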

The kernel for this chapter implements top-k routing with the aux loss in action:

moe_router.c (C): top-k routing + aux load-balancing, aux-gradient excerpt

                   similar tokens here). This is what reinforcement does. */
                float lr = 0.001f;
                for (int j = 0; j < D; j++)
                    W_no_aux[j * N_EXP + e] += lr * tokens_X[t][j] * w[kk];
            }
        }
    }

    /* Train the aux router — same thing PLUS the load-balancing aux gradient. */
    for (int epoch = 0; epoch < 3; epoch++) {
        /* Per-epoch frac_tokens (how often each expert was top-1) — proxy. */
        float frac[N_EXP] = {0};
        float mean_prob[N_EXP] = {0};
        for (int t = 0; t < N_TOKENS; t++) {
            float gate[N_EXP];
            for (int e = 0; e < N_EXP; e++) {
                float s = 0;
                for (int j = 0; j < D; j++) s += tokens_X[t][j] * W_aux[j * N_EXP + e];
                gate[e] = s;
            }
            float probs[N_EXP]; memcpy(probs, gate, sizeof(probs));

The aux loss is small (multiplied by α ≈ 0.01) but enough to keep the router balanced. Without it, the demo router shows the typical “10/16 experts effectively used” pattern of expert collapse.

— think, then check —

Setup:

f_i = fraction of tokens routed to expert i (sums to k if top-k routing).

p_i = mean gate probability for expert i (sums to 1 across all i).

Perfectly uniform load (all experts equally utilised):

f_i = k / N for all i.

p_i = 1 / N for all i.

L_aux = N · Σ (k/N) · (1/N) = N · N · (k/N²) = k.

For top-1 (k=1): L_aux = 1 (minimum).

Perfectly concentrated (all tokens to expert 0):

f_0 = k, f_i = 0 for i > 0.

p_0 ≈ 1, p_i ≈ 0 for i > 0.

L_aux ≈ N · k · 1 = N · k.

For top-1, N=8: L_aux = 8 (maximum, 8× the minimum).

Why this works:

L_aux multiplies the “physically-assigned” load (f, hard top-k counts) by the “router’s preference” (p, soft probabilities). When both align (expert i gets many tokens AND has high probability), the product is large — penalised. When they diverge or both are uniform, the product is small.

Why both f AND p:

If we only used p (e.g., L_aux = Σ p_i²), we’d penalise the router’s softmax output but not its actual routing decisions. The top-k is non-differentiable; using f directly as a loss can’t backprop. The product f · p creates a differentiable signal: f provides the load measurement; p provides the differentiable handle the router can adjust.
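To make the "differentiable handle" concrete: treating f as a constant and p_i as the batch mean of the softmax outputs, the chain rule through the softmax gives ∂L_aux/∂z_j = (N / n_tokens) · p_j · (f_j − Σ_i f_i · p_i) for one token's logits z. The sketch below evaluates that expression; the function name `aux_grad_logits` is hypothetical and the derivation is an assumption of this example, not code from the chapter's kernel.

```c
#include <assert.h>

/* Gradient of L_aux = N * sum_i f_i * p_i with respect to ONE token's router
   logits, treating f (hard top-k load) as a constant:
       dL/dz_j = (N / n_tokens) * p_j * (f_j - sum_i f_i * p_i)
   probs: this token's softmax output, f: batch load fractions,
   grad:  output, one entry per expert. */
void aux_grad_logits(const float *probs, const float *f, int n_experts,
                     int n_tokens, float *grad) {
    float weighted = 0.0f;
    assert(n_experts > 0 && n_tokens > 0);
    for (int i = 0; i < n_experts; i++) weighted += f[i] * probs[i];
    for (int j = 0; j < n_experts; j++)
        grad[j] = ((float)n_experts / (float)n_tokens)
                  * probs[j] * (f[j] - weighted);
}
```

The sign is the useful part. An expert whose load f_j is above the probability-weighted average gets a positive gradient, so gradient descent lowers its logit; an under-loaded expert gets a negative gradient and its logit rises. With f = (1, 0) and p = (0.5, 0.5) for a single token and two experts, the gradient is (+0.5, −0.5).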

Practical α:

The total loss is L = L_LM + α · L_aux. α ≈ 0.001-0.01 in most papers. Too low: collapse. Too high: router becomes uniformly random, destroying specialisation. The sweet spot is usually empirically tuned per architecture.

↳ §17.3 + Shazeer 2017

The trade-off — aux loss hurts specialisation

The aux loss does what it says, but at a cost: it forces the router to spread tokens “fairly” even when fairness conflicts with specialisation. A token that genuinely best fits expert 5 might get routed to expert 9 because expert 5 is “over-utilised.”

This trade-off is the source of every MoE training paper’s tuning struggle. Too low α: collapse. Too high α: balanced but bland routing, no specialisation, MoE behaves like a noisy dense FFN with k/N of the capacity.

— think, then check —

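The ambiguity: 4 of 16 experts carrying 80% of the tokens can mean either collapse or legitimate specialisation, and telling them apart takes more than the utilisation histogram.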

Signs it’s COLLAPSE (bad):

The 4 dominant experts are the SAME for every layer (e.g., always experts 0, 3, 7, 11). This suggests a routing artifact, not specialisation.
The 12 unused experts have near-zero gate probabilities for all tokens — they’re not just “rarely picked”, they’re “completely ignored.”
The 4 dominant experts’ parameter norms are growing while the unused experts’ norms stay near initialisation.
Token attribution: when you mask out the unused 12 experts entirely, perplexity barely changes. The model effectively isn’t using them.
The aux loss is decreasing slowly or not at all, even though it should be the strongest “balance” signal.

Signs it’s SPECIALISATION (good):

Different layers’ dominant experts are DIFFERENT (layer 1 uses experts 2, 5, 8; layer 2 uses experts 0, 3, 11; etc.). This suggests each layer found a different specialisation pattern.
The 12 less-used experts STILL have meaningful gate probabilities (5-10% on some tokens) — they’re handling rare patterns, not abandoned.
Masking out a less-used expert HURTS perplexity slightly: these experts are contributing to specific token types.
Token-type analysis: the dominant experts handle “common” patterns (function words, frequent subjects); the rare experts handle specific patterns (rare topics, code, math).

Additional signals:

Routing entropy. Compute H(routing) = E_token [H(p_token)]. Healthy MoE has medium entropy (the router makes specific but not extreme choices). Collapse has very low entropy (always picks the same).
Expert coverage. Over a batch, what fraction of experts receive a meaningful share of the k · N_tokens routing assignments? Healthy MoE has high coverage (nearly every expert is selected for some tokens). Collapse has low coverage.
Expert ablation. Drop each expert one at a time and measure perplexity. Healthy MoE shows roughly uniform perplexity increases. Collapse shows huge increases for the dominant experts and essentially none for the unused ones.

The fix if it’s collapse:

Increase aux loss α (e.g., 0.001 → 0.01). Re-train.
Add capacity-factor capping during training: if expert i is over-utilised in a batch, hard-stop adding tokens to it (forcing the next-best expert).
Re-initialise the under-used experts and continue training.
Switch to a different balancing approach: DeepSeek V3’s loss-free bias updates (next section) or sequence-level balancing.
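The capacity-capping fix in the list above can be sketched for the top-1 case. This is an illustrative version under stated assumptions: tokens are processed in order, an over-full expert is skipped in favour of the next-best one, and the name `route_with_capacity` is hypothetical. Some systems (Switch Transformer, for example) drop overflow tokens instead of re-routing them.

```c
#include <assert.h>
#include <stddef.h>

/* Top-1 routing with a hard per-expert capacity. Tokens are processed in
   order; each takes its highest-scoring expert that still has room.
   scores:   [n_tokens * n_experts] router scores
   assigned: [n_tokens] output expert id, or -1 if every expert is full
   load:     [n_experts] output token counts per expert */
void route_with_capacity(const float *scores, size_t n_tokens, int n_experts,
                         int capacity, int *assigned, int *load) {
    assert(n_experts > 0 && capacity >= 0);
    for (int i = 0; i < n_experts; i++) load[i] = 0;
    for (size_t t = 0; t < n_tokens; t++) {
        const float *row = scores + t * (size_t)n_experts;
        int best = -1;
        for (int i = 0; i < n_experts; i++) {
            if (load[i] >= capacity) continue;       /* expert is full */
            if (best < 0 || row[i] > row[best]) best = i;
        }
        assigned[t] = best;
        if (best >= 0) load[best]++;
    }
}
```

With 4 tokens that all prefer expert 0, 2 experts, and a capacity of 2, the first two tokens go to expert 0 and the last two are pushed to expert 1. The capacity is usually set as capacity_factor · k · n_tokens / n_experts; the processing order decides which tokens lose their first choice, which is one reason this is a blunt instrument.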

Real frontier MoE training pipelines instrument all these signals; “collapse vs specialisation” diagnostics are a daily part of MoE development.

↳ §17.3 + production MoE training

DeepSeek V3 — auxiliary-loss-free load balancing

DeepSeek V3 (2024) adopted a notable refinement: balance the load without relying on an explicit auxiliary loss. The motivation was empirical: aux loss forces “fairness” that conflicts with the router’s quality signal, often hurting downstream evaluation by 0.1-0.3 perplexity. Eliminating it while preserving balance recovers that quality.

DeepSeek V3 auxiliary-loss-free balancing:

Add a per-expert bias term b_i (one scalar per expert, no gradient required):

  logits_routing = x · W_router + b
      ↓ (top-k selection on the biased logits)
      ↓ (gate weights normalised over the selected k, computed from the
         un-biased scores: b only affects WHICH experts are selected)

Update b after each batch:

  For each expert i:
    if expert i is over-utilised in this batch:  b_i -= ε
    if expert i is under-utilised in this batch: b_i += ε
  (ε ≈ small constant like 0.001)

Properties:
  - b is updated ONLY via the heuristic update, not via gradient.
  - The router itself trains normally; only the bias prevents extreme concentration.
  - At inference, b is FROZEN and no longer updated.
  - No aux loss = no gradient signal pushing the router away from its quality preference.
  - Reported result: better load balance AND better task quality vs standard aux loss.
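The bias update rule above fits in a few lines. In this sketch, "over-utilised" is taken to mean "received more than the mean load in this batch", which is an assumption of the example, as is the function name `update_routing_bias`.

```c
#include <assert.h>

/* One loss-free balancing step: nudge each expert's routing bias by a fixed
   step eps, down if it received more than the mean load this batch, up if it
   received less. load[i] = number of assignments to expert i in the batch.
   ASSUMPTION: the over/under threshold is the batch-mean load. */
void update_routing_bias(const int *load, int n_experts, float eps, float *bias) {
    long total = 0;
    assert(n_experts > 0);
    for (int i = 0; i < n_experts; i++) total += load[i];
    for (int i = 0; i < n_experts; i++) {
        /* Compare load[i] with total / n_experts without integer division. */
        long scaled = (long)load[i] * n_experts;
        if (scaled > total)      bias[i] -= eps;   /* over-utilised  */
        else if (scaled < total) bias[i] += eps;   /* under-utilised */
    }
}
```

The step size does not depend on how far off the load is, only on its sign, so a badly over-loaded expert is corrected at the same rate as a slightly over-loaded one. The function is called once per training batch, outside the backward pass, and skipped at inference so that b stays frozen.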

The DeepSeek approach decouples “what the router prefers” (learned via the LM loss) from “how to prevent collapse” (a heuristic outside the gradient). The router can learn to specialise without being penalised for the resulting imbalance; the bias term separately prevents pathological concentration.

— think, then check —

The conflict the standard aux loss creates:

The LM loss L_LM wants the router to pick experts that produce good token predictions — possibly heavily favoring a few experts.

The aux loss L_aux wants the router to balance load — possibly forcing it to pick experts that are worse for the specific token.

The router’s gradient is ∂(L_LM + α · L_aux) / ∂ router_params. The two terms PULL IN OPPOSITE DIRECTIONS for over-utilised experts. Throughout training, the router must compromise between quality and balance.

Result: router specialisation is partially suppressed. The MoE benefit is reduced (the experts are less specialised than they could be).

The DeepSeek fix:

Add a per-expert bias b_i to the routing logits. Update b heuristically (no gradient): bump b_i down if expert i is over-utilised this batch, up if under-utilised. The bias provides the balancing signal OUTSIDE the gradient.

Now: ∂L_LM / ∂ router_params is the only gradient. The router optimises purely for quality. Balance is enforced by the (non-trainable) bias correction.
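A sketch of the selection step makes the separation visible. Following the DeepSeek-V3 report, the bias is added only when choosing the top-k experts, and the gate weights that scale the expert outputs come from the un-biased scores. The function name `biased_topk` and the assumption that scores are positive (for example sigmoid affinities) belong to this example, not to any published code.

```c
#include <assert.h>

/* Select the top-k experts by (score + bias), then compute gate weights from
   the UN-BIASED scores of the selected experts, normalised to sum to 1.
   ASSUMPTION: scores are positive (e.g. sigmoid affinities).
   ids and weights are outputs of length k. */
void biased_topk(const float *scores, const float *bias, int n_experts, int k,
                 int *ids, float *weights) {
    float sum = 0.0f;
    assert(k > 0 && k <= n_experts);
    for (int j = 0; j < k; j++) {
        int best = -1;
        for (int i = 0; i < n_experts; i++) {
            int taken = 0;
            for (int m = 0; m < j; m++) if (ids[m] == i) taken = 1;
            if (taken) continue;                   /* already selected */
            if (best < 0 || scores[i] + bias[i] > scores[best] + bias[best])
                best = i;
        }
        ids[j] = best;
        sum += scores[best];                       /* bias NOT included */
    }
    assert(sum > 0.0f);
    for (int j = 0; j < k; j++) weights[j] = scores[ids[j]] / sum;
}
```

With scores (0.9, 0.5, 0.4, 0.1) and zero bias, top-2 picks experts 0 and 1. After the bias for expert 0 has been pushed down to −0.6, the same token picks experts 1 and 2 instead, with weights 0.5/0.9 and 0.4/0.9. The bias changed which experts run but never entered the weights, so no gradient flows through it.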

Why this works in principle:

The router learns the “true” quality preference (which experts are best for which tokens). The bias acts as a “force field” that pushes the router away from collapse without distorting its quality signal. At convergence, the bias stops growing (the system has reached a balanced fixed point where preference and balance agree).

What could go wrong:

Bias update too slow: if ε is too small, the bias can’t catch up to growing imbalances during training. Collapse happens before balance kicks in.
Bias update too aggressive: if ε is too large, the bias oscillates between over and under correction, never settling. Routing becomes unstable.
Bias provides no signal to the router: the router can’t learn that “expert 5 is over-utilised” because the bias intercepts that signal. The router might keep trying to route to expert 5 because its weights are biased toward it, only to be redirected by the bias term. This can hurt training dynamics.
Inference distribution shift: at inference, the bias is frozen. If inference data has a different routing distribution than training (different domain, different language mix), the frozen bias might be miscalibrated, leading to balance issues.
Hyperparameter sensitivity: the bias-update rule has tunable knobs (ε, update frequency, smoothing) that need empirical tuning per architecture. Less robust than aux loss for a given hyperparameter budget.

Empirically: DeepSeek V3 reports better downstream quality with comparable or better load balance vs the aux-loss baseline. Variations of this approach have since been explored by other labs.

The deeper insight: in MoE, the right design philosophy is “let the router learn quality; enforce constraints via non-learnable mechanisms.” This is a recurring pattern in ML — separating “what the network learns” from “what the system constrains” often yields better empirical results than rolling both into the gradient.

↳ §17.3 + DeepSeek V3 paper

END OF CH.17 — Mixture of Experts.
§1 (routing + sparse activation: Mixtral 8x7B = 47B total / 13B active) · §2 (capacity vs compute asymmetry: DeepSeek V3 671B / A37B at 18× sparsity, bandwidth bottleneck) · §3 (expert parallelism + load balancing: aux loss, DeepSeek’s loss-free balancing).

Next: Ch.18 — Alignment. SFT, RLHF, the closed-form DPO derivation, and modern simplifications (GRPO, RLAIF). Turning a next-token predictor into a usable assistant.