Mixture-of-Experts

§1 Routing + sparse activation
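A Mixture-of-Experts (MoE) layer replaces a single FFN with N parallel FFNs (experts) plus a router. For each token, the router scores all N experts and selects the top k; only those k run. Total parameters scale with N (more capacity), while compute per token scales with k (roughly fixed cost). Mixtral 8x7B, with 8 experts and k=2 routing, has about 47B total parameters and about 13B active per forward pass. MoE is now common in frontier models: DeepSeek V3 and Gemini 1.5 are publicly documented as MoE, while the architectures of GPT-4 and Claude 3 have not been officially disclosed (GPT-4 is widely reported to be MoE).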
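
The routing step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not any particular model's implementation: the names `top_k_route`, `moe_forward`, and `make_expert` are hypothetical, the experts are toy scaling functions rather than real FFNs, and the gate weights are computed as a softmax over only the selected logits, which is one common convention (others apply the softmax over all N logits first).

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(logits, k):
    """Pick the k highest-scoring experts and normalize their gate weights.

    Assumption: the softmax is taken over the selected logits only.
    """
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    gates = softmax([logits[i] for i in top])
    return top, gates

def moe_forward(x, router_w, experts, k):
    """One token through a toy MoE layer: only the k chosen experts run."""
    # Router is a single linear map: one logit per expert.
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in router_w]
    chosen, gates = top_k_route(logits, k)
    out = [0.0] * len(x)
    for i, g in zip(chosen, gates):
        y = experts[i](x)  # the other N - k experts are never called
        out = [o + g * yi for o, yi in zip(out, y)]
    return out, chosen, gates

# Toy setup (hypothetical): expert i scales its input by (i + 1)
# and records that it was called.
calls = []

def make_expert(i):
    def expert(x):
        calls.append(i)
        return [(i + 1) * xi for xi in x]
    return expert

experts = [make_expert(i) for i in range(4)]
# Identity router weights, so the logits equal the input vector.
router_w = [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]
out, chosen, gates = moe_forward([3.0, 1.0, 2.0, 0.0], router_w, experts, k=2)
```

With N=4 and k=2, `calls` ends up holding just two expert indices: parameter count grows with the length of `experts`, while the work done per token depends only on `k`.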
§2 Capacity vs compute — the asymmetric scaling
An MoE model has two parameter counts: total (resident in HBM, dominates memory cost) and active (used per token, dominates compute cost). The active/total ratio can be about 1/4 (Mixtral, 13B of 47B) or about 1/18 (DeepSeek V3, 37B active of 671B total). Total parameters are paid for in HBM capacity; active parameters are paid for in FLOPs per token. This section walks through the economics, the DeepSeek V3 numbers, and why memory bandwidth, not FLOPs, is the real bottleneck for MoE inference.
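
The two budgets can be checked with back-of-the-envelope arithmetic. The sketch below is illustrative only and rests on stated assumptions: weights stored at 2 bytes per parameter (bf16), roughly 2 FLOPs per active parameter per token for a forward pass, and batch-size-1 decoding where every active weight is read once per token. The function name `moe_budget` is hypothetical; the parameter counts are the ones quoted above.

```python
def moe_budget(total_params, active_params, bytes_per_param=2):
    """Rough memory/compute budget for an MoE model.

    Assumptions (not measurements): bf16 weights, ~2 FLOPs per active
    parameter per token, attention and KV-cache costs ignored.
    """
    flops_per_token = 2 * active_params
    active_bytes = active_params * bytes_per_param
    return {
        "active_ratio": active_params / total_params,
        "weight_bytes": total_params * bytes_per_param,  # must fit in HBM
        "flops_per_token": flops_per_token,
        "active_bytes_per_token": active_bytes,          # read per token at batch 1
        # FLOPs per byte of weights read at batch size 1
        "flops_per_byte": flops_per_token / active_bytes,
    }

mixtral = moe_budget(total_params=47e9, active_params=13e9)
deepseek_v3 = moe_budget(total_params=671e9, active_params=37e9)

for name, b in [("Mixtral 8x7B", mixtral), ("DeepSeek V3", deepseek_v3)]:
    print(
        f"{name}: active/total = 1/{1 / b['active_ratio']:.1f}, "
        f"weights = {b['weight_bytes'] / 1e9:.0f} GB, "
        f"FLOPs/token = {b['flops_per_token'] / 1e9:.0f} G, "
        f"FLOPs per weight byte = {b['flops_per_byte']:.1f}"
    )
```

Under these assumptions the arithmetic intensity at batch size 1 is about 1 FLOP per byte of weights read, far below what current accelerators need to be compute-bound, which is the sense in which memory bandwidth rather than FLOPs limits MoE inference.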
§3 Expert parallelism + load balancing
MoE introduces two engineering problems with no dense equivalent. (1) Experts must be sharded across GPUs (expert parallelism), so tokens routed to experts on other devices require all-to-all communication. (2) The router can degenerate, sending most tokens to a few experts ("expert collapse"), which destroys the capacity benefit. The canonical fix is an auxiliary load-balancing loss (Shazeer 2017); DeepSeek (2024) replaced it with an auxiliary-loss-free scheme that updates per-expert routing biases. This section walks through both, with a kernel showing the failure mode.
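
To make the collapse problem concrete, here is a small sketch of an auxiliary load-balancing loss. It uses the Switch Transformer style formulation, N * sum_i(f_i * P_i), where f_i is the fraction of tokens dispatched to expert i and P_i is the mean router probability for expert i. This is an assumption chosen for brevity and is not necessarily the exact loss from Shazeer 2017; top-1 routing is also assumed, and the function name `load_balance_loss` is hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def load_balance_loss(router_logits):
    """Switch-style auxiliary loss for top-1 routing (illustrative sketch).

    router_logits: one list of N expert logits per token.
    Returns (loss, f), where f[i] is the fraction of tokens sent to expert i.
    """
    n_tokens = len(router_logits)
    n_experts = len(router_logits[0])
    probs = [softmax(row) for row in router_logits]

    # f_i: fraction of tokens whose top-1 choice is expert i.
    f = [0.0] * n_experts
    for p in probs:
        winner = max(range(n_experts), key=lambda i: p[i])
        f[winner] += 1.0 / n_tokens

    # P_i: mean router probability assigned to expert i.
    P = [sum(p[i] for p in probs) / n_tokens for i in range(n_experts)]

    loss = n_experts * sum(fi * Pi for fi, Pi in zip(f, P))
    return loss, f

# Balanced router: token t strongly prefers expert t.
balanced_logits = [[10.0 if e == t else 0.0 for e in range(4)] for t in range(4)]
# Collapsed router: every token strongly prefers expert 0.
collapsed_logits = [[10.0, 0.0, 0.0, 0.0] for _ in range(4)]

balanced_loss, balanced_f = load_balance_loss(balanced_logits)
collapsed_loss, collapsed_f = load_balance_loss(collapsed_logits)
```

In this toy setup the loss sits at its minimum of about 1 when tokens are spread evenly and approaches N (here 4) when every token goes to one expert, so adding a scaled copy of it to the training loss pushes the router away from collapse.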
