Tensor + pipeline + expert parallelism
Data parallelism scales batch size but doesn’t address per-rank model size. ZeRO/FSDP shard the optimizer/gradient/parameter state but require ALL-GATHER per layer, capping throughput. For truly large models (200B+) trained at scale, three additional parallelism axes come into play: tensor parallelism (TP) splits each matmul across GPUs; pipeline parallelism (PP) splits the model’s layers into sequential stages; expert parallelism (EP) for MoE places different experts on different GPUs. Combined with DP, this forms a 3D (or 4D with EP) parallelism grid that maps the model onto a cluster. This section walks each axis and their interactions.
Tensor parallelism — splitting matmuls
Megatron-LM (Shoeybi 2019) introduced the canonical tensor parallelism design. The idea: split each large matmul across TP_size GPUs, so each GPU only holds a SLICE of each weight matrix and computes a PARTIAL output.
Tensor parallelism is COMMUNICATION-INTENSIVE — each transformer block does 2 all-reduces of size ~N×d per pass. For high throughput, TP must use NVLink-class bandwidth. Going across nodes (InfiniBand) destroys throughput.
Pipeline parallelism — splitting layers
While TP splits each layer, pipeline parallelism splits the model into sequential layers groups:
The pipeline bubble
Pipeline bubble is the main efficiency loss in PP. More microbatches = less bubble, but at the cost of activation memory (need to keep K microbatches’ activations alive in memory).
Modern improvements (Qi 2024 “Zero Bubble Pipeline”) split the backward pass into smaller stages and reorder operations to eliminate the bubble almost entirely. At frontier scale (Llama 3, DeepSeek V3), bubble overhead is reduced to ~5% or less.
Setup:
70B model. TP=8 (within node, NVLink), DP=128 (across nodes, InfiniBand). Total 1024 GPUs.
Each TP group has its own copy of the model (8 GPUs share one model). Each DP rank has a TP group.
Effective per-GPU model size: 70B / 8 = 8.75B params.
TP communication (within each node):
Per transformer block: 2 all-reduces of size ~N · d · 2 bytes = 8K · 4K · 2 = 64 MB each.
Per layer: 2 all-reduces. Per forward + backward: 4 all-reduces per layer.
80 layers × 4 all-reduces × 64 MB = 20 GB of TP traffic per microbatch per pipeline stage.
With ring all-reduce among 8 GPUs (factor 7/8 efficiency): per-GPU per-link traffic = 20 GB · 2 · 7/8 = 35 GB per pass.
At NVLink 900 GB/s: 35 GB / 900 GB/s = 40 ms per pass.
Per training step with 128 microbatches: 128 · 40 ms = 5 seconds of TP comm? That’s a lot.
Actually: TP all-reduces happen WITHIN each microbatch. Per microbatch: 80 layers × 2 all-reduces × 64 MB = 10 GB → 10 ms. Per step with 128 microbatches: 128 · 10 = 1280 ms.
Still significant but ~30% of step time. Modern impl overlaps TP comm with compute, reducing this further.
DP communication (across nodes):
Per step: ONE all-reduce of full gradient. Gradient size: 70B / 8 (TP shard) = 8.75B params → 17.5 GB in bf16.
Ring all-reduce among 128 DP ranks: per-link traffic = 17.5 GB · 2 · 127/128 = 35 GB per link.
At InfiniBand 50 GB/s: 35 GB / 50 GB/s = 700 ms per step.
This is the dominant communication cost. At frontier scale, this is often the bottleneck.
Why this split:
- TP within node: needs NVLink bandwidth (900 GB/s) to be tractable; per-block frequent.
- DP across nodes: only one all-reduce per STEP (not per block); InfiniBand bandwidth (50 GB/s) is adequate because of the lower frequency.
Putting TP across nodes would multiply traffic on the slow InfiniBand link by ~80 (number of layers) — completely infeasible.
Putting DP within node would force you to either have very large microbatches (memory limit) or under-utilise the NVLink bandwidth for the rare DP all-reduce.
The hierarchical mapping matches the communication hierarchy to the bandwidth hierarchy. This is THE central insight of distributed LLM training infrastructure.
Expert parallelism — for MoE
For MoE models (Ch.17), the experts are an additional parallelism axis. Each rank holds a subset of the experts.
The 3D / 4D parallelism grid
For Llama 3 70B at 16K GPUs:
Bubble math (1F1B):
Efficiency = K / (K + 2(P-1)).
P = 8 stages.
For K = 16: efficiency = 16 / (16 + 14) = 16 / 30 = 53%. Half the time is bubble.
For K = 64: efficiency = 64 / (64 + 14) = 64 / 78 = 82%. Bubble is much smaller.
For K = 128: efficiency = 128 / 142 = 90%.
Why more K is better:
More microbatches → less proportional time in the startup/teardown bubble → higher utilisation.
Why you can’t just use K = infinity:
Each microbatch’s activations must be kept ALIVE in memory until that microbatch’s backward pass completes. With K microbatches in flight, you have K × activation_memory bytes pinned.
For a 70B model with batch size 8K tokens per microbatch and ~32 transformer layers’ worth of activations: each microbatch’s activations ≈ 8K · 4096 · 2 · 32 ≈ 2 GB.
With K = 128: 128 · 2 = 256 GB of activation memory per pipeline stage. Too much for a single GPU.
With K = 16: 16 · 2 = 32 GB. Fits in 80 GB H100 with model + Adam state.
The trade-off:
- Few K (e.g., 16): less activation memory, low PP efficiency (40-60%).
- Many K (e.g., 128): high PP efficiency (90%+), but doesn’t fit in memory.
Optimal K depends on how much activation memory each microbatch consumes vs how much GPU memory is available.
Activation checkpointing helps:
Recompute activations during backward instead of storing them. Reduces activation memory ~4-8× at cost of 25-30% more compute.
Allows higher K with the same memory budget. Llama 3 uses K = 128 with checkpointing for ~95% PP efficiency.
Zero Bubble pipeline (Qi 2024):
Recent technique that splits the backward pass into two stages: gradient w.r.t. inputs (small) and gradient w.r.t. weights (large). Overlap them to eliminate the bubble entirely. Used in some 2024+ training runs.
Even with the bubble: PP is still chosen because it enables training models that wouldn’t fit otherwise. 5-10% efficiency loss is acceptable when the alternative is “can’t train at all.”
Setup:
671B params total, 37B active per token. 256 experts, top-8 routing. 60+ transformer layers.
2K H100s (256 nodes × 8 GPUs/node).
Per-rank memory needs:
Without sharding: 671B × 2 bytes = 1.34 TB of weights. Plus Adam, gradients, activations. Far too much.
Parallelism strategy (one likely configuration):
Expert parallelism (EP): 8-16. Each rank holds 16-32 experts (out of 256). Within a node, 8 GPUs share the 256 experts. EP is constrained to within-node because all-to-all communication is bandwidth-intensive.
Tensor parallelism (TP): 2-4. The attention and per-expert FFN matrices are TP-sharded. Within node alongside EP.
Pipeline parallelism (PP): 4-8. Layers split across pipeline stages. Each stage is a node-group (PP between nodes).
Data parallelism (DP): Whatever fits: DP = 2048 / (PP · TP · EP) = 2048 / 32 ≈ 64.
Effective batch: 64 (DP) × N (microbatches) per step.
Memory check (per rank):
- Attention weights (TP=4): per-layer attention params / 4. Small.
- FFN / Expert weights (EP=16, 16 experts × ~2.3B / 16 = 2.3B per rank for experts). Plus shared expert.
- Total model: ~30-40B params per rank.
- In fp16: ~60-80 GB. Tight on 80 GB H100, needs aggressive ZeRO.
Actual config: likely combines all of the above + ZeRO Stage 1 (shard Adam) + activation checkpointing.
Communication hierarchy:
- EP all-to-all: intra-node NVLink (per-layer, frequent).
- TP all-reduce: intra-node NVLink (per-layer).
- PP point-to-point: inter-node InfiniBand (per-microbatch).
- DP all-reduce: inter-node InfiniBand (per-step).
The bandwidth-mapping rule:
Higher-frequency communications go on faster links. Per-step DP all-reduce can tolerate InfiniBand; per-layer TP and EP cannot.
DeepSeek’s published configuration (from their paper):
They actually run with EP=8 (intra-node), PP=8 (pipeline across nodes), DP across the remaining. Plus selective activation checkpointing. Their throughput is reportedly ~300 TFLOPs/GPU sustained — quite good for an MoE.
The 3D/4D parallelism grid is a multi-dimensional optimisation problem. Each frontier lab has internal tools to search the configuration space; the answer depends on model architecture, cluster topology, and target latency/throughput.
Next: §24.3 — Context parallelism and the multi-node systems problem. The newest parallelism axis (for very long sequences), and the broader systems picture.