Data parallelism + ZeRO/FSDP
Single-GPU training works up to about 1B parameters. For anything larger, you need to distribute the work across multiple GPUs. The simplest distribution — data parallelism (DP) — replicates the entire model on every GPU and gives each a different chunk of the training batch. Each GPU computes its own gradients; gradients are averaged via all-reduce; each GPU applies the same optimizer update. Beautifully simple, but it requires the model + optimizer state to fit on ONE GPU. ZeRO (Rajbhandari 2020) extends DP by sharding the bookkeeping across ranks, allowing each GPU to hold only 1/N of the optimizer state, gradients, and (optionally) parameters. FSDP (PyTorch’s Fully Sharded Data Parallel) is the production implementation. This section walks both.
Plain data parallelism
The baseline DP setup:
The all-reduce step is the dominant communication. For a 70B model, ~280 GB of gradients per rank must be summed across N ranks every step.
The all-reduce cost
Ring all-reduce is the canonical algorithm. Modern libraries (NCCL for NVIDIA, RCCL for AMD) implement it directly in CUDA / ROCm kernels, taking advantage of NVLink and InfiniBand simultaneously where possible.
Setup:
1024 GPUs total, organised as 128 nodes × 8 GPUs/node.
Intra-node: NVLink at ~900 GB/s per link.
Inter-node: InfiniBand HDR at ~50 GB/s per link.
Gradient size G = 140 GB.
Ring all-reduce cost:
If the ring traverses ONLY NVLink (impossible at this scale, since you only have 8 GPUs in a node):
Traffic per link: 2·G/N = 2·140/1024 = 273 MB. At 900 GB/s: 0.3 ms per round, × (N-1) rounds ≈ 300 ms per all-reduce.
If the ring traverses ONLY InfiniBand:
Per link: 273 MB. At 50 GB/s: 5.5 ms per round × 1023 rounds ≈ 5.6 seconds. Way too slow.
The hierarchical all-reduce solution:
1. Local all-reduce within each node (over NVLink): Each node’s 8 GPUs ring-reduce their gradients. Takes 2·G·7/8 / 900 GB/s ≈ 270 ms.
2. Cross-node all-reduce (over InfiniBand): One representative per node now holds the node-local sum. 128 representatives all-reduce across nodes. Total traffic per representative: 2·G·127/128 ≈ 2·G = 280 GB. At 50 GB/s: ~5.6 seconds per representative.
3. Local broadcast within node: The representative broadcasts the result back to its 7 peers via NVLink. Fast.
Total: ~270 ms (intra-node) + ~5.6 s (inter-node) + ~30 ms (broadcast) ≈ 6 seconds per step.
Why intra-node vs inter-node matters:
If everything were NVLink: ~300 ms per step. If everything were InfiniBand: ~6 s per step. 20× ratio.
Modern training runs are sized so the model and TP/PP groups fit within nodes (NVLink), and only DP all-reduce crosses node boundaries (InfiniBand). This puts the HIGH-bandwidth ops on the FAST network and the LOW-bandwidth ops on the SLOW one.
If you scale DP across more nodes without proportional InfiniBand upgrades, the cross-node all-reduce time grows quadratically (well, log-linearly with ring all-reduce, but still). Training throughput degrades.
This is why the LLama 3 70B training uses ~16K GPUs (not 64K): the cross-node DP all-reduce becomes the bottleneck above ~16K. Going higher requires hybrid-precision gradients, asynchronous DP, or other techniques.
ZeRO — sharding the bookkeeping
Rajbhandari 2020 “ZeRO: Memory Optimizations Toward Training Trillion Parameter Models” introduced the structural fix. The insight: in plain DP, the SAME optimizer state and gradients are HELD ON EVERY RANK. This is redundancy — you don’t need N copies.
ZeRO’s three stages are increasingly aggressive memory-saving techniques with increasingly higher communication overhead. The choice depends on the model-vs-memory ratio.
FSDP — PyTorch’s incarnation
FSDP is the PyTorch-native implementation of ZeRO Stage 3:
Setup:
70B model, mixed-precision training:
- bf16 weights: 140 GB
- fp32 master weights: 280 GB
- fp32 gradients: 280 GB
- fp32 Adam m, v: 560 GB
- Total: 1260 GB
8× H100 = 8 × 80 GB = 640 GB total HBM.
Without sharding: doesn’t fit (1.26 TB > 640 GB).
Stage 1 (shard Adam m, v):
Per rank: weights (140 GB) + master (280 GB) + grads (280 GB) + Adam/8 (70 GB) = 770 GB / 8 = 96 GB. Doesn’t fit.
Stage 2 (shard grads + Adam):
Per rank: weights (140 GB) + master (280 GB) + (grads + Adam)/8 = 140 + 280 + 105 = 525 GB / 8 = 66 GB. Fits! Tight though.
Stage 3 (shard everything):
Per rank: (weights + master + grads + Adam) / 8 = 1260 / 8 = 158 GB. Fits with lots of room for activations.
Wait, that’s wrong. Let me recompute:
Stage 3 per rank: all four shards = 1260 / 8 = 158 GB. Hmm, that’s more than 80 GB.
Need to be more precise: at Stage 3, the PER-RANK memory is 1/N of EVERYTHING:
1260 / 8 = 157.5 GB. Doesn’t fit on 80 GB H100.
So even Stage 3 needs more than 8 H100s for a 70B model. In practice:
- For 70B on 8 H100s: also use bf16 master (saves 1× weights of memory). Plus CPU offload of less-frequently-used state. Hard.
- For 70B on 16 H100s: ZeRO Stage 3 gives ~79 GB/rank, fits.
- For 70B on 32 H100s: ~40 GB/rank, comfortable.
Typical: 70B fine-tuning uses 16-32 H100s with Stage 3.
Communication cost comparison:
Stage 1: only optimizer step needs comm (all-gather updated params, ~14 GB / step). Minor.
Stage 2: reduce-scatter on gradients (~140 GB / step traffic). About same as plain DP all-reduce.
Stage 3: full all-gather + reduce-scatter per LAYER per forward + backward. For 80 layers, this is significantly more communication. Throughput typically 15-25% lower than Stage 2.
So Stage 2 is optimal when memory permits (no over-communication); Stage 3 is required when you must shrink per-rank memory further (at communication cost).
For 8 H100s training 70B: not really feasible. You’d need either to (a) use 16+ GPUs, (b) reduce model size, or (c) use aggressive offloading (CPU/NVMe). Production training of frontier 70B is on 1024+ GPUs anyway.
When to use what
Why FSDP dominates:
1. Memory is the binding constraint. For 70B+ models, getting them to fit at all is hard. FSDP makes it possible on commodity 8-16 GPU clusters. The communication overhead (15-25% throughput loss) is acceptable when the alternative is “can’t train at all.”
2. Communication overhead has been optimised. Modern FSDP supports:
- OVERLAP of all-gather with compute (next layer’s all-gather starts during current layer’s compute).
- REDUCE_OP fusion (combining multiple reduce ops into one).
- FORWARD PREFETCH (all-gather params for upcoming layers).
These cut effective overhead to ~10-15% in practice.
3. FSDP is simpler than TP+PP+DP. Tensor parallelism requires modifying model code; pipeline parallelism requires custom scheduling. FSDP is “wrap the model, train as normal.” For most teams (non-frontier labs), the simpler API matters more than the extra ~10% throughput.
CPU offload — the next dimension:
FSDP supports OFFLOADING param shards (or optimizer state) to CPU RAM instead of GPU HBM. The full picture:
- HBM: full state for the layer being computed RIGHT NOW (after just-in-time all-gather).
- System RAM: optimizer state for layers not being updated. CPU RAM is cheap (TB scale).
- NVMe SSD (ZeRO-Infinity): for truly huge state, offload further to disk.
This lets you train EVEN BIGGER models on the same hardware:
- 1× H100 + 256 GB RAM + FSDP CPU offload: can fine-tune ~40B model. Slow but works.
- 8× H100 + 2 TB RAM + ZeRO-Infinity: can fine-tune 200B+ models. Very slow but possible.
The trade-off: CPU offload makes the optimizer step slow (move state to GPU for update, then back). For inference, no problem (forward only). For training: ~2-5× slowdown per step.
When CPU offload is worth it:
- Research / experimentation on a single workstation.
- Fine-tuning very large models without a giant cluster.
- QLoRA-style PEFT: 4-bit base + small adapter; FSDP shards the adapter. Memory is way smaller, but offload still helps with optimizer state for the adapter.
What this enables:
The “fine-tune 70B Llama on a single H100” tutorials use a combination of:
- QLoRA (4-bit base, small fp16 adapter).
- FSDP-style sharding (across multiple GPUs if available).
- CPU offload (for Adam state when GPU memory is tight).
- Gradient checkpointing (recompute activations during backward).
Together: the entire training fits in ~40 GB GPU memory + 100+ GB CPU RAM. Production-grade fine-tuning becomes accessible.
Next: §24.2 — Tensor, pipeline, and expert parallelism. The other parallelism dimensions that combine with DP/FSDP to make trillion-parameter training feasible.