V — The Systems That Run Them → Chapter 25
FROM SYSTEMS TO FRONTIER ML

Training at scale

Data / tensor / pipeline / expert / context parallelism, all-reduce, ZeRO/sharding. Why multi-node is a systems problem you already half-understand.

§1 Data parallelism + ZeRO/FSDP
§2 Tensor + pipeline + expert parallelism
§3 Context parallelism + the multi-node systems problem
