A book for engineers who did this math 20 years ago
FROM SYSTEMS TO FRONTIER ML
A refresher and a bridge for the software engineer who learned the math two decades ago, leading up to where frontier-lab work actually lives. Real compiled kernels, no toy Python. Every term explained. Recall built in.
I — Foundations Refreshed
- 01 Vectors, Dot Products & Norms Everything downstream — attention, embeddings, quantization, similarity search — is built from one operation you already know. Let's re-own it, precisely, and end with it running in hardware. draft
- 02 Matrices as transformations Matrices as functions, not number grids. Matmul as composition. The three independent axes. Orthogonal/rotation matrices. draft
- 03 Floating point, integers & quantization error IEEE-754 refreshed, fixed-point and integer arithmetic, where quantization error comes from. Kernel: int8 dot with _mm_maddubs_epi16. draft
- 04 Calculus & gradients refreshed Derivatives as sensitivity, the chain rule, Jacobians — quiet setup for backprop. draft
II — Probability, Geometry & Learning
- 05 Distributions, variance, expectation N(0,1), variance as spread, why 'isotropic vs anisotropic' is the whole game in quantization. draft
- 06 High-dimensional geometry Concentration of measure, near-orthogonality of random vectors, the curse — and how rotation fights it. draft
- 07 Random projections & Johnson–Lindenstrauss The theoretical license to compress: squash dimensions, approximately preserve pairwise distances. Connects to QJL and the JL-correction in TurboQuant. draft
- 08 What 'learning' actually is Loss, gradient descent, SGD/Adam, regularization (then→now), overfitting. Modern terminology vs your 2007 version. draft
- 09 Backpropagation from scratch Autodiff, the computation graph, why frameworks exist. Kernel: a tiny hand-written autograd. draft
III — The Neural Network, Assembled
- 10 Neurons, layers, MLPs Perceptron → modern MLP, activation functions (ReLU/GELU/SwiGLU) and what changed. draft
- 11 Tokens & embeddings BPE tokenization, the embedding table, why text becomes vectors. Positional encoding → RoPE (the modern consensus). draft
- 12 Softmax & the exponential family Smooth argmax, numerical stability (the max-subtraction trick), online/streaming softmax. Kernel: vectorized stable softmax. draft
- 13 Attention, fully assembled QKᵀ, scaling, softmax, V; multi-head → GQA/MQA. FlashAttention as tiling + online softmax. The capstone of Part III. draft
- 14 Normalization & residuals LayerNorm → RMSNorm with full derivations, the residual stream, why deep nets train at all. draft
IV — What Makes an LLM
- 15 The GPT architecture, end to end Decoder-only stack, the residual stream as a highway, logits → sampling. Encoders vs decoders vs encoder-decoder — and why decoder-only won. draft
- 16 Pretraining Next-token prediction, the data pipeline, scaling laws (Chinchilla), what a 'token budget' is. draft
- 17 Mixture-of-Experts Routing, sparse activation, why capacity ≠ compute. Load-bearing in most openly documented 2025 frontier models. draft
- 18 Alignment: RLHF → DPO → GRPO Turning a next-token predictor into an assistant. Reward models, preference optimization, the modern simplifications. draft
- 19 Fine-tuning, LoRA, and PEFT How to adapt a pretrained model to your task without retraining all 70B parameters. Full fine-tuning vs feature-based extraction vs parameter-efficient methods (LoRA, QLoRA, adapters). The LoRA math, including why a rank-8 update matrix captures most of the adaptation signal, falls out of the intrinsic-dimensionality argument (Aghajanyan et al., 2020). draft
- 20 Reasoning models The 2024–2025 shift to test-time compute scaling. o1, R1, Claude extended thinking. The recipe, as publicly described for R1: train a base model, SFT on long chain-of-thought, then RL with verifiable rewards (GRPO + rule-based or process reward models). Inference cost rises 10–100× to buy quality. The biggest paradigm shift of the post-Chinchilla era. draft
- 21 Beyond transformers SSMs and Mamba, why people are looking past attention, an honest assessment of where this stands. draft
V — The Systems That Run Them
- 22 The hardware substrate GPU memory hierarchy (HBM↔SRAM, the FlashAttention motivation generalized), Tensor Cores, TPUs, Apple Silicon, the roofline model. draft
- 23 Runtimes & frameworks PyTorch (eager vs compiled), CUDA, ONNX, MLX. What a 'kernel' is, the dispatch stack. A real CUDA dot product alongside its CPU SIMD twin. draft
- 24 Inference at scale KV cache, PagedAttention, continuous batching, prefill/decode disaggregation, speculative decoding (vLLM anatomy). draft
- 25 Training at scale Data / tensor / pipeline / expert / context parallelism, all-reduce, ZeRO/sharding. Why multi-node is a systems problem you already half-understand. draft
- 26 Quantization in practice PTQ basics (int8/int4, scale + zero point), the LLM.int8()/GPTQ/AWQ methods, the GGML/llama.cpp quantization formats (q4_0..q6_K, q4_K_M vs q4_K_S, IQ-quants, imatrix), quantization-aware training (STE, BitNet), and fine-tuning on a quantized base (QLoRA). draft
VI — The Frontier
- 27 Vector search & ANN Exact kNN → HNSW; traversal as a reduction; rotation-based quantization (RaBitQ/TurboQuant) from first principles, landing on the open research seam. draft
- 28 Reading research like a researcher How to attack a paper, what frontier-lab work actually looks like day to day, where the open problems are. draft
S — Spotlight: production accelerators