Gradient descent & SGD
§8.1 set up the goal: minimise empirical risk R̂(θ) = (1/N)Σ L(f_θ(xᵢ), yᵢ). This section is how. Gradient descent takes small steps in the direction −∇R̂(θ); stochastic gradient descent (SGD) uses one randomly-sampled mini-batch’s gradient per step instead of averaging across the whole dataset. The mini-batch trade is the entire reason modern ML works: averaging a gradient over N = 10⁹ examples per step is prohibitive, but each mini-batch’s gradient is an unbiased estimate of the full-batch gradient — and by the √N law (Ch.5 §1), the noise on each step scales as 1/√B for batch size B. So a mini-batch of size 512 gives you a gradient with noise ~σ/√512 ≈ σ/23 — usable for almost any optimiser.
Gradient descent
The simplest gradient method. Pick a learning rate η; iterate:
Each step requires one full pass over the training set to compute the gradient. For modern NN training (N = 10⁹ tokens, model parameter gradients computed via Ch.4 §3’s backprop), that’s prohibitively expensive — a single gradient step would take longer than most jobs allow.
But the structure is right. Each step moves θ in the direction that locally decreases R̂ fastest. Pick η small enough and you converge to a local minimum (provably, for convex losses; empirically, for the highly non-convex NN losses).
Stochastic gradient descent
The trick: don’t average over all N examples each step. Sample B of them (the mini-batch) and compute the gradient on just those:
This is stochastic gradient descent — stochastic because the gradient is now a random variable (depends on which batch we sampled). Two key properties:
- Unbiased. E[∇R̂_B(θ)] = ∇R̂(θ): the mini-batch gradient, in expectation across all possible random batches, equals the full-batch gradient. So SGD takes the right step in expectation; it just adds noise.
- Noisy. By the √N law from Ch.5 §1, the standard deviation of ∇R̂_B is σ_∇ / √B where σ_∇ is the per-example gradient variance. Doubling B reduces gradient noise by √2; quadrupling B halves it.
The trade is: each SGD step is N/B times cheaper than a full-batch step, but noisier by a factor of √(N/B). For N = 10⁹ and B = 512, you’ve made each step ~2 million times cheaper at the cost of ~1400× more noise per step. The numbers look bad, but they work — because the noise averages out across steps. Train for 10⁵ SGD steps, and you’ve effectively averaged across 10⁵ × 512 = 5 × 10⁷ random examples, which is a low-noise estimate of the full-batch gradient direction integrated over the training trajectory.
SGD’s noise is also not always bad. Empirically, the noise helps escape saddle points and shallow local minima — a fully-converged gradient method on a non-convex landscape can get stuck where SGD doesn’t. The “implicit regularisation” of SGD is one of the modern explanations for why deeply over-parameterised networks generalise well despite VC bounds predicting overfitting (§8.1).
θ_{t+1} = θ_t − η · ∇R̂(θ_t).
η controls how big a step we take in the direction of steepest descent. Too small: training takes forever (many tiny steps). Too large: each step overshoots and the loss diverges. Most production training schedules ramp η up at the start (warmup) and decay it later (cosine / step / linear decay), to balance fast initial progress with eventual fine convergence.
Batch size = compute vs noise
The √N law tells you how to pick batch size:
This is the critical batch size argument from McCandlish et al. 2018 (“An Empirical Model of Large-Batch Training,” OpenAI). Below the critical batch size, doubling B halves wall-clock to a given accuracy. Above it, bigger batches don’t help — the noise is already small enough that further reduction is wasted. The critical batch size is workload-dependent: ~256 for small classification, ~1M for large LM training, ~10M for very large frontier-model training.
Production practice: set B to the largest value that fits in memory (or that respects the critical batch size, whichever is smaller), tune learning rate accordingly (the “linear scaling rule”: double B, double η).
Mini-batch gradient noise scales as σ/√B. To halve the noise: 1/√B should halve, so B should quadruple. New B = 256.
Cost: 4× the compute per step (since per-step compute is linear in B for memory-bound NN training). 4× compute for 2× noise reduction is the standard “√N” trade — diminishing returns.
Operational consequence: there’s a sweet spot (‘critical batch size,’ McCandlish 2018) above which further batch increases waste compute. Below it, batch increases linearly speed convergence. Production training picks B at or just below the critical batch size to maximise compute-to-progress efficiency. Linear scaling rule (Goyal et al. 2017): double B → double η to maintain convergence speed.
In theory: the mini-batch gradient is an UNBIASED estimate of the full-batch gradient. E[∇R̂_B(θ)] = ∇R̂(θ) — the per-batch noise averages out across steps. So over many steps, SGD converges to the SAME parameter region as full-batch GD, just along a noisier trajectory.
In practice: the noise of SGD is often a feature, not a bug. Three empirical reasons:
- Saddle escape. NN loss surfaces are non-convex with many saddle points. Full-batch GD gets stuck at saddles where the gradient is small. SGD’s noise kicks it out.
- Implicit regularisation. SGD’s trajectory has a measurable bias toward “flatter” minima (those with lower curvature around them), which empirically generalise better.
- Wall-clock compute. A full-batch step on N = 10⁹ examples costs 10⁹ × per-example compute. A mini-batch step costs 512 × per-example compute. Even if SGD takes 10× more steps to converge, it’s still 10⁵× faster in wall-clock to reach the same loss.
So SGD is not ‘approximate GD’ — it’s a different algorithm whose noisy trajectory has practical advantages over pure GD. Modern ML wouldn’t exist without the mini-batch trade; gradient noise is part of the system, not an obstacle to it.
END OF CH.8 §2 — Gradient descent & SGD.
Three recall items: easy (the update rule and what η controls), medium (the √B noise scaling and batch-size selection), hard (why mini-batch SGD is not ‘approximate’ GD but a different — and often better — algorithm).
Coming next: §8.3 — Momentum, RMSProp, Adam, AdamW. The modern optimiser family that makes large-model training tractable.