WHAT 'LEARNING' ACTUALLY IS
Section 8.2
02

Gradient descent & SGD

§8.1 set up the goal: minimise empirical risk R̂(θ) = (1/N)Σ L(f_θ(xᵢ), yᵢ). This section is how. Gradient descent takes small steps in the direction −∇R̂(θ); stochastic gradient descent (SGD) uses one randomly-sampled mini-batch’s gradient per step instead of averaging across the whole dataset. The mini-batch trade is the entire reason modern ML works: averaging a gradient over N = 10⁹ examples per step is prohibitive, but each mini-batch’s gradient is an unbiased estimate of the full-batch gradient — and by the √N law (Ch.5 §1), the noise on each step scales as 1/√B for batch size B. So a mini-batch of size 512 gives you a gradient with noise ~σ/√512 ≈ σ/23 — usable for almost any optimiser.

Gradient descent

The simplest gradient method. Pick a learning rate η; iterate:

θ_{t+1} = θ_t − η · ∇R̂(θ_t) where ∇R̂(θ) = (1/N) Σ ∇L(f_θ(xᵢ), yᵢ) ← average gradient over ALL N training examples

Each step requires one full pass over the training set to compute the gradient. For modern NN training (N = 10⁹ tokens, model parameter gradients computed via Ch.4 §3’s backprop), that’s prohibitively expensive — a single gradient step would take longer than most jobs allow.

But the structure is right. Each step moves θ in the direction that locally decreases fastest. Pick η small enough and you converge to a local minimum (provably, for convex losses; empirically, for the highly non-convex NN losses).

Stochastic gradient descent

The trick: don’t average over all N examples each step. Sample B of them (the mini-batch) and compute the gradient on just those:

θ_{t+1} = θ_t − η · ∇R̂_B(θ_t) where ∇R̂_B(θ) = (1/B) Σ_{i ∈ batch} ∇L(f_θ(xᵢ), yᵢ) ← average over B examples only The batch is freshly sampled each step (e.g. by shuffling and going through the data in order).

This is stochastic gradient descentstochastic because the gradient is now a random variable (depends on which batch we sampled). Two key properties:

The trade is: each SGD step is N/B times cheaper than a full-batch step, but noisier by a factor of √(N/B). For N = 10⁹ and B = 512, you’ve made each step ~2 million times cheaper at the cost of ~1400× more noise per step. The numbers look bad, but they work — because the noise averages out across steps. Train for 10⁵ SGD steps, and you’ve effectively averaged across 10⁵ × 512 = 5 × 10⁷ random examples, which is a low-noise estimate of the full-batch gradient direction integrated over the training trajectory.

SGD’s noise is also not always bad. Empirically, the noise helps escape saddle points and shallow local minima — a fully-converged gradient method on a non-convex landscape can get stuck where SGD doesn’t. The “implicit regularisation” of SGD is one of the modern explanations for why deeply over-parameterised networks generalise well despite VC bounds predicting overfitting (§8.1).

— think, then check —

θ_{t+1} = θ_t − η · ∇R̂(θ_t).

η controls how big a step we take in the direction of steepest descent. Too small: training takes forever (many tiny steps). Too large: each step overshoots and the loss diverges. Most production training schedules ramp η up at the start (warmup) and decay it later (cosine / step / linear decay), to balance fast initial progress with eventual fine convergence.

Batch size = compute vs noise

The √N law tells you how to pick batch size:

gradient noise per step ~ σ_∇ / √B compute per step ~ B (linear in batch size for memory-bound training) So: cost per "unit of noise reduction" is proportional to: (compute per step) × (noise² per step) ~ B × σ_∇² / B = σ_∇² → The cost of reaching a given training accuracy is approximately INDEPENDENT of B in the noise-limited regime. You can pick B based on hardware utilisation, not noise.

This is the critical batch size argument from McCandlish et al. 2018 (“An Empirical Model of Large-Batch Training,” OpenAI). Below the critical batch size, doubling B halves wall-clock to a given accuracy. Above it, bigger batches don’t help — the noise is already small enough that further reduction is wasted. The critical batch size is workload-dependent: ~256 for small classification, ~1M for large LM training, ~10M for very large frontier-model training.

Production practice: set B to the largest value that fits in memory (or that respects the critical batch size, whichever is smaller), tune learning rate accordingly (the “linear scaling rule”: double B, double η).

— think, then check —

Mini-batch gradient noise scales as σ/√B. To halve the noise: 1/√B should halve, so B should quadruple. New B = 256.

Cost: 4× the compute per step (since per-step compute is linear in B for memory-bound NN training). 4× compute for 2× noise reduction is the standard “√N” trade — diminishing returns.

Operational consequence: there’s a sweet spot (‘critical batch size,’ McCandlish 2018) above which further batch increases waste compute. Below it, batch increases linearly speed convergence. Production training picks B at or just below the critical batch size to maximise compute-to-progress efficiency. Linear scaling rule (Goyal et al. 2017): double B → double η to maintain convergence speed.

— think, then check —

In theory: the mini-batch gradient is an UNBIASED estimate of the full-batch gradient. E[∇R̂_B(θ)] = ∇R̂(θ) — the per-batch noise averages out across steps. So over many steps, SGD converges to the SAME parameter region as full-batch GD, just along a noisier trajectory.

In practice: the noise of SGD is often a feature, not a bug. Three empirical reasons:

  • Saddle escape. NN loss surfaces are non-convex with many saddle points. Full-batch GD gets stuck at saddles where the gradient is small. SGD’s noise kicks it out.
  • Implicit regularisation. SGD’s trajectory has a measurable bias toward “flatter” minima (those with lower curvature around them), which empirically generalise better.
  • Wall-clock compute. A full-batch step on N = 10⁹ examples costs 10⁹ × per-example compute. A mini-batch step costs 512 × per-example compute. Even if SGD takes 10× more steps to converge, it’s still 10⁵× faster in wall-clock to reach the same loss.

So SGD is not ‘approximate GD’ — it’s a different algorithm whose noisy trajectory has practical advantages over pure GD. Modern ML wouldn’t exist without the mini-batch trade; gradient noise is part of the system, not an obstacle to it.

END OF CH.8 §2 — Gradient descent & SGD.
Three recall items: easy (the update rule and what η controls), medium (the √B noise scaling and batch-size selection), hard (why mini-batch SGD is not ‘approximate’ GD but a different — and often better — algorithm).
Coming next: §8.3 — Momentum, RMSProp, Adam, AdamW. The modern optimiser family that makes large-model training tractable.