Gradient descent & SGD

Section 8.2

Gradient descent & SGD

§8.1 set up the goal: minimise empirical risk R̂(θ) = (1/N)Σ L(f_θ(xᵢ), yᵢ). This section is how. Gradient descent takes small steps in the direction −∇R̂(θ); stochastic gradient descent (SGD) uses one randomly-sampled mini-batch’s gradient per step instead of averaging across the whole dataset. The mini-batch trade is the entire reason modern ML works: averaging a gradient over N = 10⁹ examples per step is prohibitive, but each mini-batch’s gradient is an unbiased estimate of the full-batch gradient — and by the √N law (Ch.5 §1), the noise on each step scales as 1/√B for batch size B. So a mini-batch of size 512 gives you a gradient with noise ~σ/√512 ≈ σ/23 — usable for almost any optimiser.

Gradient descent

The simplest gradient method. Pick a learning rate η; iterate:

θ_{t+1} = θ_t − η · ∇R̂(θ_t) where ∇R̂(θ) = (1/N) Σ ∇L(f_θ(xᵢ), yᵢ) ← average gradient over ALL N training examples

Each step requires one full pass over the training set to compute the gradient. For modern NN training (N = 10⁹ tokens, model parameter gradients computed via Ch.4 §3’s backprop), that’s prohibitively expensive — a single gradient step would take longer than most jobs allow.

But the structure is right. Each step moves θ in the direction that locally decreases R̂ fastest. Pick η small enough and you converge to a local minimum (provably, for convex losses; empirically, for the highly non-convex NN losses).

Stochastic gradient descent

The trick: don’t average over all N examples each step. Sample B of them (the mini-batch) and compute the gradient on just those:

θ_{t+1} = θ_t − η · ∇R̂_B(θ_t) where ∇R̂_B(θ) = (1/B) Σ_{i ∈ batch} ∇L(f_θ(xᵢ), yᵢ) ← average over B examples only The batch is freshly sampled each step (e.g. by shuffling and going through the data in order).

This is stochastic gradient descent core algorithm Gradient descent using a single mini-batch's gradient per step instead of the full dataset gradient. The mini-batch gradient is an unbiased estimate of the full-batch gradient; the per-step noise scales as 1/√B. Trades step quality (noisier) for step throughput (B = 512 examples ≪ 10⁹). Foundation of every modern NN training pipeline. Then → now: Robbins & Monro 1951 originally formulated stochastic approximation (the theoretical foundation of SGD). The mini-batch version was implicit but became prominent in the early 2010s with the deep-learning boom. By 2015 every published training recipe specified batch size, learning rate, and number of steps as the three essential hyperparameters. The recipe is now so standardised it's almost invisible. — stochastic because the gradient is now a random variable (depends on which batch we sampled). Two key properties:

Unbiased. E[∇R̂_B(θ)] = ∇R̂(θ): the mini-batch gradient, in expectation across all possible random batches, equals the full-batch gradient. So SGD takes the right step in expectation; it just adds noise.
Noisy. By the √N law from Ch.5 §1, the standard deviation of ∇R̂_B is σ_∇ / √B where σ_∇ is the per-example gradient variance. Doubling B reduces gradient noise by √2; quadrupling B halves it.

The trade is: each SGD step is N/B times cheaper than a full-batch step, but noisier by a factor of √(N/B). For N = 10⁹ and B = 512, you’ve made each step ~2 million times cheaper at the cost of ~1400× more noise per step. The numbers look bad, but they work — because the noise averages out across steps. Train for 10⁵ SGD steps, and you’ve effectively averaged across 10⁵ × 512 = 5 × 10⁷ random examples, which is a low-noise estimate of the full-batch gradient direction integrated over the training trajectory.

SGD’s noise is also not always bad. Empirically, the noise helps escape saddle points and shallow local minima — a fully-converged gradient method on a non-convex landscape can get stuck where SGD doesn’t. The “implicit regularisation” of SGD is one of the modern explanations for why deeply over-parameterised networks generalise well despite VC bounds predicting overfitting (§8.1).

— think, then check —

θ_{t+1} = θ_t − η · ∇R̂(θ_t).

η controls how big a step we take in the direction of steepest descent. Too small: training takes forever (many tiny steps). Too large: each step overshoots and the loss diverges. Most production training schedules ramp η up at the start (warmup) and decay it later (cosine / step / linear decay), to balance fast initial progress with eventual fine convergence.

↳ §8.2 GD

Batch size = compute vs noise

The √N law tells you how to pick batch size:

gradient noise per step ~ σ_∇ / √B compute per step ~ B (linear in batch size for memory-bound training) So: cost per "unit of noise reduction" is proportional to: (compute per step) × (noise² per step) ~ B × σ_∇² / B = σ_∇² → The cost of reaching a given training accuracy is approximately INDEPENDENT of B in the noise-limited regime. You can pick B based on hardware utilisation, not noise.

This is the critical batch size performance result The largest batch size at which doubling B halves the wall-clock to a fixed training accuracy. Below it, SGD is noise-limited and bigger batches help linearly; above it, the noise floor is already low and bigger batches just waste compute. Determined empirically per workload — McCandlish et al. 2018 ('An Empirical Model of Large-Batch Training,' OpenAI) gives the formal framework. Typical values: 256 (small LM) to 10M+ (large LM training). argument from McCandlish et al. 2018 (“An Empirical Model of Large-Batch Training,” OpenAI). Below the critical batch size, doubling B halves wall-clock to a given accuracy. Above it, bigger batches don’t help — the noise is already small enough that further reduction is wasted. The critical batch size is workload-dependent: ~256 for small classification, ~1M for large LM training, ~10M for very large frontier-model training.

Production practice: set B to the largest value that fits in memory (or that respects the critical batch size, whichever is smaller), tune learning rate accordingly (the “linear scaling rule”: double B, double η).

— think, then check —

Mini-batch gradient noise scales as σ/√B. To halve the noise: 1/√B should halve, so B should quadruple. New B = 256.

Cost: 4× the compute per step (since per-step compute is linear in B for memory-bound NN training). 4× compute for 2× noise reduction is the standard “√N” trade — diminishing returns.

Operational consequence: there’s a sweet spot (‘critical batch size,’ McCandlish 2018) above which further batch increases waste compute. Below it, batch increases linearly speed convergence. Production training picks B at or just below the critical batch size to maximise compute-to-progress efficiency. Linear scaling rule (Goyal et al. 2017): double B → double η to maintain convergence speed.

↳ §8.2 + Ch.5 §1 √N law

— think, then check —

In theory: the mini-batch gradient is an UNBIASED estimate of the full-batch gradient. E[∇R̂_B(θ)] = ∇R̂(θ) — the per-batch noise averages out across steps. So over many steps, SGD converges to the SAME parameter region as full-batch GD, just along a noisier trajectory.

In practice: the noise of SGD is often a feature, not a bug. Three empirical reasons:

Saddle escape. NN loss surfaces are non-convex with many saddle points. Full-batch GD gets stuck at saddles where the gradient is small. SGD’s noise kicks it out.
Implicit regularisation. SGD’s trajectory has a measurable bias toward “flatter” minima (those with lower curvature around them), which empirically generalise better.
Wall-clock compute. A full-batch step on N = 10⁹ examples costs 10⁹ × per-example compute. A mini-batch step costs 512 × per-example compute. Even if SGD takes 10× more steps to converge, it’s still 10⁵× faster in wall-clock to reach the same loss.

So SGD is not ‘approximate GD’ — it’s a different algorithm whose noisy trajectory has practical advantages over pure GD. Modern ML wouldn’t exist without the mini-batch trade; gradient noise is part of the system, not an obstacle to it.

↳ §8.2 + §8.1 ERM

END OF CH.8 §2 — Gradient descent & SGD.
Three recall items: easy (the update rule and what η controls), medium (the √B noise scaling and batch-size selection), hard (why mini-batch SGD is not ‘approximate’ GD but a different — and often better — algorithm).
Coming next: §8.3 — Momentum, RMSProp, Adam, AdamW. The modern optimiser family that makes large-model training tractable.