WHAT 'LEARNING' ACTUALLY IS
Section 8.3

Momentum, Adam, AdamW

SGD with a fixed learning rate is the simplest optimiser. It is also slow on ill-conditioned loss surfaces, which in practice means nearly every real neural-network training run. Three additions, layered one at a time, make modern training tractable: momentum (build velocity along consistent gradient directions), per-coordinate adaptive learning rates (RMSProp: scale each coordinate’s step by its historical gradient magnitude), and the combination of the two (Adam). Then one more refinement that mattered for transformers: AdamW decouples weight decay from the gradient-based update, fixing a subtle distortion in how plain Adam applies L2 regularisation. Most published LLM training recipes since GPT-3 report AdamW or a close variant. This section walks the lineage and runs SGD, momentum, RMSProp, and Adam head-to-head on a model loss surface.

Momentum — the heavy-ball method

The intuition: instead of moving in the direction of the current gradient, move in the direction of a moving average of recent gradients. This builds up velocity in directions where gradients consistently agree and damps oscillations in directions where they don’t.

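v_{t+1} = β · v_t + ∇L(θ_t)     ← exponentially weighted sum of past gradients
θ_{t+1} = θ_t − η · v_{t+1}

β ∈ [0, 1) is the momentum coefficient, typically 0.9. At β = 0: pure SGD. At β → 1: very heavy ball (most weight on past gradients). Because v is a decaying sum rather than a normalised average, a constant gradient g drives v toward g / (1 − β), so the effective step can grow to about 10 × η at β = 0.9.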

Momentum accelerates convergence on landscapes where the gradient direction is roughly stable (long ravines, smooth valleys). It also damps oscillations across narrow ravines — without momentum, SGD bounces between the two walls of the valley; with momentum, the bouncing cancels and the residual is the down-valley direction.

RMSProp — per-coordinate adaptive learning rate

A different fix for the same problem (ill-conditioned losses). Different coordinates can have wildly different gradient magnitudes, so a learning rate that suits one coordinate is too small or too large for others. RMSProp (Hinton, 2012; never formally published, but widely cited from his Coursera lecture slides) tracks per-coordinate gradient magnitudes:

s_{t+1} = β₂ · s_t + (1 − β₂) · g_t²     ← per-coord moving average of squared grads
θ_{t+1} = θ_t − η · g_t / (√s_{t+1} + ε)

(g_t is the current gradient; everything is per-coordinate.)

The result: coordinates with consistently large gradients get smaller effective steps; coordinates with small gradients get larger steps. This is adaptive per-coordinate scaling — every parameter has its own learning rate, automatically tuned.

Adam — combine both

Kingma & Ba, “Adam: A Method for Stochastic Optimization” (ICLR 2015). Add momentum on top of RMSProp:

m_t = β₁ · m_{t-1} + (1 − β₁) · g_t      ← first moment (momentum)
s_t = β₂ · s_{t-1} + (1 − β₂) · g_t²     ← second moment (RMSProp)

Bias correction (since m and s start at zero):
m̂_t = m_t / (1 − β₁ᵗ)
ŝ_t = s_t / (1 − β₂ᵗ)

θ_{t+1} = θ_t − η · m̂_t / (√ŝ_t + ε)

Typical defaults: β₁ = 0.9, β₂ = 0.999, ε = 10⁻⁸, η = 10⁻³

Adam combines momentum’s velocity-along-direction advantage with RMSProp’s per-coordinate scaling. The bias-correction terms (the divisions by 1 − β₁ᵗ and 1 − β₂ᵗ) compensate for the fact that m and s start at zero: without correction, both moment estimates would be biased toward zero early in training.

Adam’s per-coordinate adaptive learning rate is what makes it so robust to bad initial conditions. SGD with a single global learning rate has to balance “too small for some coordinates” against “too large for others”; Adam picks per-coordinate rates automatically based on observed gradient magnitudes.

[Interactive figure: SGD, SGD + momentum, RMSProp, and Adam on one loss surface, with the start point and optimum marked. At step 0 all four sit at the shared start, dist to opt 2.0396.]
Four optimisers on the same loss surface (a Rosenbrock-shaped curving valley). Step through them. SGD walks straight down the gradient, often crossing the valley repeatedly. Momentum builds velocity along the valley. RMSProp scales the step per-coordinate by historical gradient magnitudes. Adam combines both. Watch which one gets closest to the optimum at (1, 1).

Drag the learning rate; play through the optimisers. Watch SGD bounce, momentum streak down the valley, RMSProp adapt per-coordinate, Adam combine the two. Reset and try different learning rates — each optimiser has its own optimal lr.

AdamW — fix weight decay

The subtle problem with vanilla Adam: when you add L2 regularisation (weight decay) to the loss, the gradient of that L2 term gets pushed through Adam’s per-coordinate scaling — which makes the effective weight decay vary per coordinate, defeating its purpose.

Loshchilov & Hutter, “Decoupled Weight Decay Regularization” (ICLR 2019). AdamW’s fix: apply weight decay as a separate update step, not through the gradient.

Adam (broken weight decay):
  gradient_with_l2 = g_t + λ · θ_t     ← L2 grad mixed into gradient
  ... (Adam updates as above using gradient_with_l2)     ← per-coord scaling distorts λθ_t

AdamW (decoupled):
  Adam update using ONLY g_t:
    m_t = β₁ m_{t-1} + (1 − β₁) g_t     (gradient only, no L2 term)
    s_t = β₂ s_{t-1} + (1 − β₂) g_t²
    θ_{t+1} = θ_t − η · m̂_t / (√ŝ_t + ε)
  Then DECOUPLED weight decay:
    θ_{t+1} ← θ_{t+1} − η · λ · θ_t     ← fixed shrinkage, every coord

The difference is small algebraically but matters operationally: AdamW shrinks every coordinate by the same relative amount, η · λ per step, exactly as intended. Most major published LLM training recipes since GPT-3 (2020) use AdamW; vanilla Adam with L2 regularisation has largely fallen out of use for transformer training.

Now make it run

The kernel runs four optimisers on the same 2D loss (a curving Rosenbrock-shaped valley) with η = 0.01 for 2000 steps each. The results:

Four optimisers on (1−x)² + 8(y−x²)² from (-1.0, 1.4)
lr=0.01  β₁=0.9  β₂=0.999  steps=2000

optimiser              final x      final y      dist to (1,1)
SGD                    0.999834     0.999659     0.000379
SGD + momentum         1.000000     1.000000     0.000000
RMSProp                0.995098     0.989759     0.011354
Adam                   0.999999     0.999999     0.000001

Momentum hits the optimum essentially exactly. Adam is within 1e-6. RMSProp is slower (the missing momentum-style velocity hurts on this curving-valley landscape). SGD converges but takes many steps to overcome the ill-conditioning.

sgd.c (the four loops): four optimisers on the Rosenbrock-shaped valley

    /* SGD */
    {
        double x = x0, y = y0;
        for (int t = 0; t < STEPS; t++) {
            double gx, gy; grad(x, y, &gx, &gy);
            x -= lr * gx; y -= lr * gy;
        }
        printf("%-22s %-12.6f %-12.6f %-14.6f\n", "SGD", x, y, dist_to_opt(x, y));
    }

    /* SGD + momentum */
    {
        double x = x0, y = y0;
        double vx = 0, vy = 0;
        for (int t = 0; t < STEPS; t++) {
            double gx, gy; grad(x, y, &gx, &gy);
            vx = beta1 * vx + gx;
            vy = beta1 * vy + gy;
            x -= lr * vx; y -= lr * vy;
        }
        printf("%-22s %-12.6f %-12.6f %-14.6f\n", "SGD + momentum", x, y, dist_to_opt(x, y));
    }

    /* RMSProp */
    {
        double x = x0, y = y0;
        double sx = 0, sy = 0;
        for (int t = 0; t < STEPS; t++) {
            double gx, gy; grad(x, y, &gx, &gy);
            sx = beta2 * sx + (1 - beta2) * gx * gx;
            sy = beta2 * sy + (1 - beta2) * gy * gy;
            x -= lr * gx / (sqrt(sx) + eps);
            y -= lr * gy / (sqrt(sy) + eps);
        }
        printf("%-22s %-12.6f %-12.6f %-14.6f\n", "RMSProp", x, y, dist_to_opt(x, y));
    }

    /* Adam */
    {
        double x = x0, y = y0;
        double mx = 0, my = 0, sx = 0, sy = 0;
        for (int t = 1; t <= STEPS; t++) {
            double gx, gy; grad(x, y, &gx, &gy);
            mx = beta1 * mx + (1 - beta1) * gx;
            my = beta1 * my + (1 - beta1) * gy;
            sx = beta2 * sx + (1 - beta2) * gx * gx;
            sy = beta2 * sy + (1 - beta2) * gy * gy;
            double mhx = mx / (1 - pow(beta1, t));
            double mhy = my / (1 - pow(beta1, t));
            double shx = sx / (1 - pow(beta2, t));
            double shy = sy / (1 - pow(beta2, t));
            x -= lr * mhx / (sqrt(shx) + eps);
            y -= lr * mhy / (sqrt(shy) + eps);
        }
        printf("%-22s %-12.6f %-12.6f %-14.6f\n", "Adam", x, y, dist_to_opt(x, y));
    }

— think, then check —

v ← β · v + ∇L; θ ← θ − η · v.

β is the momentum coefficient — typically 0.9. It controls how much weight is given to past gradients vs the current one. β = 0 is pure SGD. β → 1 makes the update behave like a heavy ball with lots of inertia.

Operational effect: accelerates in directions where the gradient is consistent (the past and current gradients agree, so the velocity grows), damps oscillation in directions where the gradient flips back and forth (past and current cancel). On ill-conditioned losses (narrow valleys), momentum is the single biggest improvement over plain SGD — often 10× fewer steps to convergence.

— think, then check —

m_t = β₁ m_{t-1} + (1 − β₁) g_t ← first moment: momentum-style moving average of gradients

s_t = β₂ s_{t-1} + (1 − β₂) g_t² ← second moment: RMSProp-style moving average of squared gradients

m̂_t = m_t / (1 − β₁ᵗ) ← bias correction: m starts at 0, so early estimates are biased toward 0

ŝ_t = s_t / (1 − β₂ᵗ) ← bias correction for s

θ_{t+1} = θ_t − η · m̂_t / (√ŝ_t + ε) ← combined update: take a velocity step, scaled per-coordinate

The m part is momentum (accelerate consistent directions). The √s in the denominator is RMSProp (each coord gets its own effective learning rate based on its gradient magnitude). The combination is what makes Adam robust to bad initialisations and ill-conditioned losses — the default for almost all NN training between 2015 and the rise of AdamW.

The full optimiser timeline

Year   Optimiser                       Key idea
1951   SGD (Robbins-Monro)             Use a noisy gradient estimate
1964   Heavy ball (Polyak)             Add momentum
1983   Nesterov accelerated gradient   “Look-ahead” momentum variant
2011   AdaGrad                         Per-coord adaptive lr (Σ g² in denominator)
2012   RMSProp (Hinton)                Replace Σ g² with EMA of g²
2015   Adam (Kingma, Ba)               RMSProp + momentum + bias correction
2017   AdamW (Loshchilov, Hutter)      Decoupled weight decay
2023   Lion (Chen et al.)              Sign-based update, lower memory
2024   Muon                            Newton-Schulz preconditioner

Past Adam, the field has explored many variants — Lion (which uses only the sign of the gradient, reducing optimiser memory state by 2×) and Muon (using a Newton-Schulz preconditioner for the weight matrices). For LLM training as of late 2025, AdamW is still the universal default; the variants have specific niches (Lion for memory-constrained training, Muon for some matrix-structure-heavy workloads) but haven’t displaced AdamW from the mainstream.

— think, then check —

The problem: when you add λθ to the gradient (L2 regularisation), Adam’s per-coordinate scaling 1/(√s + ε) is applied to BOTH the data gradient g AND the weight-decay term λθ. The effective weight decay seen by the parameters is η · λθ / (√s + ε) — which varies per coordinate based on the gradient magnitude history (the √s in the denominator).

Operational consequence: coordinates with large historical gradients get LESS weight decay (because √s in denominator is large). Coordinates with small gradients get MORE weight decay. This is the OPPOSITE of what you want — you want uniform shrinkage, but you’re getting differential shrinkage based on whatever the gradients happened to look like during training.

AdamW’s fix: compute the Adam update using ONLY the data gradient g (not g + λθ). Then, after the Adam update, apply weight decay as a separate, decoupled step: θ ← θ − η · λ · θ. This shrinkage is uniform across coordinates — no per-coord scaling, just multiplicative decay.

The two updates (Adam and weight decay) are now decoupled: Adam handles the gradient signal, weight decay handles the regularisation, and neither interferes with the other.

Empirically, the difference is large. Loshchilov & Hutter report that AdamW with a tuned weight decay generalises noticeably better than Adam + L2 on image-classification benchmarks (CIFAR-10 and a downsampled ImageNet), and that decoupling makes the best λ much less dependent on the learning rate. For transformer LLM training, AdamW with λ ∈ [0.01, 0.1] is the usual choice: published recipes for LLaMA, Qwen, and DeepSeek all report it.

END OF CH.8 — What ‘learning’ actually is.
END OF PART II — Probability, Geometry & Learning.

§1 (loss functions, empirical risk) · §2 (gradient descent + SGD, the √B noise rule) · §3 (momentum, Adam, AdamW lineage).

Part II gave you the conceptual heart: probability with its √N law (Ch.5), the high-D geometry that makes high-dim ML weird and useful (Ch.6), the Johnson-Lindenstrauss lemma as constructive use of that geometry (Ch.7), and finally how all this comes together in the learning loop itself — loss + gradient descent + the modern optimiser family (Ch.8).

Coming next: Part III, Ch.9 — Backpropagation from scratch. Building on Ch.4 §3’s tape autograd, scaled to vector ops and the full backprop algorithm that every modern training run uses.