Momentum, Adam, AdamW

Section 8.3

SGD with a fixed learning rate is the simplest optimiser. It is also slow on ill-conditioned loss surfaces, which in practice means almost every real neural-network training run. Three additions, layered one at a time, make modern training tractable: momentum (build velocity along consistent gradient directions), per-coordinate adaptive learning rates (RMSProp: scale each coordinate’s step by its historical gradient magnitude), and the combination of the two (Adam). Then one more refinement that mattered for transformers: AdamW decouples weight decay from the gradient-based update, so the decay is no longer distorted by Adam’s per-coordinate scaling. Most published LLM training recipes since GPT-3 report AdamW or an equivalent decoupled weight decay. This section walks the lineage and runs the first four optimisers head-to-head on a model loss surface.

Momentum — the heavy-ball method

The intuition: instead of moving in the direction of the current gradient, move in the direction of a moving average of recent gradients. This builds up velocity in directions where gradients consistently agree and damps oscillations in directions where they don’t.

v_{t+1} = β · v_t + ∇L(θ_t)      ← exponential moving average of gradients (unnormalised)
θ_{t+1} = θ_t − η · v_{t+1}

β ∈ [0, 1) is the momentum coefficient, typically 0.9. At β = 0: pure SGD. At β → 1: very heavy ball (most weight on past gradients).

Momentum optimiser: an SGD variant that takes the gradient step in the direction of an exponentially weighted moving average of recent gradients (the ‘velocity’). Update: v ← β v + g; θ ← θ − η v. The β term (typically 0.9) accelerates convergence in directions of consistent gradient and damps oscillation in directions of conflicting gradient.

Then → now: Polyak’s ‘heavy ball’ method (1964) is the original momentum formulation; Robbins & Monro (1951) had earlier supplied the stochastic-approximation framework that SGD rests on. Sutskever et al. 2013 (‘On the importance of initialization and momentum in deep learning’) brought momentum, together with Nesterov’s variant, back to the attention of the deep-learning community. β = 0.9 has been the common default since.

Momentum accelerates convergence on landscapes where the gradient direction is roughly stable (long ravines, smooth valleys). It also damps oscillations across narrow ravines: without momentum, SGD bounces between the two walls of the valley; with momentum, the bouncing cancels and the residual is the down-valley direction.
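The ravine picture can be checked on a toy problem. The sketch below is an illustration only: the quadratic valley, the starting point (1, 1), and the names `valley_grad` and `steps_to_converge` are assumptions made for this example, not part of sgd.c. It counts how many updates the momentum rule needs to get within a tolerance of the minimum; setting β = 0 gives plain gradient descent for comparison.

```c
#include <assert.h>

/* Hypothetical test problem: f(x, y) = 0.5 * (x*x + 50*y*y), minimum at (0, 0).
 * The y direction is 50 times steeper than x, i.e. a narrow valley. */
static void valley_grad(double x, double y, double *gx, double *gy) {
    *gx = x;
    *gy = 50.0 * y;
}

/* Runs v <- beta*v + g; theta <- theta - lr*v starting from (1, 1).
 * Returns the number of steps taken before the distance to the minimum
 * first drops below tol, or max_steps if that never happens.
 * beta = 0 reduces to plain gradient descent. */
int steps_to_converge(double lr, double beta, double tol, int max_steps) {
    assert(beta >= 0.0 && beta < 1.0);
    double x = 1.0, y = 1.0, vx = 0.0, vy = 0.0;
    for (int t = 0; t < max_steps; t++) {
        /* Compare squared distances to avoid a square root. */
        if (x * x + y * y < tol * tol) return t;
        double gx, gy;
        valley_grad(x, y, &gx, &gy);
        vx = beta * vx + gx;
        vy = beta * vy + gy;
        x -= lr * vx;
        y -= lr * vy;
    }
    return max_steps;
}
```

Calling `steps_to_converge(0.02, 0.0, 1e-6, 100000)` and `steps_to_converge(0.02, 0.9, 1e-6, 100000)` compares the two at the same learning rate. On this particular valley the momentum run should need well under half as many steps, though the ratio depends on η, β and the conditioning of the problem.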

RMSProp — per-coordinate adaptive learning rate

A different fix for the same problem (ill-conditioned losses). Different coordinates can have wildly different gradient magnitudes, and a learning rate that works for one coordinate is too small or too large for others. RMSProp (Root Mean Square Propagation; Hinton 2012, introduced in a Coursera lecture and never formally published, but adopted immediately and widely cited) tracks an exponential moving average of squared gradients per coordinate and scales each gradient by 1/√(that average):

s_{t+1} = β₂ · s_t + (1 − β₂) · g_t²      ← per-coord moving average of squared grads
θ_{t+1} = θ_t − η · g_t / (√s_{t+1} + ε)

(g_t is the current gradient; everything is per-coordinate.)

The result: coordinates with consistently large gradients get smaller effective steps; coordinates with small gradients get larger steps. This is adaptive per-coordinate scaling — every parameter has its own learning rate, automatically tuned.

Adam — combine both

Kingma & Ba, “Adam: A Method for Stochastic Optimization” (ICLR 2015). Add momentum on top of RMSProp:

m_t = β₁ · m_{t-1} + (1 − β₁) · g_t      ← first moment (momentum)
s_t = β₂ · s_{t-1} + (1 − β₂) · g_t²     ← second moment (RMSProp)

Bias correction (since m and s start at zero):
m̂_t = m_t / (1 − β₁ᵗ)
ŝ_t = s_t / (1 − β₂ᵗ)

θ_{t+1} = θ_t − η · m̂_t / (√ŝ_t + ε)

Typical defaults: β₁ = 0.9, β₂ = 0.999, ε = 10⁻⁸, η = 10⁻³

Adam optimiser: Adaptive Moment estimation (Kingma & Ba 2015). It combines momentum (a first-moment exponential moving average of gradients) with per-coordinate scaling (a second-moment moving average of squared gradients), plus bias-correction terms. It was the default optimiser for most ML training until AdamW (Loshchilov & Hutter 2017) supplanted it for transformer training. The defaults (β₁ = 0.9, β₂ = 0.999, η = 10⁻³) work surprisingly well across many tasks.

Then → now: Kingma & Ba’s 2015 paper has 100K+ citations, and Adam quickly became the default for most deep-learning workloads. AdamW is the transformer-era refinement.

Adam combines momentum’s velocity-along-direction advantage with RMSProp’s per-coordinate scaling. The bias-correction terms (the divisions by 1 − β₁ᵗ and 1 − β₂ᵗ) compensate for the fact that m and s start at zero: early in training both would be biased toward zero without correction.

Adam’s per-coordinate adaptive learning rate is what makes it so robust to bad initial conditions. SGD with a single global learning rate has to balance “too small for some coordinates” against “too large for others”; Adam picks per-coordinate rates automatically based on observed gradient magnitudes.

[Interactive figure: SGD, SGD + momentum, RMSProp and Adam on the same loss surface. Controls: learning rate η (0.0010), β₁ (momentum, 0.90), β₂ (RMSProp/Adam, 0.999). At step 0 all four optimisers are at distance 2.0396 from the optimum.]

Four optimisers on the same loss surface (a Rosenbrock-shaped curving valley). Step through them. SGD walks straight down the gradient, often crossing the valley repeatedly. Momentum builds velocity along the valley. RMSProp scales the step per-coordinate by historical gradient magnitudes. Adam combines both. Watch which one gets closest to the optimum at (1, 1).

Drag the learning rate; play through the optimisers. Watch SGD bounce, momentum streak down the valley, RMSProp adapt per-coordinate, Adam combine the two. Reset and try different learning rates — each optimiser has its own optimal lr.

AdamW — fix weight decay

The subtle problem with vanilla Adam: when you add L2 regularisation (weight decay) to the loss, the gradient of that L2 term gets pushed through Adam’s per-coordinate scaling — which makes the effective weight decay vary per coordinate, defeating its purpose.

Loshchilov & Hutter, “Decoupled Weight Decay Regularization” (ICLR 2019). AdamW’s fix: apply weight decay as a separate update step, not through the gradient.

Adam (broken weight decay):
  gradient_with_l2 = g_t + λ · θ_t                      ← L2 grad mixed into gradient
  ... (Adam updates as above using gradient_with_l2)    ← per-coord scaling distorts λθ_t

AdamW (decoupled):
  Adam update using ONLY g_t:
    m_t = β₁ m_{t-1} + (1 − β₁) g_t                     (gradient only, no L2 term)
    s_t = β₂ s_{t-1} + (1 − β₂) g_t²
    θ_{t+1} = θ_t − η · m̂_t / (√ŝ_t + ε)
  Then DECOUPLED weight decay:
    θ_{t+1} ← θ_{t+1} − η · λ · θ_t                     ← fixed shrinkage, every coord

The difference is small algebraically but matters operationally: AdamW shrinks every coordinate by the same relative amount, exactly as intended. Most published LLM training recipes since GPT-3 (2020) report AdamW; Adam with L2 regularisation is now rarely used for transformer training.

Now make it run

The kernel runs four optimisers on the same 2D loss (a curving Rosenbrock-shaped valley) with η = 0.01 for 2000 steps each. The results:

Four optimisers on (1−x)² + 8(y−x²)² from (-1.0, 1.4)
lr=0.01  β₁=0.9  β₂=0.999  steps=2000

optimiser              final x      final y      dist to (1,1)
SGD                    0.999834     0.999659     0.000379
SGD + momentum         1.000000     1.000000     0.000000
RMSProp                0.995098     0.989759     0.011354
Adam                   0.999999     0.999999     0.000001

Momentum hits the optimum essentially exactly. Adam is within 1e-6. RMSProp is slower (the missing momentum-style velocity hurts on this curving-valley landscape). SGD converges but takes many steps to overcome the ill-conditioning.

sgd.c (the four loops), in C: four optimisers on the Rosenbrock-shaped valley

    /* SGD */
    {
        double x = x0, y = y0;
        for (int t = 0; t < STEPS; t++) {
            double gx, gy; grad(x, y, &gx, &gy);
            x -= lr * gx; y -= lr * gy;
        }
        printf("%-22s %-12.6f %-12.6f %-14.6f\n", "SGD", x, y, dist_to_opt(x, y));
    }

    /* SGD + momentum */
    {
        double x = x0, y = y0;
        double vx = 0, vy = 0;
        for (int t = 0; t < STEPS; t++) {
            double gx, gy; grad(x, y, &gx, &gy);
            vx = beta1 * vx + gx;
            vy = beta1 * vy + gy;
            x -= lr * vx; y -= lr * vy;
        }
        printf("%-22s %-12.6f %-12.6f %-14.6f\n", "SGD + momentum", x, y, dist_to_opt(x, y));
    }

    /* RMSProp */
    {
        double x = x0, y = y0;
        double sx = 0, sy = 0;
        for (int t = 0; t < STEPS; t++) {
            double gx, gy; grad(x, y, &gx, &gy);
            sx = beta2 * sx + (1 - beta2) * gx * gx;
            sy = beta2 * sy + (1 - beta2) * gy * gy;
            x -= lr * gx / (sqrt(sx) + eps);
            y -= lr * gy / (sqrt(sy) + eps);
        }
        printf("%-22s %-12.6f %-12.6f %-14.6f\n", "RMSProp", x, y, dist_to_opt(x, y));
    }

    /* Adam */
    {
        double x = x0, y = y0;
        double mx = 0, my = 0, sx = 0, sy = 0;
        for (int t = 1; t <= STEPS; t++) {
            double gx, gy; grad(x, y, &gx, &gy);
            mx = beta1 * mx + (1 - beta1) * gx;
            my = beta1 * my + (1 - beta1) * gy;
            sx = beta2 * sx + (1 - beta2) * gx * gx;
            sy = beta2 * sy + (1 - beta2) * gy * gy;
            double mhx = mx / (1 - pow(beta1, t));
            double mhy = my / (1 - pow(beta1, t));
            double shx = sx / (1 - pow(beta2, t));
            double shy = sy / (1 - pow(beta2, t));
            x -= lr * mhx / (sqrt(shx) + eps);
            y -= lr * mhy / (sqrt(shy) + eps);
        }
        printf("%-22s %-12.6f %-12.6f %-14.6f\n", "Adam", x, y, dist_to_opt(x, y));
    }

— think, then check —

v ← β · v + ∇L; θ ← θ − η · v.

β is the momentum coefficient — typically 0.9. It controls how much weight is given to past gradients vs the current one. β = 0 is pure SGD. β → 1 makes the update behave like a heavy ball with lots of inertia.

Operational effect: accelerates in directions where the gradient is consistent (the past and current gradients agree, so the velocity grows), damps oscillation in directions where the gradient flips back and forth (past and current cancel). On ill-conditioned losses (narrow valleys), momentum is often the single biggest improvement over plain SGD: on a quadratic with condition number κ, well-tuned momentum cuts the number of steps to convergence from order κ to order √κ, which is a 10× saving at κ = 100.
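Both effects fall straight out of the recurrence v ← β v + g. With a constant gradient g the velocity settles at g / (1 − β); with a gradient that flips sign every step its magnitude settles at g / (1 + β). The sketch below iterates the recurrence to show this; the function names are hypothetical and exist only for this illustration.

```c
#include <assert.h>

/* Applies v <- beta*v + g for n steps with a constant gradient g,
 * starting from v = 0, and returns the final velocity. */
double velocity_constant(double g, double beta, int n) {
    assert(beta >= 0.0 && beta < 1.0);
    double v = 0.0;
    for (int t = 0; t < n; t++) {
        v = beta * v + g;
    }
    return v;
}

/* Same recurrence, but the gradient flips sign after every step. */
double velocity_alternating(double g, double beta, int n) {
    assert(beta >= 0.0 && beta < 1.0);
    double v = 0.0;
    for (int t = 0; t < n; t++) {
        v = beta * v + g;
        g = -g;  /* the next gradient points the other way */
    }
    return v;
}
```

With β = 0.9 and g = 1, 200 steps are plenty for both to settle. The constant case gives a velocity of about 10, a tenfold amplification of the gradient. The alternating case gives a magnitude of about 0.53, roughly half the raw gradient.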

↳ §8.3 momentum

— think, then check —

m_t = β₁ m_{t-1} + (1 − β₁) g_t ← first moment: momentum-style moving average of gradients

s_t = β₂ s_{t-1} + (1 − β₂) g_t² ← second moment: RMSProp-style moving average of squared gradients

m̂_t = m_t / (1 − β₁ᵗ) ← bias correction: m starts at 0, so early estimates are biased toward 0

ŝ_t = s_t / (1 − β₂ᵗ) ← bias correction for s

θ_{t+1} = θ_t − η · m̂_t / (√ŝ_t + ε) ← combined update: take a velocity step, scaled per-coordinate

The m part is momentum (accelerate consistent directions). The √s in the denominator is RMSProp (each coord gets its own effective learning rate based on its gradient magnitude). The combination is what makes Adam robust to bad initialisations and ill-conditioned losses — the default for almost all NN training between 2015 and the rise of AdamW.

↳ §8.3 Adam

The full optimiser timeline

Year	Optimiser	Key idea
1951	SGD (Robbins-Monro)	Use a noisy gradient estimate
1964	Heavy ball (Polyak)	Add momentum
1983	Nesterov accelerated gradient	“Look-ahead” momentum variant
2011	AdaGrad	Per-coord adaptive lr (Σ g² in denominator)
2012	RMSProp (Hinton)	Replace Σ g² with EMA of g²
2015	Adam (Kingma, Ba)	RMSProp + momentum + bias correction
2017	AdamW (Loshchilov, Hutter)	Decoupled weight decay
2023	Lion (Chen et al.)	Sign-based update, lower memory
2024	Muon	Newton-Schulz preconditioner

Past Adam, the field has explored many variants, including Lion (which applies only the sign of a momentum-smoothed gradient and keeps a single moment buffer, halving optimiser state relative to Adam) and Muon (which uses a Newton-Schulz iteration to orthogonalise the update for weight matrices). For LLM training as of late 2025, AdamW is still the default choice; the variants have specific niches (Lion for memory-constrained training, Muon for some matrix-structure-heavy workloads) but haven’t displaced AdamW from the mainstream.

— think, then check —

The problem: when you add λθ to the gradient (L2 regularisation), Adam’s per-coordinate scaling 1/(√s + ε) is applied to BOTH the data gradient g AND the weight-decay term λθ. The effective weight decay seen by the parameters is η · λθ / (√s + ε) — which varies per coordinate based on the gradient magnitude history (the √s in the denominator).

Operational consequence: coordinates with large historical gradients get LESS weight decay (because √s in denominator is large). Coordinates with small gradients get MORE weight decay. This is the OPPOSITE of what you want — you want uniform shrinkage, but you’re getting differential shrinkage based on whatever the gradients happened to look like during training.

AdamW’s fix: compute the Adam update using ONLY the data gradient g (not g + λθ). Then, after the Adam update, apply weight decay as a separate, decoupled step: θ ← θ − η · λ · θ. This shrinkage is uniform across coordinates — no per-coord scaling, just multiplicative decay.

The two updates (Adam and weight decay) are now decoupled: Adam handles the gradient signal, weight decay handles the regularisation, and neither interferes with the other.

Empirically, the difference can be substantial: Loshchilov & Hutter report that AdamW with a tuned weight decay generalises better than Adam + L2 at the same nominal λ on image-classification benchmarks (CIFAR-10 and a downsampled ImageNet), and that decoupling makes the best λ less dependent on the learning rate. For transformer LLM training, AdamW with λ ∈ [0.01, 0.1] is the standard recipe; published training setups for models such as LLaMA, Qwen and DeepSeek report it.

↳ §8.3 AdamW

END OF CH.8 — What ‘learning’ actually is.
END OF PART II — Probability, Geometry & Learning.

§1 (loss functions, empirical risk) · §2 (gradient descent + SGD, the √B noise rule) · §3 (momentum, Adam, AdamW lineage).

Part II gave you the conceptual heart: probability with its √N law (Ch.5), the high-D geometry that makes high-dim ML weird and useful (Ch.6), the Johnson-Lindenstrauss lemma as constructive use of that geometry (Ch.7), and finally how all this comes together in the learning loop itself — loss + gradient descent + the modern optimiser family (Ch.8).

Coming next: Part III, Ch.9 — Backpropagation from scratch. Building on Ch.4 §3’s tape autograd, scaled to vector ops and the full backprop algorithm that every modern training run uses.