What 'learning' actually is

§1 Loss functions & empirical risk
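A loss function is the contract between you and the model: it says which mistakes count and how much. MSE is smooth and easy to optimise but sensitive to outliers, because it squares every residual; MAE is robust but non-differentiable at zero residual; Huber blends them, quadratic for small residuals and linear for large ones; cross-entropy is the standard loss for probabilistic classification, since it is the negative log-likelihood of the labels. Empirical-risk minimisation, meaning minimising the average loss over the training set, is what 'training' literally is.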
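As a rough illustration of the four losses named above, here is a minimal pure-Python sketch. The function names, the Huber threshold delta=1.0, the clipping constant eps and the toy data are all assumptions made for this example, not definitions taken from the chapter.

```python
import math

def mse(y_true, y_pred):
    """Mean squared error: residuals are squared, so large ones dominate."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    """Mean absolute error: grows only linearly with each residual."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def huber(y_true, y_pred, delta=1.0):
    """Huber loss: quadratic for |residual| <= delta, linear beyond it."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        r = abs(t - p)
        total += 0.5 * r * r if r <= delta else delta * (r - 0.5 * delta)
    return total / len(y_true)

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Negative log-likelihood of 0/1 labels under predicted probabilities."""
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(y_true)

# Toy data (illustrative): identical predictions except for one outlier.
y_true = [1.0, 2.0, 3.0, 4.0]
y_clean = [1.5, 2.5, 3.5, 4.5]     # every residual is 0.5
y_outlier = [1.5, 2.5, 3.5, 14.0]  # last residual is 10

for name, loss in [("MSE", mse), ("MAE", mae), ("Huber", huber)]:
    print(name, loss(y_true, y_clean), loss(y_true, y_outlier))

# A maximally unsure classifier (p = 0.5) pays log(2) per example.
print("BCE", binary_cross_entropy([1, 0], [0.5, 0.5]))
```

On this toy data the single outlier multiplies MSE by roughly a hundred, while MAE and Huber grow by far less. That is the sense in which the choice of loss decides which mistakes count.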
§2 Gradient descent & SGD
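Gradient descent walks downhill on the loss surface by stepping against the full-dataset gradient; SGD does the same with one mini-batch's worth of gradient at a time, trading exact steps for cheap ones. The √N law from Ch.5 §1 tells you how noisy each step is: averaging over a mini-batch of B examples shrinks the gradient noise roughly as 1/√B.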
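The contrast between full-batch and mini-batch steps can be sketched on a one-parameter regression problem. Everything here is an assumption for illustration: the synthetic data, the learning rates, the batch sizes and the helper names. The noise measurement also assumes the √N law is the usual statement that the standard error of an average shrinks as 1/√N.

```python
import random
import statistics

random.seed(0)

# Synthetic 1-D data (illustrative): y = 3x + small Gaussian noise.
xs = [random.uniform(-1.0, 1.0) for _ in range(512)]
ys = [3.0 * x + random.gauss(0.0, 0.1) for x in xs]

def grad(w, idx):
    """Gradient of the mean squared error of y ~ w*x over the examples in idx."""
    return sum(2.0 * (w * xs[i] - ys[i]) * xs[i] for i in idx) / len(idx)

def gradient_descent(w=0.0, lr=0.5, steps=100):
    """Full-batch descent: every step uses the exact training-set gradient."""
    all_idx = range(len(xs))
    for _ in range(steps):
        w -= lr * grad(w, all_idx)
    return w

def sgd(w=0.0, lr=0.1, epochs=20, batch_size=32):
    """Mini-batch SGD: each step uses a cheap, noisy gradient estimate."""
    idx = list(range(len(xs)))
    for _ in range(epochs):
        random.shuffle(idx)
        for start in range(0, len(idx), batch_size):
            w -= lr * grad(w, idx[start:start + batch_size])
    return w

def minibatch_grad_std(w, batch_size, trials=2000):
    """Spread of the mini-batch gradient at a fixed w (sampled with replacement)."""
    grads = [grad(w, random.choices(range(len(xs)), k=batch_size))
             for _ in range(trials)]
    return statistics.pstdev(grads)

w_gd = gradient_descent()
w_sgd = sgd()
print("GD  estimate of w:", w_gd)
print("SGD estimate of w:", w_sgd)

# A 16x larger batch should give roughly 4x less gradient noise.
noise_ratio = minibatch_grad_std(0.0, 4) / minibatch_grad_std(0.0, 64)
print("noise(B=4) / noise(B=64):", noise_ratio)
```

Both runs should land close to the true slope of 3. The SGD run gets there with gradients that each touch only 32 of the 512 examples.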
§3 Momentum, Adam, AdamW
SGD with momentum builds velocity along consistent gradient directions. RMSProp divides each coordinate's step by a running root-mean-square of its past gradients. Adam combines both, with bias correction for the early steps. AdamW decouples weight decay from the adaptive gradient update, and has been the default optimiser for most major LLM training runs since around 2020.
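The update rules described above can be written for a single scalar parameter as below. This is a sketch under stated assumptions, not any library's API: the function names, the state dictionaries and the `decoupled` flag are hypothetical, the hyperparameter defaults are common choices, and the momentum form (velocity = beta * velocity + gradient) is one of several conventions in use.

```python
import math

def sgd_momentum_step(w, g, state, lr=0.01, beta=0.9):
    """Velocity accumulates gradients that keep pointing the same way."""
    state["v"] = beta * state.get("v", 0.0) + g
    return w - lr * state["v"]

def rmsprop_step(w, g, state, lr=0.01, rho=0.9, eps=1e-8):
    """Scale the step by a running average of squared gradients."""
    state["s"] = rho * state.get("s", 0.0) + (1.0 - rho) * g * g
    return w - lr * g / (math.sqrt(state["s"]) + eps)

def adam_step(w, g, state, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8,
              weight_decay=0.0, decoupled=False):
    """Adam; with decoupled=True the weight decay is applied AdamW-style."""
    if weight_decay and not decoupled:
        # Classic L2 penalty: decay is folded into the gradient, so it is
        # rescaled by the adaptive denominator like everything else.
        g = g + weight_decay * w
    state["t"] = state.get("t", 0) + 1
    state["m"] = beta1 * state.get("m", 0.0) + (1.0 - beta1) * g
    state["v"] = beta2 * state.get("v", 0.0) + (1.0 - beta2) * g * g
    m_hat = state["m"] / (1.0 - beta1 ** state["t"])  # bias correction
    v_hat = state["v"] / (1.0 - beta2 ** state["t"])
    w_new = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    if weight_decay and decoupled:
        # AdamW: shrink the weight directly, outside the adaptive scaling.
        w_new -= lr * weight_decay * w
    return w_new

# Two momentum steps with a constant gradient of 1: the second step is larger.
mom_state = {}
w_mom = sgd_momentum_step(0.0, 1.0, mom_state, lr=0.1)
w_mom = sgd_momentum_step(w_mom, 1.0, mom_state, lr=0.1)

# One step from w = 1.0 with gradient 4.0 under three Adam variants.
w_adam = adam_step(1.0, 4.0, {})
w_adam_l2 = adam_step(1.0, 4.0, {}, weight_decay=0.1)
w_adamw = adam_step(1.0, 4.0, {}, weight_decay=0.1, decoupled=True)
print(w_mom, w_adam, w_adam_l2, w_adamw)
```

On the first step, Adam with an L2 penalty moves the weight by about the same amount as plain Adam, because the penalty is normalised away along with the gradient. The decoupled variant shrinks the weight by an extra lr * weight_decay * w. That difference is what "decoupling" refers to.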