Loss, gradient descent, SGD/Adam, regularization (then→now), overfitting. Modern terminology vs your 2007 version.
A loss function is the contract between you and the model: it says which mistakes count and how much. MSE is smooth and easy to optimise but sensitive to outliers; MAE is robust but non-differentiable at zero residual; Huber blends them, quadratic for small residuals and linear for large ones; cross-entropy is the standard loss for probabilistic classification. Empirical-risk minimisation is what 'training' literally is.
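To make the trade-offs concrete, here is a minimal sketch of the four losses in plain Python. The function names, the Huber threshold `delta=1.0`, the clipping constant `eps`, and the toy data are illustrative assumptions, not taken from the chapter.

```python
import math

def mse(y_true, y_pred):
    # Mean squared error: penalty grows quadratically with the residual.
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    # Mean absolute error: penalty grows linearly with the residual.
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def huber(y_true, y_pred, delta=1.0):
    # Quadratic for |residual| <= delta, linear beyond it (delta is an assumed default).
    total = 0.0
    for t, p in zip(y_true, y_pred):
        r = abs(t - p)
        total += 0.5 * r * r if r <= delta else delta * (r - 0.5 * delta)
    return total / len(y_true)

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    # Labels are 0 or 1; predictions are probabilities, clipped to avoid log(0).
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)
        total -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return total / len(y_true)

# Toy data (hypothetical): the second prediction set contains one large outlier.
y_true = [1.0, 2.0, 3.0, 4.0]
clean = [1.1, 1.9, 3.2, 3.8]
outlier = [1.1, 1.9, 3.2, 14.0]

for name, loss in [("MSE", mse), ("MAE", mae), ("Huber", huber)]:
    print(name, loss(y_true, clean), loss(y_true, outlier))
```

On the outlier set, the single bad prediction dominates MSE because its penalty is quadratic, while MAE and Huber penalise it only linearly.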
Gradient descent walks downhill on the loss surface; SGD does the same with one mini-batch's worth of gradient at a time, trading exact steps for cheap ones. The √N law from Ch.5 §1 tells you exactly how noisy each step is.
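The difference between the two is easiest to see on a toy problem. The sketch below fits a one-dimensional linear model under MSE; setting `batch_size` to the dataset size gives full-batch gradient descent, while a smaller value gives mini-batch SGD. The synthetic data (true slope 2, intercept 1), the learning rate, and all function names are assumptions chosen for illustration.

```python
import random

def make_data(n=200, seed=0):
    # Synthetic data (hypothetical): y = 2x + 1 plus small Gaussian noise.
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [2.0 * x + 1.0 + rng.gauss(0.0, 0.1) for x in xs]
    return xs, ys

def mse_gradient(w, b, xs, ys):
    # Gradient of mean((w*x + b - y)^2) with respect to w and b.
    n = len(xs)
    gw = sum(2.0 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    gb = sum(2.0 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    return gw, gb

def fit(xs, ys, batch_size, lr=0.1, epochs=50, seed=0):
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    idx = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(idx)  # new sample order every epoch
        for start in range(0, len(idx), batch_size):
            batch = idx[start:start + batch_size]
            bx = [xs[i] for i in batch]
            by = [ys[i] for i in batch]
            gw, gb = mse_gradient(w, b, bx, by)  # noisy estimate if batch < dataset
            w -= lr * gw
            b -= lr * gb
    return w, b

xs, ys = make_data()
w_sgd, b_sgd = fit(xs, ys, batch_size=16)                  # many cheap, noisy steps
w_gd, b_gd = fit(xs, ys, batch_size=len(xs), epochs=500)   # one exact step per epoch
print("SGD:", w_sgd, b_sgd)
print("GD: ", w_gd, b_gd)
```

Both runs should land near the generating parameters; the mini-batch run takes many more steps per pass over the data, each computed from only 16 examples.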
SGD with momentum builds velocity along consistent gradient directions. RMSProp scales each coordinate's step by a running average of its squared gradient magnitudes. Adam combines both. AdamW decouples weight decay from the gradient-based update, applying it directly to the weights, and has been the default optimiser for most large LLM training runs since around 2020.
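As a rough sketch of how these pieces fit together, the function below implements one Adam step in plain Python, with a flag that switches between classic L2 regularisation (decay added to the gradient) and AdamW-style decoupled decay (decay applied directly to the weights). The names `init_state`, `adam_step`, and `minimise`, and the toy objective (x - 3)^2, are hypothetical; the hyperparameter defaults are the commonly quoted ones, not values from the chapter.

```python
import math

def init_state(n):
    # Per-parameter first moment (m), second moment (v), and a step counter (t).
    return {"t": 0, "m": [0.0] * n, "v": [0.0] * n}

def adam_step(params, grads, state, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, weight_decay=0.0, decoupled=True):
    state["t"] += 1
    t = state["t"]
    new_params = []
    for i, (p_old, g) in enumerate(zip(params, grads)):
        if weight_decay and not decoupled:
            g = g + weight_decay * p_old  # classic L2: decay passes through the moments
        # Momentum-like running mean of gradients.
        state["m"][i] = beta1 * state["m"][i] + (1 - beta1) * g
        # RMSProp-like running mean of squared gradients.
        state["v"][i] = beta2 * state["v"][i] + (1 - beta2) * g * g
        # Bias correction for the zero-initialised moments.
        m_hat = state["m"][i] / (1 - beta1 ** t)
        v_hat = state["v"][i] / (1 - beta2 ** t)
        p = p_old - lr * m_hat / (math.sqrt(v_hat) + eps)
        if weight_decay and decoupled:
            p -= lr * weight_decay * p_old  # AdamW: decay bypasses the adaptive scaling
        new_params.append(p)
    return new_params

def minimise(steps=500, **kwargs):
    # Toy objective (hypothetical): f(x) = (x - 3)^2, gradient 2 * (x - 3).
    params, state = [0.0], init_state(1)
    for _ in range(steps):
        grads = [2.0 * (params[0] - 3.0)]
        params = adam_step(params, grads, state, lr=0.1, **kwargs)
    return params[0]

x_adam = minimise()
x_adamw = minimise(weight_decay=0.1, decoupled=True)
print("Adam: ", x_adam)
print("AdamW:", x_adamw)
```

The practical difference shows up when the loss gradient is zero: with the coupled form, the decay term is rescaled by the adaptive denominator like any other gradient, whereas the decoupled form shrinks the weight by exactly `lr * weight_decay` times its value.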