Neurons, layers, MLPs

§1 From perceptron to MLP
A perceptron is a single linear classifier. An MLP stacks that same building block with a nonlinearity between layers; without the nonlinearity, the stack collapses back into one linear map. The structural unlock, universal approximation, follows from stacking linear layers with nonlinearities. The engineering choices (width vs. depth, parameter count vs. computational depth) are what distinguish a 1990s MLP from a 2025 transformer.
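To make the stacking concrete, here is a minimal NumPy sketch of a perceptron and an MLP forward pass. The layer sizes, the initialization scale, and the function names (perceptron, mlp_forward) are illustrative assumptions, not the chapter's own code. It also checks numerically that two linear layers with no nonlinearity between them equal a single linear layer.

```python
import numpy as np

def perceptron(x, w, b):
    """One linear classifier: threshold a weighted sum."""
    return (x @ w + b > 0).astype(int)

def mlp_forward(x, params):
    """Linear layers with ReLU between them; the last layer stays linear."""
    h = x
    for W, b in params[:-1]:
        h = np.maximum(0.0, h @ W + b)
    W, b = params[-1]
    return h @ W + b

rng = np.random.default_rng(0)
sizes = [2, 16, 16, 1]  # assumed widths, chosen only for illustration
params = [
    (rng.normal(0.0, 1.0 / np.sqrt(m), (m, n)), np.zeros(n))
    for m, n in zip(sizes[:-1], sizes[1:])
]

x = rng.normal(size=(5, 2))
labels = perceptron(x, rng.normal(size=2), 0.0)
logits = mlp_forward(x, params)
n_params = sum(W.size + b.size for W, b in params)

# Without a nonlinearity, stacking adds nothing: (x W1) W2 == x (W1 W2).
W1, W2 = params[0][0], params[1][0]
collapses = np.allclose((x @ W1) @ W2, x @ (W1 @ W2))
```

Changing sizes trades width against depth: a wider list entry adds parameters to a layer, while a longer list adds sequential computation steps.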
§2 Activation functions — ReLU, GELU, SiLU, SwiGLU
The choice of nonlinearity between linear layers was once a sleepy implementation detail. After 2012, it became one of the most consequential architecture decisions in deep learning. The lineage from sigmoid (vanishing gradients) through ReLU (the 2012 breakthrough) to GELU/SiLU/SwiGLU (the modern transformer family) is short, and each step was driven by an empirical observation about gradient flow.
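For reference, the four activations named in this section can be sketched in NumPy as follows. The GELU here uses the common tanh approximation rather than the exact Gaussian CDF, and the weight names in swiglu (W_gate, W_up, W_down) are hypothetical labels chosen for this example.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def gelu(x):
    # tanh approximation of x * Phi(x), where Phi is the standard normal CDF
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def silu(x):
    # x * sigmoid(x), with sigmoid written as 0.5 * (1 + tanh(x / 2))
    return x * 0.5 * (1.0 + np.tanh(0.5 * x))

def swiglu(x, W_gate, W_up, W_down):
    # Gated feed-forward block: a SiLU-activated gate multiplies a second
    # linear projection elementwise, then the result is projected back down.
    return (silu(x @ W_gate) * (x @ W_up)) @ W_down

rng = np.random.default_rng(0)
d_model, d_hidden = 8, 16  # assumed sizes for illustration
x = rng.normal(size=(4, d_model))
out = swiglu(
    x,
    rng.normal(size=(d_model, d_hidden)),
    rng.normal(size=(d_model, d_hidden)),
    rng.normal(size=(d_hidden, d_model)),
)
```

Unlike ReLU, GELU, and SiLU, which act elementwise on one tensor, SwiGLU is a small layer of its own with three weight matrices, which is why it replaces the whole feed-forward block rather than just the activation.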
§3 Train a tiny MLP end-to-end with AdamW
The capstone of Part III's first wave. Reuse Ch.9's 250-line autograd library, swap SGD for AdamW, and train a 2-layer MLP on the two-moons dataset; it hits 100% accuracy in ~200 epochs. Everything from Ch.4 §3 (chain rule), Ch.8 §3 (AdamW), Ch.9 (vector autograd), and Ch.10 §§1-2 (MLP architecture, ReLU) runs together here.
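The Ch.9 autograd library is not reproduced here, so the sketch below substitutes hand-written backpropagation in NumPy to show the same pipeline in one place: a two-moons generator, a 2-layer ReLU MLP, and an AdamW update with decoupled weight decay. The dataset generator, the hidden width, and every hyperparameter are assumptions made for this example, and no particular final accuracy is claimed for it.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_moons(n, noise):
    """Two interleaving half-circles, n // 2 points per class."""
    t = rng.uniform(0.0, np.pi, n // 2)
    upper = np.stack([np.cos(t), np.sin(t)], axis=1)
    lower = np.stack([1.0 - np.cos(t), 0.5 - np.sin(t)], axis=1)
    X = np.concatenate([upper, lower]) + rng.normal(0.0, noise, (n, 2))
    y = np.concatenate([np.zeros(n // 2), np.ones(n // 2)])
    return X, y

def loss_and_grads(p, X, y):
    """Forward pass, binary cross-entropy, and chain-rule gradients."""
    z1 = X @ p["W1"] + p["b1"]
    h = np.maximum(0.0, z1)                       # ReLU
    logits = (h @ p["W2"] + p["b2"]).ravel()
    # Numerically stable cross-entropy computed directly on the logits.
    loss = np.mean(np.maximum(logits, 0.0) - logits * y
                   + np.log1p(np.exp(-np.abs(logits))))
    probs = 0.5 * (1.0 + np.tanh(0.5 * logits))   # sigmoid(logits)
    dlogits = ((probs - y) / len(y))[:, None]
    dz1 = (dlogits @ p["W2"].T) * (z1 > 0)        # backprop through ReLU
    grads = {"W1": X.T @ dz1, "b1": dz1.sum(axis=0),
             "W2": h.T @ dlogits, "b2": dlogits.sum(axis=0)}
    return loss, grads, probs

X, y = make_moons(200, noise=0.1)
hidden = 16  # assumed width
params = {
    "W1": rng.normal(0.0, 1.0, (2, hidden)),
    "b1": np.zeros(hidden),
    "W2": rng.normal(0.0, np.sqrt(2.0 / hidden), (hidden, 1)),
    "b2": np.zeros(1),
}

# AdamW state and assumed hyperparameters.
lr, beta1, beta2, eps, weight_decay = 0.05, 0.9, 0.999, 1e-8, 0.01
m = {k: np.zeros_like(val) for k, val in params.items()}
v = {k: np.zeros_like(val) for k, val in params.items()}

losses = []
for step in range(1, 201):  # full-batch, so one step is one epoch
    loss, grads, _ = loss_and_grads(params, X, y)
    losses.append(loss)
    for k in params:
        m[k] = beta1 * m[k] + (1 - beta1) * grads[k]
        v[k] = beta2 * v[k] + (1 - beta2) * grads[k] ** 2
        m_hat = m[k] / (1 - beta1 ** step)        # bias correction
        v_hat = v[k] / (1 - beta2 ** step)
        # Decoupled decay: applied to the weights directly, not via the gradient.
        decay = weight_decay if k.startswith("W") else 0.0
        params[k] -= lr * (m_hat / (np.sqrt(v_hat) + eps) + decay * params[k])

_, _, probs = loss_and_grads(params, X, y)
accuracy = np.mean((probs > 0.5) == (y == 1))
print(f"final loss {losses[-1]:.4f}, train accuracy {accuracy:.3f}")
```

Replacing the decay term with an extra weight_decay * params[k] added to grads[k] before the moment updates would turn this into Adam with L2 regularization, which is the distinction AdamW exists to draw.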