Backpropagation from scratch

§1 Vector Jacobians & the VJP
Scalar autograd from Ch.4 §3 generalises to vector/matrix ops by replacing each scalar derivative with a Jacobian. The trick: never materialise the Jacobian; implement its vector-Jacobian product (VJP) directly, i.e. a function that maps the upstream gradient v to vᵀJ. The VJPs for matmul, ReLU, and softmax+cross-entropy are the three you'll use ten thousand times.
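To make the "never materialise the Jacobian" point concrete, here is a minimal sketch of the matmul and ReLU VJPs. It assumes row-major float buffers and a Y = XW layout with X (n×k) and W (k×m); the function names are illustrative and not taken from the chapter's code. Gradients are accumulated with += so that a value used by several ops receives the sum of their contributions.

```c
#include <assert.h>
#include <stddef.h>

/* Forward: Y = X W, with X (n x k), W (k x m), Y (n x m), all row-major. */
void matmul_forward(const float *X, const float *W, float *Y,
                    size_t n, size_t k, size_t m)
{
    assert(X && W && Y);
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < m; j++) {
            float s = 0.0f;
            for (size_t p = 0; p < k; p++)
                s += X[i * k + p] * W[p * m + j];
            Y[i * m + j] = s;
        }
}

/* VJP of matmul: given upstream gY (n x m), accumulate
 *   gX += gY W^T   (n x k)
 *   gW += X^T gY   (k x m)
 * The full Jacobian (n*m by n*k entries for X alone) is never built. */
void matmul_vjp(const float *X, const float *W, const float *gY,
                float *gX, float *gW, size_t n, size_t k, size_t m)
{
    assert(X && W && gY && gX && gW);
    for (size_t i = 0; i < n; i++)
        for (size_t p = 0; p < k; p++)
            for (size_t j = 0; j < m; j++) {
                float g = gY[i * m + j];
                gX[i * k + p] += g * W[p * m + j];
                gW[p * m + j] += X[i * k + p] * g;
            }
}

/* VJP of ReLU: the Jacobian is diagonal with entries 0 or 1, so the
 * product reduces to masking the upstream gradient by (x > 0). */
void relu_vjp(const float *x, const float *gy, float *gx, size_t len)
{
    assert(x && gy && gx);
    for (size_t i = 0; i < len; i++)
        if (x[i] > 0.0f)
            gx[i] += gy[i];
}
```

For X = [1 2], W = [[3 4],[5 6]] and an upstream gradient of all ones, matmul_vjp yields gX = [7 11] and gW = [[1 1],[2 2]], which matches gY Wᵀ and Xᵀ gY computed by hand.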
§2 The backprop algorithm — tape, topo, memory
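Backprop = forward pass records a tape of ops in execution order, which is already a valid topological order; backward pass walks the tape in reverse, calling per-op VJPs and accumulating gradients into each parameter. Activation memory (saved forward intermediates) dominates parameter memory at scale; gradient checkpointing trades recompute for memory by discarding some intermediates and recomputing them during the backward pass.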
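The record-then-reverse structure can be sketched with a deliberately tiny tape. This version is scalar-valued and supports only add and mul to keep it short; the Tape and Node types, the fixed TAPE_MAX capacity, and the function names are assumptions made for illustration, not the chapter's actual data structures. In a tensor version the saved val fields are the activation memory discussed above.

```c
#include <assert.h>
#include <stddef.h>

#define TAPE_MAX 64  /* hypothetical fixed capacity, for brevity */

typedef enum { OP_LEAF, OP_ADD, OP_MUL } Op;

typedef struct {
    Op op;
    int a, b;    /* tape indices of the inputs (-1 for leaves) */
    float val;   /* forward value, saved for the backward pass */
    float grad;  /* gradient accumulated during the backward pass */
} Node;

typedef struct {
    Node nodes[TAPE_MAX];
    int len;
} Tape;

void tape_init(Tape *t) { t->len = 0; }

/* Append one node; inputs always have smaller indices than outputs,
 * so index order is a topological order of the graph. */
static int tape_push(Tape *t, Op op, int a, int b, float val)
{
    assert(t->len < TAPE_MAX);
    Node n = { op, a, b, val, 0.0f };
    t->nodes[t->len] = n;
    return t->len++;
}

int tape_leaf(Tape *t, float v) { return tape_push(t, OP_LEAF, -1, -1, v); }

int tape_add(Tape *t, int a, int b)
{
    return tape_push(t, OP_ADD, a, b, t->nodes[a].val + t->nodes[b].val);
}

int tape_mul(Tape *t, int a, int b)
{
    return tape_push(t, OP_MUL, a, b, t->nodes[a].val * t->nodes[b].val);
}

/* Backward pass: seed d(out)/d(out) = 1, then walk the tape in reverse,
 * applying each op's VJP and accumulating into its inputs. */
void tape_backward(Tape *t, int out)
{
    for (int i = 0; i < t->len; i++)
        t->nodes[i].grad = 0.0f;
    t->nodes[out].grad = 1.0f;

    for (int i = out; i >= 0; i--) {
        Node *n = &t->nodes[i];
        switch (n->op) {
        case OP_ADD:
            t->nodes[n->a].grad += n->grad;
            t->nodes[n->b].grad += n->grad;
            break;
        case OP_MUL:
            t->nodes[n->a].grad += n->grad * t->nodes[n->b].val;
            t->nodes[n->b].grad += n->grad * t->nodes[n->a].val;
            break;
        case OP_LEAF:
            break;
        }
    }
}
```

Recording z = x*y + x with x = 2 and y = 3 and then calling tape_backward on z leaves a gradient of 4 on x (y + 1, summed over its two uses) and 2 on y. The += accumulation is what makes the reuse of x come out right.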
§3 Vector autograd in 250 lines — train a tiny NN
Real working code: ~250 lines of C implementing a tape, a Tensor type, four ops (matmul, add_bias, ReLU, softmax+CE), and SGD. Train it on the classic XOR problem, the function that single-layer perceptrons famously CANNOT learn, but that a 2-layer MLP solves to perfection in 50 epochs.