Autodiff, the computation graph, why frameworks exist. Kernel: a tiny hand-written autograd.
Scalar autograd from Ch.4 §3 generalises to vector/matrix ops by replacing each scalar derivative with a Jacobian. The trick: never materialise the Jacobian; implement its vector-Jacobian product (VJP) directly. The VJPs for matmul, ReLU, and softmax+cross-entropy are the three you'll use ten thousand times.
Backprop = forward pass records a tape; backward pass walks the tape in reverse, calling per-op VJPs and accumulating gradients into each op's inputs and, through them, into each parameter. Activation memory (saved forward intermediates) often dominates parameter memory at scale, since it grows with batch size and depth; gradient checkpointing trades recompute for memory.
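The record-then-reverse mechanism can be sketched in a few dozen lines. The types and names here (Tensor, Node, Tape, op_add, op_mul, tape_backward, the fixed TAPE_MAX capacity) are hypothetical and chosen for brevity; two elementwise ops stand in for the real ones, and the caller is assumed to seed the output gradient before walking the tape.

```c
#include <assert.h>
#include <stddef.h>

#define TAPE_MAX 64   /* assumed fixed capacity, enough for a toy graph */

typedef struct {
    float *data;      /* values written by the forward pass */
    float *grad;      /* same length as data; gradients accumulate here */
    size_t len;
} Tensor;

typedef struct Node Node;
struct Node {
    void (*vjp)(const Node *self);  /* reads out->grad, adds into a/b grad */
    Tensor *a, *b, *out;
};

typedef struct {
    Node nodes[TAPE_MAX];
    size_t count;
} Tape;

static void tape_push(Tape *t, void (*vjp)(const Node *),
                      Tensor *a, Tensor *b, Tensor *out)
{
    assert(t->count < TAPE_MAX);
    t->nodes[t->count++] = (Node){vjp, a, b, out};
}

static void add_vjp(const Node *n)
{
    for (size_t i = 0; i < n->out->len; i++) {
        n->a->grad[i] += n->out->grad[i];
        n->b->grad[i] += n->out->grad[i];
    }
}

static void mul_vjp(const Node *n)
{
    for (size_t i = 0; i < n->out->len; i++) {
        n->a->grad[i] += n->out->grad[i] * n->b->data[i];
        n->b->grad[i] += n->out->grad[i] * n->a->data[i];
    }
}

/* Forward ops: compute the result, then record how to undo it. */
void op_add(Tape *t, Tensor *a, Tensor *b, Tensor *out)
{
    assert(a->len == b->len && a->len == out->len);
    for (size_t i = 0; i < out->len; i++)
        out->data[i] = a->data[i] + b->data[i];
    tape_push(t, add_vjp, a, b, out);
}

void op_mul(Tape *t, Tensor *a, Tensor *b, Tensor *out)
{
    assert(a->len == b->len && a->len == out->len);
    for (size_t i = 0; i < out->len; i++)
        out->data[i] = a->data[i] * b->data[i];
    tape_push(t, mul_vjp, a, b, out);
}

/* Backward: walk the tape last-to-first, calling each recorded VJP. */
void tape_backward(Tape *t)
{
    for (size_t i = t->count; i-- > 0; )
        t->nodes[i].vjp(&t->nodes[i]);
}
```

For w = x*y + x with x = 2 and y = 3, seeding w.grad[0] = 1 and calling tape_backward leaves x.grad[0] = 4 and y.grad[0] = 2. Accumulating with += rather than assigning is what makes x, which feeds two ops, receive both contributions. Note that mul_vjp reads the saved forward values a->data and b->data: those saved intermediates are the activation memory the paragraph above refers to.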
Real working code. ~250 lines of C implementing tape + Tensor + four ops (matmul, add_bias, ReLU, softmax+CE) + SGD. Train it on the classic XOR problem: the function that single-layer perceptrons famously cannot learn, because it is not linearly separable, but that a 2-layer MLP solves to perfection in 50 epochs.
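As a rough sketch of the two pieces that sit around the ops, here is the XOR dataset and a plain SGD update. The names XOR_X, XOR_Y, and sgd_step are illustrative assumptions, as is the choice to zero each gradient right after it is applied; the chapter's kernel may organise this differently.

```c
#include <assert.h>
#include <stddef.h>

/* The four XOR examples as a row-major 4 x 2 input matrix, with labels
   encoded as class indices 0/1 for a two-way softmax + cross-entropy head. */
static const float XOR_X[4 * 2] = {
    0.0f, 0.0f,
    0.0f, 1.0f,
    1.0f, 0.0f,
    1.0f, 1.0f,
};
static const int XOR_Y[4] = {0, 1, 1, 0};

/* Vanilla SGD: param <- param - lr * grad.
   The gradient is cleared afterwards so the next backward pass,
   which accumulates with +=, starts from zero. */
void sgd_step(float *param, float *grad, size_t len, float lr)
{
    assert(param && grad && lr > 0.0f);
    for (size_t i = 0; i < len; i++) {
        param[i] -= lr * grad[i];
        grad[i] = 0.0f;
    }
}
```

One epoch on this dataset would then be: forward over all four rows, backward over the tape, and one sgd_step call per parameter tensor.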