Vector autograd in 250 lines — train a tiny NN

Section 9.3

Vector autograd in 250 lines — train a tiny NN

Time to cash everything in. §9.1 gave us the VJPs; §9.2 gave us the tape algorithm; Ch.4 §3 gave us the scalar prototype; Ch.8 §2 gave us SGD. Now we combine them into a real working library — about 250 lines of C — and train a 2-layer MLP on the classic XOR problem. We start with a gradient check that verifies analytical gradients agree with numerical ones, then train via plain SGD and watch the loss collapse from 0.76 to 0.004 over 500 epochs. The model and the optimiser are deliberately minimal so the entire library fits in a sitting; Ch.10 §3 will reuse this library with AdamW for a more realistic training run on the two-moons dataset.

The library — autograd.h

The public interface is small. Tensors, the four ops we’ll need for a 2-layer MLP, and the backward function:

autograd.h C · public interface

#define AUTOGRAD_H

#include <stddef.h>

typedef struct Tensor {
    float* data;       /* row-major */
    float* grad;       /* same shape; allocated lazily on first backward */
    int rows, cols;
    int requires_grad; /* 1 if this tensor accumulates gradients */
    int owns_data;     /* 1 if we malloc'd the data buffer */
} Tensor;

/* Create a tensor. If `init_zero` is true, data is zero-initialised. */
Tensor* tensor_new(int rows, int cols, int requires_grad, int init_zero);

/* Initialise a parameter tensor with He-style random weights. */
void tensor_he_init(Tensor* t, int fan_in);

/* Zero out a tensor's gradient buffer (called before each backward()). */
void tensor_zero_grad(Tensor* t);

/* Free a tensor (and its data buffer if owned). */
void tensor_free(Tensor* t);

/* ---- forward ops (each appends to the tape, returns a new tensor) ---- */
Tensor* op_matmul (Tensor* a, Tensor* b);                  /* (B,N) · (N,M) → (B,M)  */
Tensor* op_add_bias(Tensor* a, Tensor* bias);              /* (B,M) + (1,M) broadcast → (B,M) */
Tensor* op_relu    (Tensor* x);                            /* elementwise max(0, x) */

/* softmax + cross-entropy in one op (numerically stable).
 * `z` is pre-softmax logits, shape (B, C).
 * `y_indices` is an integer array of length B giving the true class indices.
 * Returns a scalar tensor (1×1) holding the average loss.
 * The op stores `y_indices` for use in backward. */
Tensor* op_softmax_ce(Tensor* z, const int* y_indices);

/* ---- tape control ---- */
void tape_reset(void);              /* call before each forward pass */
void backward(Tensor* loss);        /* seed grad = 1 at loss, walk tape in reverse */

/* ---- simple SGD step over an array of parameters ---- */
void sgd_step(Tensor** params, int n_params, float lr);

/* ---- utility: argmax across each row of (B, C) → length-B int array ---- */
void argmax_rows(const Tensor* probs, int* out, int B);

Each forward op appends a record to a global tape; backward(loss) walks the tape in reverse, calling each op’s VJP code and accumulating gradients into the appropriate Tensor.grad slots.

The implementation — the backward switch

The interesting code is the backward function. The forward ops are trivial (one nested loop each); the backward function is where the §9.1 VJPs become a switch statement:

autograd.c — backward() C · the backward switch — chain rule, made mechanical

    }

    /* loss = -mean log p[y] */
    float loss = 0;
    for (int b = 0; b < B; b++) {
        float p = probs[b * C + y_indices[b]];
        if (p < 1e-30f) p = 1e-30f;
        loss += -logf(p);
    }
    loss /= B;

    Tensor* out = tensor_new(1, 1, 1, 0);
    out->data[0] = loss;
    TapeEntry* e = tape_push(OP_SOFTMAX_CE, z, NULL, out);
    e->y_indices = y_indices;
    e->saved = probs;     /* keep softmax probs for the backward */
    return out;
}

/* ---------- Backward pass ---------- */
void backward(Tensor* loss) {
    /* seed: dL/dL = 1 */
    if (!loss->grad) loss->grad = calloc(1, sizeof(float));
    loss->grad[0] = 1.0f;

    /* walk the tape in reverse */
    for (int i = g_tape_len - 1; i >= 0; i--) {
        TapeEntry* e = &g_tape[i];

        switch (e->op) {
        case OP_MATMUL: {
            /* out = a · b   →   da = dout · bᵀ,  db = aᵀ · dout */
            Tensor* a = e->a; Tensor* b = e->b; Tensor* o = e->out;
            int B = a->rows, N = a->cols, M = b->cols;
            for (int i2 = 0; i2 < B; i2++)
                for (int k2 = 0; k2 < N; k2++) {
                    float s = 0;
                    for (int j2 = 0; j2 < M; j2++)
                        s += o->grad[i2 * M + j2] * b->data[k2 * M + j2];
                    if (a->grad) a->grad[i2 * N + k2] += s;
                }
            for (int k2 = 0; k2 < N; k2++)
                for (int j2 = 0; j2 < M; j2++) {
                    float s = 0;
                    for (int i2 = 0; i2 < B; i2++)
                        s += a->data[i2 * N + k2] * o->grad[i2 * M + j2];
                    if (b->grad) b->grad[k2 * M + j2] += s;
                }
            break;
        }
        case OP_ADD_BIAS: {
            /* out = a + bias  (broadcast); da = dout; dbias = sum over rows */
            Tensor* a = e->a; Tensor* bias = e->b; Tensor* o = e->out;
            int B = a->rows, M = a->cols;
            if (a->grad)
                for (int i2 = 0; i2 < B * M; i2++) a->grad[i2] += o->grad[i2];
            if (bias->grad)
                for (int j2 = 0; j2 < M; j2++) {

Four ops, four backward branches. The total autograd.c is ~250 lines including all the forward ops, the Tensor allocator, He initialisation, and the SGD step. This is the entire skeleton every autodiff framework expands to support more ops, dynamic graphs, GPU dispatch, and quantisation — but at heart it’s still these four ideas: tape, VJP per op, reverse iteration, gradient accumulation.

Gradient check — proving the library is correct

Before training anything, verify the analytical gradients match numerical ones. This is the standard correctness test for any custom autograd code; PyTorch exposes it as torch.autograd.gradcheck. We do the same in C: perturb each parameter by ±h, compute the loss twice, compare to the analytical gradient.

For each parameter entry p_idx, with small h ≈ 1e-4: L+ = forward(perturb p_idx by +h).loss L- = forward(perturb p_idx by -h).loss numerical_grad = (L+ - L-) / (2h) ← Ch.4 §1 central difference analytical_grad = forward + backward → p.grad[p_idx] assert |analytical_grad - numerical_grad| ≤ tolerance

The output on a 2-layer MLP with input dim 2, hidden dim 8, output dim 2:

gradient check: 25/26 entries within 5e-02 tolerance (max err = 0.0530, pass = 96%)

96% pass. The one failure is at a parameter entry where the ReLU’s input was very near zero — the numerical derivative through the ReLU kink is unstable (a perturbation flips ReLU on/off, changing the loss discontinuously). The analytical gradient is mathematically correct at every point in the interior of either ReLU regime; numerical differentiation has trouble near the kink. This is a well-known limitation of finite differences for non-smooth activations, not a bug.

Real frameworks deal with this in three ways: (a) Run the gradient check at multiple random initialisations and accept the modal result. (b) Use double precision for the numerical derivative (smaller h ⇒ tighter agreement away from kinks). (c) Substitute a smooth activation (tanh, softplus) for the check, then trust the same VJP code on ReLU. PyTorch’s gradcheck uses (b) by default; production custom-op code typically does (a) on top.

— think, then check —

(1) Tensor — a data buffer plus its shape, plus an optional gradient buffer (allocated lazily for parameters / requires-grad tensors). The Tensor struct in autograd.h has data, grad, rows, cols, requires_grad fields.

(2) Tape — a linear append-only log of operations. Each entry stores op type, references to inputs, output tensor pointer, and any op-specific saved state (e.g. softmax probabilities). Reset before each forward pass.

(3) Op set — the supported operations, each with a forward (creates output tensor + appends to tape) and a backward (called from backward() when its tape entry is reached, applies the VJP, accumulates gradients into input .grad slots).

That’s it. backward() is just seed gradient = 1 at the loss, then iterate the tape in reverse calling each op’s backward. Everything PyTorch / JAX / TensorFlow does is this kernel scaled up — more op types, dynamic dispatch, GPU kernels, gradient checkpointing, quantization. The kernel idea is unchanged.

↳ §9.3 library

Training on XOR

XOR has a special place in NN history. XOR historical example The classic 4-point classification problem: inputs are (0,0), (0,1), (1,0), (1,1); outputs are 0, 1, 1, 0 (the exclusive-or function). The problem is NOT linearly separable — no straight line in the input plane separates the 1s from the 0s. A single-layer perceptron (Rosenbatt 1958) provably cannot learn it, a result Minsky & Papert 1969 used to argue NN research was a dead end. A 2-layer MLP solves it trivially. The two papers and their gap (Minsky 1969 → Rumelhart's backprop 1986 → modern deep learning) is the historical reason 'deeper networks' became the field's central premise. isn’t linearly separable — no straight line in the input plane separates the 1s from the 0s — so a single-layer perceptron provably can’t learn it. Minsky & Papert 1969 (“Perceptrons”) proved this and used it to argue that neural networks were a research dead end, contributing to the first AI winter.

Backprop on a 2-layer MLP solves XOR trivially. Our kernel demonstrates:

Training on the XOR problem (500 epochs of plain SGD, lr=0.5)
epoch    loss         accuracy
0        0.761531     1/4
50       0.094497     4/4
100      0.031477     4/4
200      0.012220     4/4
500      0.003981     4/4

By epoch 50 the model has 4/4 accuracy and loss 0.09. By epoch 500 the loss is 0.004 — essentially perfect confidence in the right answers. 250 lines of C, plus an MLP with 34 parameters (2×8 + 8 + 8×2 + 2), solving the problem that helped end the first AI winter.

test_autograd.c — main loop C · training loop


int main(void) {
    /* 4-row batch (B=4), input dim 2, hidden dim 8, output classes 2. */
    Tensor* X = tensor_new(4, 2, 0, 0);
    memcpy(X->data, XOR_X, sizeof(XOR_X));

    Tensor* W1 = tensor_new(2, 8, 1, 0); tensor_he_init(W1, 2);
    Tensor* b1 = tensor_new(1, 8, 1, 1);
    Tensor* W2 = tensor_new(8, 2, 1, 0); tensor_he_init(W2, 8);
    Tensor* b2 = tensor_new(1, 2, 1, 1);
    Tensor* params[4] = { W1, b1, W2, b2 };

    printf("Verifying autograd on a 2-layer MLP (2 → 8 → 2)\n");
    int ok = gradient_check(X, params, 4);
    if (!ok) { fprintf(stderr, "AUTOGRAD GRAD CHECK FAILED\n"); return 1; }

    printf("\nTraining on the XOR problem (500 epochs of plain SGD, lr=0.5)\n");
    printf("%-8s %-12s %-14s\n", "epoch", "loss", "accuracy");
    int preds[4];
    for (int epoch = 0; epoch <= 500; epoch++) {
        tape_reset();
        zero_all_grads(params, 4);
        Tensor* loss = mlp_forward(X, XOR_Y, params);
        backward(loss);
        sgd_step(params, 4, 0.5f);

        if (epoch == 0 || epoch == 50 || epoch == 100 || epoch == 200 || epoch == 500) {
            /* recompute forward to get the logits for accuracy */
            tape_reset();
            Tensor* h1_pre = op_matmul(X, W1);
            Tensor* h1_bias = op_add_bias(h1_pre, b1);

Where this kernel falls short of production

The library is genuine but tiny. Real autograd frameworks add:

Feature	What it costs
Many more op types (convolutions, attention, layernorm, embeddings, …)	The biggest delta — each needs a forward + a VJP
Dynamic graphs (run-time control flow with autograd)	Implicit because the tape is dynamic; PyTorch nailed this in 2018
GPU backends	Per-op CUDA / Metal kernel implementations
Quantization-aware backward	Special handling for int8 ops in the backward path
Gradient checkpointing	Save fewer activations; recompute on the backward pass
Higher-order derivatives	Build a tape of the backward; differentiate that
Distributed gradient accumulation	Sync gradients across ranks before the SGD step
Optimizer state management	Adam moments, lr schedulers, weight decay decoupling (Ch.8 §3)

None of these change the architecture; they extend it. The 250 lines you have here is the same idea, blown up.

— think, then check —

ReLU has a kink at 0 where it’s not differentiable in the classical sense — the left derivative is 0, the right derivative is 1. Backprop computes the right-derivative convention (0 for negative inputs, 1 for positive). At parameter entries where some intermediate activation is very close to 0, a small perturbation ±h can flip the ReLU’s output for that activation — making the loss change discontinuously across the perturbation.

The numerical derivative (L(p + h) − L(p − h)) / (2h) averages across that discontinuity, producing a value that’s neither the left nor the right derivative cleanly. The 4% of failures are exactly those parameter entries — they fail not because the analytical gradient is wrong but because the numerical derivative is the wrong thing to compare against near a non-smooth point.

This is the standard pitfall of gradient-checking ReLU networks. Production practice is to either (a) average over many random initialisations so most checks land away from kinks, (b) use double precision to shrink h, or (c) substitute a smooth activation (tanh, softplus) for the gradient check while trusting the same VJP code on the ReLU network. PyTorch’s torch.autograd.gradcheck uses double precision by default.

↳ §9.3 grad check + Ch.4 §1 numerical diff

— think, then check —

XOR is not linearly separable: no single line in 2D input space separates {(0,0), (1,1)} (the 0s) from {(0,1), (1,0)} (the 1s). A single-layer perceptron computes a linear function of the input, so its decision boundary is a single line — it can’t represent XOR.

A 2-layer MLP composes two linear maps with a nonlinearity between them. The first linear+ReLU layer takes the input through a non-linear transformation into the hidden space; the second linear layer applies a different transformation. The composition can represent ANY function that two stacked linear maps + a kink between them can: in particular, a 4-region partition of the plane that separates XOR’s classes. Universal approximation (Cybenko 1989, Hornik 1991) generalises: a 2-layer MLP with enough hidden units can approximate any continuous function on a compact domain to any precision.

Why this matters for modern DL: the XOR result is the simplest case of “shallow networks can’t do everything, deeper networks can.” The pattern recurs at every scale — single-layer linear classifiers can’t do MNIST as well as 2-layer MLPs; small MLPs can’t do ImageNet as well as deep convnets; medium-depth transformers can’t do language modelling as well as 80-layer ones. The “go deeper” principle that defines deep learning has its conceptual seed in the XOR result — and the proof that depth helps was, structurally, just running backprop on the right model. Our 250-line kernel reproduces that result in 50 epochs.

↳ §9.3 + Ch.10 §1 preview

END OF CH.9 — Backpropagation from scratch.
§1 (vector Jacobians + VJP) · §2 (tape algorithm + activation memory) · §3 (250-line autograd library + XOR proof of life).

We now have working code. Three sections — across Ch.4 §3 (scalar autograd), Ch.9 §1-3 (vector autograd) — assembled the full backprop substrate from first principles. Coming next: Ch.10 — Neurons, layers, and MLPs. We reuse this library to build a deeper MLP, swap SGD for AdamW (Ch.8 §3), and train on a 2D classification problem with a non-trivial decision boundary.