The chain rule — engine of backprop

Section 4.3

The chain rule — engine of backprop

Three small results, taken together, are the entire substrate of training every neural network. §1 said derivatives are sensitivity. §2 said that for multivariable functions the right object is a Jacobian — a matrix that is the local linear approximation. This section names the third and final piece: when you compose two functions, their Jacobians compose by matrix multiplication. The chain rule, said out loud, is exactly that. The two ways to walk the resulting product — right to left vs left to right — are the two modes of automatic differentiation, forward and reverse. Reverse mode is backprop. Once you have this section, every later “how does X get optimised?” question is the same question with different functions plugged in.

The scalar chain rule

For two single-variable functions composed as y = f(g(x)):

dy/dx = f'(g(x)) · g'(x)

Read it as a sentence: “the derivative of the outer function evaluated at the inner function’s output, times the derivative of the inner function.” Operationally: a small nudge in x becomes a nudge in g(x) scaled by g’(x), which then becomes a nudge in f(g(x)) scaled by f’(g(x)). Two multiplications, applied in order.

That’s the picture the viz makes concrete. Drag x; pick g and f; watch the perturbation h get multiplied by g’(x) to become Δu, then by f’(u) to become Δy. The chain rule readout is literally that product.

choose g

choose f

x = 0.700 nudge h = 0.150

Drag x. The small perturbation h flows through g (multiplying by g'(x)) to become a perturbation in u, then through f (multiplying by f'(u)) to become a perturbation in y. The chain rule is literally just saying "multiply the local derivatives along the chain."

Composition propagates sensitivity by multiplication. dy/dx = f'(g(x)) · g'(x). No matter how complicated the network of compositions, the chain rule says: walk the path, multiply the local Jacobians. That's the entire principle behind backpropagation.

The “actual Δu” and “actual Δy” lines compare the chain-rule prediction (the multiplied derivatives) against the observed change in the composed function. They agree to first order — the difference is exactly the second-order Taylor term that linear approximation drops.

— think, then check —

dy/dx = f’(g(x)) · g’(x).

A small change Δx becomes a change Δu = g’(x)·Δx at the intermediate, which becomes a change Δy = f’(u)·Δu = f’(g(x))·g’(x)·Δx at the output. The total sensitivity of y to x is the product of the local sensitivities along the chain. No matter how many stages you stack, the principle stays: multiply the local derivatives, in order.

↳ §4.3 definition

The multivariable chain rule — Jacobians compose by matmul

When functions take and return vectors, the chain rule generalises by exactly the move you’d expect: derivatives become Jacobians, scalar multiplication becomes matrix multiplication. For f : ℝⁿ → ℝᵐ and g : ℝᵏ → ℝⁿ, the composition h(x) = f(g(x)) has Jacobian:

J(h)(x) = J(f)(g(x)) · J(g)(x) (m × k) (m × n) (n × k) ← shapes line up exactly

The shapes are forced by Ch.2 §2’s three-axis argument: the inner dimension n appears in both factors and gets contracted; the outer dimensions m (output) and k (input) survive into the composed Jacobian. This is the exact same shape rule as matmul, because the underlying operation IS matmul. The Jacobian of the composition equals the matrix product of the Jacobians.

Ch.2 §2 pays its full dividend here. “Matmul is function composition” wasn’t an analogy — it was the literal statement of the chain rule applied to linear functions. Now that we have nonlinear functions in the mix, the same identity holds with the linear functions upgraded to local linear approximations (Jacobians). Compose any chain of functions — linear or not — and the local Jacobian of the composition is the product of the local Jacobians. Backprop is exactly this matrix product, evaluated in a particular order.

Stack three functions: y = h(g(f(x))). The chain rule gives:

J(y)(x) = J(h)(...) · J(g)(...) · J(f)(x) = J₃ · J₂ · J₁ (suppressing the evaluation points for compactness)

A product of three Jacobians. Now: a matrix product is associative — you can compute (J₃ · J₂) · J₁ or J₃ · (J₂ · J₁). The math is the same. The computational cost is not.

— think, then check —

By the multivariable chain rule, J(f ∘ g)(x) = J(f)(g(x)) · J(g)(x).

J(f) has shape 200 × 100 (output dims × input dims of f).

J(g) has shape 100 × 50 (output dims × input dims of g).

The product (200 × 100) · (100 × 50) → 200 × 50. The inner 100 contracts — that’s the intermediate space the two functions agree on. The outer 200 × 50 is exactly J(f ∘ g)‘s shape, matching the fact that f ∘ g : ℝ⁵⁰ → ℝ²⁰⁰ as a function.

↳ §4.3 multivariable chain rule

Forward mode vs reverse mode

For a long chain y = fₗ(fₗ₋₁( … f₁(x) … )), you have J = Jₗ · Jₗ₋₁ · … · J₁. To compute the gradient of a scalar loss with respect to a vector of parameters, you need to evaluate this Jacobian product (or its transpose) applied to a vector. Two strategies, both correct:

Forward mode — multiply J₁ · v first, then J₂ · (J₁ · v), and so on. Each step is a Jacobian-vector product (JVP). The intermediate is always vector-shaped, which is cheap.
Reverse mode mode Walk the Jacobian product left-to-right: combine (uᵀ · Jₗ) first, then multiply that into Jₗ₋₁, and so on. Each step multiplies a row vector by a matrix (the vector-Jacobian product, VJP). Efficient when the output dimension is small (a single scalar loss). One pass computes gradients w.r.t. all inputs. Total cost ~O(forward) regardless of parameter count. Backpropagation is reverse-mode autodiff. Then → now: backprop in the 1986 Rumelhart/Hinton/Williams paper was reverse-mode autodiff for neural networks specifically. Modern autograd frameworks (PyTorch, JAX, TensorFlow) just generalise the same algorithm to arbitrary computation graphs. — multiply uᵀ · Jₗ first (where u is a chosen output direction, usually a unit basis vector picking the scalar loss), then continue right with (uᵀ · Jₗ) · Jₗ₋₁, and so on. Each step is a vector-Jacobian product (VJP). The intermediate is still vector-shaped.

Both modes contract the Jacobian product into a vector at each step, so each layer’s cost is at most O(size of intermediate) — not a full matrix multiply. The choice between modes depends on whether the input dimension or the output dimension is bigger.

For training a neural network, the input is a parameter vector of dimension 10⁸+; the output is a single scalar loss. Reverse mode wins by a factor of ~10⁸: one backward pass computes all 10⁸ parameter gradients; forward mode would require 10⁸ forward passes (one per input dimension) to do the same.

Network shape: parameters ∈ ℝⁿ → loss ∈ ℝ¹ (m = 1, n large) Forward mode cost: n · (one forward pass) ← multiply right to left, once per input direction Reverse mode cost: 1 · (one forward + one backward) ← multiply left to right, once for the one output For n = 10⁸ and a forward pass that costs C, reverse mode is ~n × cheaper.

That ratio is the entire reason deep learning is computationally possible. Without reverse-mode autodiff, training a large model would cost 10⁸× more than running inference on it. With reverse mode, training is roughly 2–3× the cost of inference. (Rumelhart, Hinton & Williams, “Learning Representations by Back-propagating Errors,” Nature 1986 — the founding paper for neural backprop.)

Now make it run — a tape autograd in 120 lines

The autograd kernel for this section builds an expression as it goes — every operation appends a (op, inputs, value) record to a flat tape. The backward pass walks the tape in reverse, multiplying local derivatives and accumulating into each node’s gradient slot. Test function: y = \sin(x²) + x³. Analytical derivative: 2x·\cos(x²) + 3x² (which we won’t actually write down anywhere — the autograd derives it).

autograd.c — backward() C · the backward pass, ~12 lines

        case POW3: v = va * va * va; break;
        default:   v = 0; break;
    }
    nodes[n_nodes] = (Node){ op, a, -1, v, 0.0 };
    return n_nodes++;
}

/* ---- backward pass ----------------------------------------------------- */
static void backward(int output) {
    for (int i = 0; i < n_nodes; i++) nodes[i].g = 0.0;
    nodes[output].g = 1.0;                         /* seed the output's grad */

    /* Walk in reverse topological order — for a tape, that's just reverse. */
    for (int i = n_nodes - 1; i >= 0; i--) {
        Node* n = &nodes[i];
        double gi = n->g;                          /* incoming grad */
        if (gi == 0.0) continue;
        switch (n->op) {
            case LEAF:                              break;   /* terminal */
            case ADD:  nodes[n->a].g += gi;
                       nodes[n->b].g += gi;        break;
            case SUB:  nodes[n->a].g += gi;
                       nodes[n->b].g -= gi;        break;
            case MUL:  nodes[n->a].g += gi * nodes[n->b].v;

The backward function is the entire chain rule, mechanised as a switch statement. For every operation type, we know the local derivative formula:

ADD: both inputs receive the incoming gradient unchanged (d/dx (x + y) = 1).
MUL: each input receives the incoming gradient times the other input’s value (d/dx (x · y) = y).
SIN: input receives incoming gradient times cos(value).
POW2: input receives incoming gradient times 2 · value.
and so on.

That’s all an autograd framework does, generalised. PyTorch has hundreds of operation types instead of nine, vector ops instead of scalar, dynamic graphs and CUDA kernels and dispatch logic — but the kernel is structurally identical. A backward function for every forward operation, applied in reverse, multiplied into running gradients via the chain rule.

The test confirms it works:

autograd: y = sin(x²) + x³        (dy/dx = 2x·cos(x²) + 3x²)
x        y              dy/dx (auto)   dy/dx (num)    abs err
-1.00    -0.158529      1.919395       1.919395       3.36e-10
0.00     0.000000       0.000000       0.000000       1.00e-10
0.50     0.372404       1.718912       1.718912       5.58e-11
1.00     1.841471       4.080605       4.080605       1.36e-10
1.50     4.153073       4.865479       4.865479       1.56e-10

auto = numerical at every point (within 1e-4)
the chain rule (10 lines of switch statement above) is all backprop is.

Five test points. The autograd gradient and the numerical gradient agree to ~10⁻¹⁰ — basically the precision floor of double-precision arithmetic for this size of problem. The kernel is small enough that you can step through it in gdb and verify the chain rule applying at each node. Ch.9 takes the same architecture, swaps in vector ops and matrix Jacobians, and that’s backprop on a neural network.

— think, then check —

The Jacobian of the loss with respect to all parameters is a 1 × 10⁸ matrix. The chain rule expresses it as a product of layer Jacobians (each huge), but we don’t need to materialise the full product — we need to apply it to a specific direction vector.

Forward mode picks an input direction eⱼ (one of the 10⁸ basis vectors of parameter space) and computes J · eⱼ, which gives one column of J (= one parameter’s gradient). To get all gradients, you’d do this 10⁸ times — 10⁸ forward passes.

Reverse mode picks an output direction (just the scalar 1, since there’s one output) and computes 1ᵀ · J = full gradient vector, in one backward pass. The asymmetry is: forward mode is efficient when the input is low-dimensional; reverse mode is efficient when the output is low-dimensional. For neural net training, output dim = 1 (loss), input dim = parameter count, so reverse is the entire game.

This is also why JAX exposes both jacfwd and jacrev — and the rule of thumb in the JAX docs is precisely “jacfwd if n < m, jacrev if m < n.” The chain rule’s associativity is mathematical; the choice of grouping is engineering.

↳ §4.3 forward/reverse modes

END OF CH.4 — Calculus & gradients refreshed.
END OF PART I — Foundations Refreshed.

§1 (derivatives as sensitivity · numerical U-curve) · §2 (gradient, Jacobian as best linear approx) · §3 (chain rule + tape autograd in 120 lines).

Part I gave you vectors and dot products (Ch.1), matrices as functions (Ch.2), floats and quantization (Ch.3), and the calculus of multivariable optimisation (Ch.4). Every later chapter — attention, normalisation, training-at-scale, quantized inference, vector search — uses these four pieces in combination.

Coming next: Part II — Probability, Geometry & Learning. Ch.5 begins with distributions and variance, the substrate for understanding why anything trains in high dimensions.