Partials, gradient, Jacobian

Section 4.2

Real ML functions take vectors as input and produce vectors (or scalars) as output. A loss takes 10⁸-dimensional parameter vectors and produces one number. A transformer layer takes a vector and produces a vector of the same shape. The one-variable derivative from §1 isn’t enough — you need to talk about sensitivity to one input at a time (partials), the vector that bundles them (gradient), and the matrix that bundles them when there are many outputs (Jacobian). Each of these is the right generalisation; each shows up unchanged in backprop. And — because Ch.2 §1 already taught you that “matrix = function” — the Jacobian is exactly the matrix that is the local linear approximation to a possibly-nonlinear function near a point.

Partial derivatives — sensitivity to one input

For a function f : ℝⁿ → ℝ of n inputs, the partial derivative with respect to xⱼ is just the one-variable derivative, with every other input held fixed:

∂f/∂xⱼ (at x) = lim h→0 ( f(x + h·eⱼ) − f(x) ) / h

where eⱼ is the j-th standard basis vector. The notation switches from d to ∂ (“partial”) to signal that the variable being differentiated is one of several. Computationally, you fix all inputs to their current values, treat f as a one-variable function of xⱼ, and differentiate.

Three partials of f(x, y, z) = x² + 3xy + z⁴:

∂f/∂x = 2x + 3y    ← y, z held fixed; differentiate in x only
∂f/∂y = 3x         ← x, z held fixed
∂f/∂z = 4z³        ← x, y held fixed

That’s all there is to it mechanically. The interesting part is bundling them.

The gradient

Core term: gradient, ∇f(x). The vector of all partial derivatives, ∇f = (∂f/∂x₁, …, ∂f/∂xₙ). For a scalar-output function, this is the natural multi-input generalisation of the derivative. Geometrically, it points in the direction of steepest ascent. Then → now: the definition has been the same since the 19th century. What changed is that 'gradient' is now an everyday word in software engineering: every gradient-descent optimiser, every backprop step, every line of training code is built around it.

The gradient of a scalar-output function f : ℝⁿ → ℝ is the vector of its partials:

∇f(x) = ( ∂f/∂x₁, ∂f/∂x₂, …, ∂f/∂xₙ )ᵀ

It lives in the same space as the input (ℝⁿ) and has two compact geometric meanings:

Direction of steepest ascent. Among all unit directions u ∈ ℝⁿ, the one that maximises the directional derivative u · ∇f is u = ∇f / ‖∇f‖. The dot-product picture from Ch.1 §1.2 returns: maximising u · ∇f with ‖u‖ = 1 is exactly “point u along ∇f.”
Perpendicular to level sets. Walking along a direction perpendicular to ∇f keeps f approximately constant (the directional derivative is zero). So the gradient is normal to the level sets f(x) = c wherever those level sets are smooth.

Both of these show up in the viz:

[GradientField viz readout: point (0.700, 0.500); f = 0.740; ∇f = (1.400, 1.000); ‖∇f‖ = 1.720]

The orange arrow at the dragged point is ∇f(x, y) — pointing in the direction of steepest ascent. The faint teal arrows sample the same gradient field everywhere. Notice how they always point uphill and are perpendicular to the (implicit) contour lines of constant f.

For a scalar field f : ℝ² → ℝ, the gradient ∇f is the vector whose components are the partial derivatives. Geometrically it points uphill; its length is the rate of steepest ascent. Gradient descent simply walks in the −∇f direction.

Drag the point and try the four functions. The bowl's gradient always points outward, away from the minimum. The saddle's arrows point away from the centre along x and toward it along y (the x and y partials enter with opposite signs). Rosenbrock's gradient hugs the famous curving valley. The sine-cosine ripples produce a gradient field that repeats on a periodic grid. The hint arrows (teal) show the gradient field sampled across the whole domain: every arrow points uphill and crosses the local contour at a right angle.

This is the picture behind gradient descent: at each step, walk in the direction −∇f (steepest descent) by some step size. We’ll see the full algorithm in Ch.8.

— think, then check —

∇f at a point is a vector that points in the direction of steepest ascent, with length equal to the rate of steepest ascent. Equivalently: it’s perpendicular to the level curves f(x, y) = const and points uphill across them.

Gradient descent moves in the opposite direction (−∇f) — taking small steps downhill — to find a local minimum of f.

↳ §4.2 gradient geometry

The Jacobian — gradient, generalised

When the output is a vector, f : ℝⁿ → ℝᵐ, there is no single gradient. There are m gradients, one per output component.

Core term: Jacobian, J(f) ∈ ℝᵐˣⁿ. The matrix of all partial derivatives of a vector-valued function f : ℝⁿ → ℝᵐ. Row i is the gradient of output component i; column j is the partial of the whole output vector w.r.t. input j. Reduces to the gradient (as a row vector) when m = 1. Then → now: it is the same matrix Jacobi defined in 1841. What changed: it is now the canonical object backprop manipulates. Every nn.Linear, nn.Conv2d, attention, and softmax layer has a known Jacobian; backprop multiplies these Jacobians backward through the network.

Stacking the m gradients as rows gives a matrix, the Jacobian:

          ┌                                     ┐
          │ ∂f₁/∂x₁   ∂f₁/∂x₂   …   ∂f₁/∂xₙ │   ← gradient of f₁
J(f)(x) = │ ∂f₂/∂x₁   ∂f₂/∂x₂   …   ∂f₂/∂xₙ │   ← gradient of f₂
          │    ⋮                         ⋮     │
          │ ∂fₘ/∂x₁   ∂fₘ/∂x₂   …   ∂fₘ/∂xₙ │   ← gradient of fₘ
          └                                     ┘

Shape: m rows, n columns, exactly the shape that turns input nudges into output nudges.

This is an m × n matrix — and from Ch.2 §1, that’s exactly the shape of a linear function from ℝⁿ → ℝᵐ. Which is intentional. The Jacobian is the linear function that best approximates f near a point:

f(x + Δx) ≈ f(x) + J(f)(x) · Δx    ← linear approximation

(small nudge of input) ⇒ (small nudge of output), given by the Jacobian-vector product

This is the multivariate Taylor expansion truncated after the linear term. The leftover error, which is O(‖Δx‖²) when f is twice continuously differentiable, carries everything nonlinear about f; the linear part is the Jacobian’s job.

“Matrix = function,” now with calculus. Ch.2 §1 said matrices are linear functions and their columns are the images of basis vectors. The Jacobian is the matrix of the linear function that best approximates a possibly nonlinear function near a point. So the column picture extends: column j of J(f)(x) is the rate at which f changes when you nudge input j. The j-th basis vector lands at the j-th column — same picture, different function. Backprop in Ch.9 is just chain-rule composition of these Jacobians at every layer.

Shapes always matter — and now you have the rule

Confusing the rows and columns of the Jacobian trips up many ML practitioners. The rule that prevents it:

f : ℝⁿ → ℝᵐ ⇒ J(f) has shape m × n (output dims × input dims)

If you remember that Jacobians multiply by input vectors on the right (J · Δx produces an output-shaped vector), the convention falls out: J · Δx requires J to have as many columns as Δx has rows — i.e., n columns. The number of rows is whatever the output dimension is. Output dims = rows, input dims = columns. Same convention as nn.Linear(in_features, out_features), which stores a weight of shape (out, in) — see Ch.2 §1’s PyTorch grounding.

— think, then check —

J(f) has shape 200 × 100 (m × n = output dims × input dims).

Δx is a 100-vector (input space). J·Δx is a (200 × 100) · (100 × 1) matrix–vector product → a 200-vector, living in the output space. Read it as: “apply the local linear approximation of f to this input nudge, get the corresponding output nudge.”

The Jacobian-vector product (JVP) is a primitive operation in forward-mode autodiff (jax.jvp and torch.func.jvp, for example) precisely because this shape pattern is the universal way to push small input changes through a nonlinear function to first order.

↳ §4.2 Jacobian shape

Make it run — analytical vs numerical

The numerical gradient from §1 generalises naturally: take the partial derivative numerically along each input axis. The kernel checks an analytical gradient against the numerical one for f(x, y) = (1 − x)² + 10(y − x²)² (the Rosenbrock function — the same one in the viz):

grad_check.c (loop) C · analytical ∇f vs central-difference ∇f

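#include <math.h>
#include <stdio.h>

/* Defined elsewhere in grad_check.c (not shown in this excerpt);
   the void return types are assumed from how the loop calls them. */
void grad(double x, double y, double* gx, double* gy);            /* analytical ∇f */
void grad_numerical(double x, double y, double* gx, double* gy);  /* central differences */

int main(void) {
    /* Test points: the known minimum plus three others. */
    struct { double x, y; const char* label; } pts[] = {
        { 0.0,  0.0,  "origin" },
        { 1.0,  1.0,  "minimum (1, 1)" },
        { -0.5, 1.5,  "off-axis" },
        { 1.2,  1.0,  "in the valley" },
    };
    int n = sizeof(pts) / sizeof(pts[0]);

    printf("Rosenbrock f(x, y) = (1 − x)² + 10 (y − x²)²\n");
    printf("%-22s %-22s %-22s %s\n",
           "point", "∇f analytical", "∇f numerical", "‖∇f‖");
    for (int i = 0; i < n; i++) {
        double gx, gy, ngx, ngy;
        grad(pts[i].x, pts[i].y, &gx, &gy);
        grad_numerical(pts[i].x, pts[i].y, &ngx, &ngy);
        double mag = sqrt(gx * gx + gy * gy);  /* ‖∇f‖: local steepness */
        printf("  %s\n", pts[i].label);
        printf("    (%5.2f,%5.2f)        (%8.4f,%8.4f)   (%8.4f,%8.4f)   %.4f\n",
               pts[i].x, pts[i].y, gx, gy, ngx, ngy, mag);
        /* Euclidean distance between the two gradient estimates. */
        double abs_err = sqrt((gx - ngx) * (gx - ngx) + (gy - ngy) * (gy - ngy));
        if (abs_err > 1e-4) {
            fprintf(stderr, "  -> analytical/numerical disagree by %.3e\n", abs_err);
            return 1;  /* assumed: stop at the first mismatch */
        }
    }
    /* Summary lines, matching the program output shown in the text. */
    printf("\nanalytical ≈ numerical at every test point (within 1e-4)\n");
    printf("‖∇f(1,1)‖ ≈ 0 confirms (1,1) is a minimum (gradient vanishes)\n");
    return 0;
}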

The output speaks plainly:

Rosenbrock f(x, y) = (1 − x)² + 10 (y − x²)²
point                  ∇f analytical         ∇f numerical          ‖∇f‖
  origin
    ( 0.00, 0.00)        ( -2.0000,  0.0000)   ( -2.0000,  0.0000)   2.0000
  minimum (1, 1)
    ( 1.00, 1.00)        ( -0.0000,  0.0000)   (  0.0000,  0.0000)   0.0000
  off-axis
    (-0.50, 1.50)        ( 22.0000, 25.0000)   ( 22.0000, 25.0000)   33.3017
  in the valley
    ( 1.20, 1.00)        ( 21.5200, -8.8000)   ( 21.5200, -8.8000)   23.2497

analytical ≈ numerical at every test point (within 1e-4)
‖∇f(1,1)‖ ≈ 0 confirms (1,1) is a minimum (gradient vanishes)

Two things to notice.

The gradient vanishes at the minimum. ∇f(1,1) ≈ 0 is the calculus statement of “you’re at a critical point.” Gradient descent stops moving here. A vanishing gradient alone does not distinguish a minimum from a saddle or a maximum; (1,1) is a minimum because f is a sum of squares that is zero only there.

The magnitude is large in steep regions. At (−0.5, 1.5), off in the rising wall of the function, ‖∇f‖ ≈ 33. The norm of the gradient is the local “steepness,” a meaningful diagnostic during training: huge gradient norms suggest training is blowing up; tiny ones suggest a flat stretch of the loss landscape where little is being learned.

This pattern, an analytical gradient agreeing with a finite-difference gradient to within a small tolerance, is the gradient check used to test custom backward implementations. PyTorch’s torch.autograd.gradcheck runs this kind of comparison against a custom Function.

— think, then check —

The Jacobian J(f)(x) represents the best linear approximation of f near x. Concretely: for small input perturbations Δx,

f(x + Δx) ≈ f(x) + J(f)(x) · Δx.

So J(f)(x) is the matrix that turns “small change in input” into “small change in output,” to first order.

The j-th column of J is the image of the j-th basis vector under this linear function:

J · eⱼ = (∂f₁/∂xⱼ, ∂f₂/∂xⱼ, …, ∂fₘ/∂xⱼ)ᵀ.

That’s the rate at which the whole output vector changes when you nudge input j alone. The picture is: take input j, perturb it by a small dt; the output’s response trajectory has velocity vector = column j of J. Reading the Jacobian column-by-column tells you “how each input affects the output.” Reading it row-by-row tells you “how each output depends on the inputs” — these are the m gradients of the m output components. Same matrix, two readings, both useful in backprop.

↳ §4.2 + Ch.2 §1 column picture

END OF CH.4 §2 — Partials, gradient, Jacobian.
Built: GradientField viz (heatmap + gradient arrows; drag to query ∇f anywhere on four function landscapes); grad_check.c verifies the analytical Rosenbrock gradient against the numerical one and shows that ‖∇f‖ → 0 at the known minimum. Three recall items.
Coming next: §4.3 — The chain rule. Composing local Jacobians is the algebra under backprop.