Partials, gradient, Jacobian
Real ML functions take vectors as input and produce vectors (or scalars) as output. A loss takes a 10⁸-dimensional parameter vector and produces one number. A transformer layer takes a vector and produces a vector of the same shape. The one-variable derivative from §1 isn’t enough: you need to talk about sensitivity to one input at a time (partials), the vector that bundles them (gradient), and the matrix that bundles them when there are many outputs (Jacobian). Each of these is the right generalisation; each shows up unchanged in backprop. And because Ch.2 §1 already taught you that “matrix = function,” the Jacobian is exactly the matrix of the local linear approximation to a possibly-nonlinear function near a point.
Partial derivatives — sensitivity to one input
For a function f : ℝⁿ → ℝ of n inputs, the partial derivative with respect to xⱼ is just the one-variable derivative, with every other input held fixed:
∂f/∂xⱼ (x) = lim_{h→0} [f(x + h·eⱼ) − f(x)] / h,
where eⱼ is the j-th standard basis vector. The notation switches from d to ∂ (“partial”) to signal that the variable being differentiated is one of several. Computationally, you fix all inputs to their current values, treat f as a one-variable function of xⱼ, and differentiate.
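Three partials of f(x, y, z) = x² + 3xy + z⁴:
∂f/∂x = 2x + 3y (y and z held fixed),
∂f/∂y = 3x (x and z held fixed),
∂f/∂z = 4z³ (x and y held fixed).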
That’s all there is to it mechanically. The interesting part is bundling them.
The gradient
The gradient of a scalar-output function f : ℝⁿ → ℝ is the vector of its partials:
∇f(x) = (∂f/∂x₁, ∂f/∂x₂, …, ∂f/∂xₙ)ᵀ.
It lives in the same space as the input (ℝⁿ) and has two compact geometric meanings:
- Direction of steepest ascent. Among all unit directions u ∈ ℝⁿ, the one that maximises the directional derivative u · ∇f is u = ∇f / ‖∇f‖. The dot-product picture from Ch.1 §1.2 returns: maximising u · ∇f with ‖u‖ = 1 is exactly “point u along ∇f.”
- Perpendicular to level sets. Walking along a direction perpendicular to ∇f keeps f approximately constant (the directional derivative is zero). So the gradient is normal to the level sets f(x) = c wherever those level sets are smooth.
Both of these show up in the viz:
Drag the point and try the four functions. Watch for the bowl’s gradient always pointing outward (away from the minimum); the saddle’s gradient pointing away from the centre along one axis and toward it along the other (positive x, negative y partials); Rosenbrock’s gradient hugging the famous curving valley; and the sine-cosine ripples, whose gradient repeats on a periodic grid. The hint arrows (teal) show the gradient field sampled across the whole domain: every arrow points uphill and crosses the local contour at a right angle.
This is the picture behind gradient descent: at each step, walk in the direction −∇f (steepest descent) by some step size. We’ll see the full algorithm in Ch.8.
∇f at a point is a vector that points in the direction of steepest ascent, with length equal to the rate of steepest ascent. Equivalently: it’s perpendicular to the level curves f(x, y) = const and points uphill across them.
Gradient descent moves in the opposite direction (−∇f) — taking small steps downhill — to find a local minimum of f.
The Jacobian — gradient, generalised
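When the output is a vector, f : ℝⁿ → ℝᵐ, there’s no single gradient. There are m gradients, one per output component. Stacking them as rows gives a matrix, the Jacobian:
J(f)(x) = [ ∂fᵢ/∂xⱼ ], with i = 1, …, m indexing rows and j = 1, …, n indexing columns,
so row i is ∇fᵢ(x)ᵀ.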
This is an m × n matrix, and from Ch.2 §1 that’s exactly the shape of a linear function from ℝⁿ → ℝᵐ. That is intentional. The Jacobian is the linear function that best approximates f near a point:
f(x + Δx) = f(x) + J(f)(x) · Δx + O(‖Δx‖²).
This is the multivariate Taylor expansion truncated after the linear term. The leftover O(‖Δx‖²) error is what makes f nonlinear; the linear part is the Jacobian’s job.
“Matrix = function,” now with calculus. Ch.2 §1 said matrices are linear functions and their columns are the images of basis vectors. The Jacobian is the matrix of the linear function that best approximates a possibly nonlinear function near a point. So the column picture extends: column j of J(f)(x) is the rate at which f changes when you nudge input j. The j-th basis vector lands at the j-th column — same picture, different function. Backprop in Ch.9 is just chain-rule composition of these Jacobians at every layer.
Shapes always matter — and now you have the rule
Confusion between rows and columns of the Jacobian trips up many ML practitioners. The rule that prevents it:
For f : ℝⁿ → ℝᵐ, J(f) has shape m × n: one row per output, one column per input.
If you remember that Jacobians multiply by input vectors on the right (J · Δx produces an output-shaped vector), the convention falls out: J · Δx requires J to have as many columns as Δx has rows — i.e., n columns. The number of rows is whatever the output dimension is. Output dims = rows, input dims = columns. Same convention as nn.Linear(in_features, out_features), which stores a weight of shape (out, in) — see Ch.2 §1’s PyTorch grounding.
Example: for f : ℝ¹⁰⁰ → ℝ²⁰⁰, J(f) has shape 200 × 100 (m × n = output dims × input dims).
Δx is a 100-vector (input space). J·Δx is a (200 × 100) · (100 × 1) matrix–vector product → a 200-vector, living in the output space. Read it as: “apply the local linear approximation of f to this input nudge, get the corresponding output nudge.”
The Jacobian-vector product (JVP) is a primitive operation in every autodiff framework precisely because this shape pattern is the universal way to push small input changes through a nonlinear function to first order.
Make it run — analytical vs numerical
The numerical gradient from §1 generalises naturally: take the partial derivative numerically along each input axis. The kernel checks an analytical gradient against the numerical one for f(x, y) = (1 − x)² + 10(y − x²)² (the Rosenbrock function — the same one in the viz):
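#include <math.h>
#include <stdio.h>
/* Helpers and the tail of main are reconstructed (not in the original listing); h = 1e-5 is an assumed step. */
static double f(double x, double y) {
    return (1 - x) * (1 - x) + 10 * (y - x * x) * (y - x * x);
}
static void grad(double x, double y, double* gx, double* gy) {
    *gx = -2 * (1 - x) - 40 * x * (y - x * x); /* df/dx */
    *gy = 20 * (y - x * x);                    /* df/dy */
}
static void grad_numerical(double x, double y, double* gx, double* gy) {
    const double h = 1e-5; /* central difference along each axis */
    *gx = (f(x + h, y) - f(x - h, y)) / (2 * h);
    *gy = (f(x, y + h) - f(x, y - h)) / (2 * h);
}
int main(void) {
    struct { double x, y; const char* label; } pts[] = {
        { 0.0, 0.0, "origin" },
        { 1.0, 1.0, "minimum (1, 1)" },
        { -0.5, 1.5, "off-axis" },
        { 1.2, 1.0, "in the valley" },
    };
    int n = sizeof(pts) / sizeof(pts[0]);
    printf("Rosenbrock f(x, y) = (1 − x)² + 10 (y − x²)²\n");
    printf("%-22s %-22s %-22s %s\n",
           "point", "∇f analytical", "∇f numerical", "‖∇f‖");
    for (int i = 0; i < n; i++) {
        double gx, gy, ngx, ngy;
        grad(pts[i].x, pts[i].y, &gx, &gy);
        grad_numerical(pts[i].x, pts[i].y, &ngx, &ngy);
        double mag = sqrt(gx * gx + gy * gy);
        printf(" %s\n", pts[i].label);
        printf(" (%5.2f,%5.2f) (%8.4f,%8.4f) (%8.4f,%8.4f) %.4f\n",
               pts[i].x, pts[i].y, gx, gy, ngx, ngy, mag);
        double abs_err = sqrt((gx - ngx) * (gx - ngx) + (gy - ngy) * (gy - ngy));
        if (abs_err > 1e-4) {
            fprintf(stderr, " -> analytical/numerical disagree by %.3e\n", abs_err);
            return 1;
        }
    }
    printf("analytical ≈ numerical at every test point (within 1e-4)\n"
           "‖∇f(1,1)‖ ≈ 0 confirms (1,1) is a minimum (gradient vanishes)\n");
    return 0;
}
The output speaks plainly: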
Rosenbrock f(x, y) = (1 − x)² + 10 (y − x²)²
point ∇f analytical ∇f numerical ‖∇f‖
origin
( 0.00, 0.00) ( -2.0000, 0.0000) ( -2.0000, 0.0000) 2.0000
minimum (1, 1)
( 1.00, 1.00) ( -0.0000, 0.0000) ( 0.0000, 0.0000) 0.0000
off-axis
(-0.50, 1.50) ( 22.0000, 25.0000) ( 22.0000, 25.0000) 33.3017
in the valley
( 1.20, 1.00) ( 21.5200, -8.8000) ( 21.5200, -8.8000) 23.2497
analytical ≈ numerical at every test point (within 1e-4)
‖∇f(1,1)‖ ≈ 0 confirms (1,1) is a minimum (gradient vanishes)
Two things to notice. The gradient vanishes at the minimum — ∇f(1,1) ≈ 0 — which is the calculus statement of “you’re at a critical point.” Gradient descent stops moving here. The magnitude is large in steep regions — at (-0.5, 1.5), off in the rising wall of the function, ‖∇f‖ ≈ 33. The norm of the gradient is the local “steepness” — a meaningful diagnostic during training (huge gradient norms = your training is exploding; tiny ones = your loss landscape is flat = nothing is learning).
This pattern, where the analytical gradient agrees with the numerical gradient to within finite-difference and roundoff error, is the standard gradient check for testing custom backward implementations. PyTorch’s torch.autograd.gradcheck performs this kind of comparison for a custom Function.
The Jacobian J(f)(x) represents the best linear approximation of f near x. Concretely: for small input perturbations Δx,
f(x + Δx) ≈ f(x) + J(f)(x) · Δx.
So J(f)(x) is the matrix that turns “small change in input” into “small change in output,” to first order.
The j-th column of J is the image of the j-th basis vector under this linear function:
J · eⱼ = (∂f₁/∂xⱼ, ∂f₂/∂xⱼ, …, ∂fₘ/∂xⱼ)ᵀ.
That’s the rate at which the whole output vector changes when you nudge input j alone. The picture is: take input j, perturb it by a small dt; the output’s response trajectory has velocity vector = column j of J. Reading the Jacobian column-by-column tells you “how each input affects the output.” Reading it row-by-row tells you “how each output depends on the inputs” — these are the m gradients of the m output components. Same matrix, two readings, both useful in backprop.
END OF CH.4 §2 — Partials, gradient, Jacobian.
Built: GradientField viz (heatmap + gradient arrows; drag to query ∇f anywhere on four function landscapes); grad_check.c verifies the analytical Rosenbrock gradient against the numerical one and shows that ‖∇f‖ → 0 at the known minimum. Three recall items.
Coming next: §4.3 — The chain rule. Composing local Jacobians is the algebra under backprop.