MATRICES AS TRANSFORMATIONS
Section 2.1
01

Matrices as functions

You have written compose(f, g)(x) = f(g(x)) ten thousand times. You internalized function composition in your first language, made it the spine of your pipelines, and stopped thinking about it. Now I’m going to claim something that may sound off: that is what matrix multiplication is. And the matrix itself — the 2D array you’ve shoved through linear algebra problems — is just one way to write down a function. The data-structure mental model is the lie this section asks you to drop.

A matrix is a linear function from ℝⁿ to ℝᵐ. That’s the structural claim, full stop. The array of numbers you’ve written for years is one notation for that function — convenient because the function is uniquely determined by a small amount of data — but the function is the thing. Everything in linear algebra that’s ever confused you (why this dimension goes where, why composition orders right-to-left, why eigenvectors matter) follows from taking that one claim seriously.

What “linear” means, operationally

A function f: ℝⁿ → ℝᵐ is linear iff it respects scaling and addition:

f(α x + β y) = α f(x) + β f(y) for all vectors x, y and scalars α, β

Read it as two promises:

These two together are what we mean by linear. Most functions you’ve ever written are not linear — sin, exp, anything with a constant term, anything with branches. The linear ones form a thin, structured slice of all possible functions. The slice happens to be the slice on which all of classical machine learning and most of deep learning runs.

Every linear function is a matrix (and vice versa)

The deep fact: every linear function from ℝⁿ to ℝᵐ is exactly representable by an m × n matrix. Not “can be approximated by.” Not “can be written as.” Is. The correspondence is one-to-one. Pick a linear function, you get a unique matrix. Pick a matrix, you get a unique linear function. The two sides of the equivalence are interchangeable — and the matrix view is just the bookkeeping that makes the function easy to apply on a computer.

Why this works comes from one fact about ℝⁿ: every vector x decomposes uniquely as a linear combination of the standard basis vectors:

x = x₁ e₁ + x₂ e₂ + ⋯ + xₙ eₙ

…where eⱼ is the vector with a 1 in slot j and zeros elsewhere. So any vector in ℝⁿ is just a recipe: “how much of e₁, how much of e₂, …” — that’s literally what its coordinates are.

Now apply linearity, twice:

f(x) = f( x₁e₁ + x₂e₂ + ⋯ + xₙeₙ ) = x₁ f(e₁) + x₂ f(e₂) + ⋯ + xₙ f(eₙ)

Stare at that last line. It is the whole subject. To know f on every possible input, you only need to know f on the n standard basis vectors. Once you’ve written down those n output vectors — each living in ℝᵐ — applying f to any input x is just “blend the recorded outputs with the input’s coordinates as weights.”

So we record. We stack those n output vectors as the columns of an m × n array:

┌ ┐ A = │ f(e₁) f(e₂) ⋯ f(eₙ) │ ← n column vectors, each in ℝᵐ └ ┘ then f(x) = A x i.e. (Ax)ᵢ = Σⱼ Aᵢⱼ xⱼ ← the familiar mechanics, derived from the picture

That’s the matrix. It’s a tabulation. The columns are the function’s behavior on the basis. The mechanics of matrix-vector multiplication — “row times column dot products” — is what falls out when you compute the blend, but it is a consequence of the column picture, not the starting point.

See it move

The viz below is the column picture in motion. Pick a transformation; watch e₁ and e₂ land somewhere new; watch the matrix’s two columns update to be those landing positions. The “F” shape is along for the ride to show what the same function does to arbitrary points.

dotted = original basis · solid = where it lands
transformation
the matrix
0.87-0.50
0.500.87
column 1 = where e₁ lands   column 2 = where e₂ lands
Pick a transformation. The matrix on the right is *the function*. Its first column is wherever e₁ lands; its second column is wherever e₂ lands. Read columns ↔ basis-vector destinations — that's the whole picture.

Three things to feel as you scrub the sliders:

  1. The matrix’s columns are literally the destinations of the basis vectors. Drag θ to 90° — the first column reads [0, 1] because e₁ lands on the y-axis. Pick “scale” with sₓ = 2, sᵧ = 1 — the first column is [2, 0] because e₁ got stretched but stayed on its axis. The matrix is a snapshot of the function’s effect on the basis.
  2. Shear is linear; it just doesn’t preserve angles. It’s tempting to think “linear” means “rigid” or “rotation-like.” It doesn’t. A shear distorts the unit square into a parallelogram — but it still preserves vector sums and scaling, which is all the definition requires.
  3. Some columns can have negative entries. Pick a negative rotation angle or a negative scale. The basis vector lands across the axis; the column flips sign. The math handles it without complaint.

The unit square (dashed, dim) is the original picture; the filled “F” is the result. Watching both shapes transform tells you the function’s effect on every point — because every point in the plane is some x₁e₁ + x₂e₂, and linearity carries it along for the ride.

The classic transformations, each as a matrix

Three worth pinning to memory because they recur everywhere:

Rotation by θ: Scale (sₓ, sᵧ): Shear (factor k): ┌ cos θ −sin θ ┐ ┌ sₓ 0 ┐ ┌ 1 k ┐ │ │ │ │ │ │ └ sin θ cos θ ┘ └ 0 sᵧ ┘ └ 0 1 ┘

For each, work out where e₁ and e₂ go and confirm they match the columns:

Three matrices, three pictures, same rule. The mechanics weren’t memorized; they were derived by asking “where does this function send each basis vector?”

— think, then check —

A matrix is a linear function from ℝⁿ to ℝᵐ. The 2D array of numbers is one way to write that function down — it tabulates where the function sends the standard basis vectors, one per column. Everything else (matmul, eigenvectors, determinants) follows from taking that view seriously.

Now make it run — gemv as “function as code”

The compiled embodiment of “matrix as function” is gemv — general matrix-vector multiply, y = Ax. Each output element is one dot product of a row of A with the input x:

y[i] = ⟨ row_i(A) , x ⟩ for i = 0 .. m-1

So the kernel is — almost embarrassingly — a stack of the dot product we already built and dissected in §1.2:

gemv_avx.c AVX2 + FMA · x86-64
 */
#include <stddef.h>

float dot_avx(const float* a, const float* b, int n);

void gemv_avx(const float* A, const float* x, float* y, int m, int n) {
    for (int i = 0; i < m; i++) {

That’s the whole thing. m rows, m dot products, each one of length n. The function the matrix represents — applied. Here’s the NEON twin, running natively on this Apple Silicon host:

gemv_neon.c NEON · arm64

float dot_neon(const float* a, const float* b, int n);

void gemv_neon(const float* A, const float* x, float* y, int m, int n) {
    for (int i = 0; i < m; i++) {

The test harness for this section computes R(45°) · e_1 and checks it lands at (\cos 45°, \sin 45°) = (0.7071, 0.7071) — which is the first column of R(45°). The “column = where the basis vector lands” claim is now tested in CI:

test_gemv.c (excerpt) rotation sanity test

int main(void) {
    /* Rotation-of-e1 sanity check: R(45°) sends e1 -> (cos45, sin45). */
    const float deg = 45.0f;
    const float r   = deg * (float)M_PI / 180.0f;
    const float c = cosf(r), s = sinf(r);
    /* Row-major 2x2:  [[c, -s], [s, c]]  -- column 1 is (c, s), the image of e1. */
    float R[4] = { c, -s, s, c };
    float e1[2] = { 1.0f, 0.0f };
    float out[2] = { 0.0f, 0.0f };
    gemv_scalar(R, e1, out, 2, 2);
    if (!approx(out[0], c, 1e-5f) || !approx(out[1], s, 1e-5f)) {

This is the section’s whole point compressed into 12 lines: the matrix is the function, and the way you check that is to apply the function to a basis vector and confirm you get the column you expected. The same test runs for an arbitrary 64×256 matrix in the second half of the file, with the SIMD kernel checked against a scalar reference — but the rotation case is the one that means something.

Where this lives in real ML code

PyTorch’s nn.Linear is exactly the function we’ve been discussing, with one bookkeeping quirk worth knowing about: it stores the weight as shape (out_features, in_features) and computes y = x @ A.T + b inside forward() (pytorch/torch/nn/modules/linear.py). The transpose is purely an indexing convenience — keeps memory accesses contiguous for the common batched case — but mathematically it’s the same matrix-vector function we just built. When you write:

layer = nn.Linear(in_features=784, out_features=256)
y     = layer(x)         # x: (..., 784)  →  y: (..., 256)

…you’ve created a learnable linear function from ℝ⁷⁸⁴ to ℝ²⁵⁶. The 256 × 784 array of weights is the column picture — each of the 784 columns says where one input dimension lands in the 256-dim output space. Training the layer means adjusting where the basis vectors go. That’s the entire job.

Attention preview. When you reach Chapter 13, the operation Q = X W_Q — applying the “query projection” to your inputs — is exactly this. W_Q is a learned linear function; Q is the result of running every input vector through it. K = X W_K and V = X W_V are two more linear functions applied to the same inputs. Attention’s magic is in how the three outputs interact downstream — but the projections themselves are nothing but “matrix-as-function,” batched. Internalizing this section now makes Ch.13 a small step instead of a leap.

— think, then check —

Trick! A 3 × 2 matrix only has two columns — it represents a function from ℝ² to ℝ³, so it has 2 input dimensions (= columns) and 3 output dimensions (= rows). Each column is in ℝ³ and tells you where one of the two basis vectors of ℝ² lands. The phrasing of the question was an invitation to confuse rows and columns; in this book, rows ↔ output dimensions, columns ↔ input dimensions. m × n means m rows, n columns, function from ℝⁿ → ℝᵐ.

Matrix multiplication is function composition (preview)

We’ve been doing matrix-vector multiplication: f(x) = Ax. Now consider applying two linear functions in sequence — first B, then A. The composed function is (A ∘ B)(x) = A(Bx). By the same column-picture argument, that composed function is also linear, so it’s also a matrix. The matrix you get is exactly A · B — matrix multiplication. Composition order is preserved: rightmost applies first, then move left.

That observation is the full subject of §2.2, with all the indexing and the three independent axes that drive tiled gemm. For now the only thing to bank is this: the rule (AB)x = A(Bx) isn’t a coincidence of notation. The two sides describe the same composed function, and matmul is the bookkeeping that makes them agree.

— think, then check —

B is a function ℝᵏ → ℝⁿ. A is a function ℝⁿ → ℝᵐ. So A∘B is a function ℝᵏ → ℝᵐ — and its matrix representation is m × k. AB is an m × k matrix.

The dimension constraint — A’s number of columns must equal B’s number of rows — exists because composition requires B’s output space to match A’s input space. The “inner” dimension n is what the functions agree on; it’s the dimension B lands you in and the dimension A consumes. The matrix shapes are forced by the function-composition picture, not the other way around. This is why §2.2 cares about three independent axes — those are exactly the m, n, k of a matmul, and tiling each one independently is what FlashAttention does.

END OF CH.2 §1 — Matrices as functions.
Built: live MatrixAsFunction viz, compiled gemv kernels (AVX2 + NEON) with a rotation-sanity test that runs in CI, three recall prompts at easy / medium / hard.
Coming next: §2.2 — Matrix multiplication as composition; the three independent axes.