NEURONS, LAYERS, MLPS
Section 10.1
01

From perceptron to MLP

A neural network is a stack of two operations applied in alternation: a linear function (Ch.2 §1’s matrix = function) and a fixed pointwise nonlinearity. That’s the entire architecture. Everything from Rosenblatt’s 1958 perceptron through GPT-5 is a variation on “stack more of these, give them better nonlinearities, sometimes connect them with residual paths.” This section walks the lineage from the single perceptron — a linear classifier that famously can’t solve XOR — to the multilayer perceptron (MLP) and its universal-approximation property. The pattern that emerges in this section is the pattern of every later architecture: choose an interesting nonlinearity, choose how to stack, choose how the gradients flow. The transformer in Ch.13 is a particular highly-engineered version; the MLP here is its bare skeleton.

The perceptron — one neuron, one decision

The perceptron, in modern notation, is just a single linear classifier:

y = σ( w · x + b )

w, b are parameters (a weight vector and a scalar bias); σ is the sigmoid (Rosenblatt used a step function, but the idea is the same).

Geometrically, w · x + b = 0 is a hyperplane in ℝᵈ (a line in 2D, a plane in 3D). The classifier outputs class 1 on one side and class 0 on the other. A perceptron is a linear decision boundary.
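As a minimal sketch of the formula above, the snippet below evaluates a single perceptron in plain Python. The weights and the test points are hypothetical, chosen only so that the boundary is the line x₁ + x₂ = 1; the helper names `perceptron` and `classify` are illustrative, not from the text.

```python
import math

def perceptron(x, w, b):
    """Return sigma(w . x + b) for one input vector x."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def classify(x, w, b):
    """Class 1 on the non-negative side of the hyperplane w . x + b = 0, else class 0."""
    return 1 if perceptron(x, w, b) >= 0.5 else 0

# Hypothetical weights: the decision boundary is the line x1 + x2 = 1.
w, b = [1.0, 1.0], -1.0

print(classify([0.2, 0.3], w, b))    # 0  (below the line)
print(classify([0.9, 0.8], w, b))    # 1  (above the line)
print(perceptron([0.5, 0.5], w, b))  # 0.5  (exactly on the line)
```

Moving w rotates the line and moving b shifts it, but no choice of (w, b) bends it.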

This is enough to classify many things, since linearly-separable problems abound. But not XOR (Ch.9 §3), and not any problem where the right decision boundary is curved. The Minsky & Papert 1969 “Perceptrons” book proved this rigorously and argued NNs were a dead end. They were right about the perceptron and wrong about NNs in general, but it took the field nearly two decades (until backpropagation made multilayer networks trainable in the mid-1980s) to work out what was missing. The answer turned out to be: stack more linear classifiers with nonlinearities between them.

The MLP — same building blocks, stacked

A multilayer perceptron (MLP) is exactly what it sounds like — multiple perceptron-like units stacked:

h₁ = φ( W₁ · x + b₁ )        ← layer 1: linear, then nonlinearity φ
h₂ = φ( W₂ · h₁ + b₂ )       ← layer 2
…
y = W_L · h_{L-1} + b_L      ← output layer (often no final nonlinearity for regression, softmax for classification)

Each W_i is a learned matrix; each b_i is a learned bias vector; φ is the elementwise nonlinearity (the focus of §10.2). The depth L is the number of layers; the width of each layer is W_i’s number of rows. Modern transformers stack tens to over a hundred blocks; the inner width of their MLP sublayers is a small multiple of the embedding dimension (4× is the classic “FFN inner” rule, e.g. 16384 for a 4096-dim embedding, and some recent models use a smaller multiple such as 14336).
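To make the stacked definition concrete, here is a small sketch of an MLP forward pass in plain Python with φ = ReLU. The weights are hand-picked for illustration (not learned) so that the network computes XOR, the problem a single perceptron cannot solve; the names `mlp`, `linear` and `xor_net` are assumptions of this example.

```python
def relu(v):
    """Elementwise max(0, t)."""
    return [max(0.0, t) for t in v]

def linear(W, x, b):
    """Compute W . x + b, with W given as a list of rows."""
    return [sum(wij * xj for wij, xj in zip(row, x)) + bi
            for row, bi in zip(W, b)]

def mlp(x, layers):
    """layers is a list of (W, b) pairs; ReLU follows every layer except the last."""
    h = x
    for W, b in layers[:-1]:
        h = relu(linear(W, h, b))
    W_out, b_out = layers[-1]
    return linear(W_out, h, b_out)

# Hand-picked weights (illustrative, not trained):
#   h = relu([x1 + x2, x1 + x2 - 1]),  y = h1 - 2*h2
xor_net = [
    ([[1.0, 1.0], [1.0, 1.0]], [0.0, -1.0]),
    ([[1.0, -2.0]], [0.0]),
]

for x in ([0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]):
    print(x, mlp(x, xor_net)[0])  # 0.0, 1.0, 1.0, 0.0
```

The hidden layer here has width 2 (W₁ has two rows) and the depth is 2; a trained network would find weights like these by gradient descent rather than by hand.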

The viz below shows three classifiers on the same 2D “two moons” problem — a curving non-linearly-separable layout that’s a step harder than XOR:

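[Interactive figure: decision boundaries of three classifiers on the two-moons data. Readout for the linear classifier: model y = σ(w·x + b), 3 parameters, accuracy 85.0%.]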
The single linear classifier can only carve the space with a line — fundamentally incapable of separating the two interleaved moons. Adding a ReLU hidden layer gives piecewise-linear boundaries (each hidden unit contributes a half-plane that can bend the overall decision). Two hidden layers compose these bends into the curves needed for the moons. Depth is what lets the same family of building blocks represent qualitatively richer decision boundaries.
The two-moons problem on three classifiers. Class 0 is the teal moon (bottom-left), class 1 the ochre moon (top-right). The single linear classifier (perceptron) is a one-line decision rule — it can't separate them. Adding nonlinear hidden layers does. The qualitative jump from 'line' to 'curving boundary' is what makes MLPs interesting; the principle scales all the way up to a 405B-parameter transformer.

Click between the three. The linear classifier can only draw a line, and no line separates the moons, so its accuracy plateaus well short of perfect. The 1-hidden-layer MLP can carve a piecewise-linear boundary that fits the curve roughly. The 2-hidden-layer MLP gets it essentially perfect. That jump from “line” to “curving boundary” is what depth + nonlinearity buys you.

— think, then check —

The single perceptron computes y = σ(w·x + b). The decision boundary is where σ(z) = 0.5, i.e. where w·x + b = 0 — a straight line.

An MLP computes y = W₂·φ(W₁·x + b₁) + b₂. The first linear layer projects x into a hidden space; each ReLU unit introduces a kink along its zero-crossing (the hyperplane where its pre-activation is 0), and that is where the bending comes from. The second linear layer combines the bent intermediate outputs into the final score. The composition is no longer a single linear function: it is a piecewise-linear function, with one linear piece per region of input space on which the same set of ReLUs is active.

So MLPs gain expressivity through composition of linear maps + nonlinearities. Without the nonlinearity (e.g. φ = identity), MLPs collapse: W₂·(W₁·x + b₁) + b₂ = (W₂W₁)·x + (W₂b₁ + b₂) — still a single linear function. The nonlinearity is what prevents the layers from collapsing into one.
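The collapse identity can be checked numerically. The sketch below uses small, arbitrary integer weights (chosen only so that the comparison is exact) and verifies that two stacked linear layers equal the single layer with W = W₂W₁ and b = W₂b₁ + b₂, while inserting a ReLU breaks the equality. All names and values are illustrative.

```python
def matvec(W, x):
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]

def matmul(A, B):
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

# Arbitrary illustrative weights (integers, so arithmetic is exact).
W1, b1 = [[1, -2], [3, 0], [-1, 4]], [1, -1, 2]
W2, b2 = [[2, 1, -1], [0, -3, 1]], [5, -4]

def two_layer_identity(x):
    """phi = identity: W2 . (W1 . x + b1) + b2"""
    return add(matvec(W2, add(matvec(W1, x), b1)), b2)

# The collapsed single layer: W = W2 W1, b = W2 b1 + b2.
W = matmul(W2, W1)
b = add(matvec(W2, b1), b2)

def one_layer(x):
    return add(matvec(W, x), b)

def two_layer_relu(x):
    """phi = ReLU: the hidden layer can no longer be folded away."""
    h = [max(0, t) for t in add(matvec(W1, x), b1)]
    return add(matvec(W2, h), b2)

x = [2, 3]
print(two_layer_identity(x), one_layer(x))  # [-8, -7] [-8, -7]
print(two_layer_relu(x))                    # [-2, -7]
```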

The universal approximation theorem

Once you have stacked-linear-and-nonlinear units, an extremely strong result holds:

Theorem (Cybenko 1989; Hornik, Stinchcombe, White 1989; Hornik 1991): For any continuous function f: K → ℝᵐ on a compact domain K ⊆ ℝᵈ, and any tolerance ε > 0, there exists a 2-layer MLP ĥ(x) = W₂ · σ(W₁ · x + b₁) + b₂ with finite (but possibly very large) hidden width, such that sup_{x ∈ K} ‖f(x) − ĥ(x)‖ < ε. (The original proofs assume sigmoid-like σ; Leshno et al. 1993 extended the result to any continuous non-polynomial activation, so it also covers ReLU, GELU, etc.)

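This is the universal approximation theorem. It is an existence result: for any continuous function you can name, some MLP with one hidden layer approximates it to whatever accuracy you ask for. The theorem does not say how wide that hidden layer must be, whether gradient descent will find the weights, or how much training data is needed.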

What UA does give: existence of a solution is never the bottleneck. If your problem has a continuous functional form, some MLP can approximate it. The interesting questions are (a) how big an MLP, (b) can SGD find it, (c) how big a training set is needed.
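For intuition only, here is a one-dimensional sketch of why such networks exist. It builds, by hand, a one-hidden-layer ReLU network that linearly interpolates a target function at n + 1 evenly spaced knots, then estimates the sup error on a sample grid. This is not the proof technique of the cited papers; the target (sin on [0, π]), the construction and all helper names are assumptions of this example.

```python
import math

def build_relu_interpolant(f, lo, hi, n):
    """Return (knots, coeffs, offset) such that
    h(x) = offset + sum_k coeffs[k] * relu(x - knots[k])
    is the piecewise-linear interpolant of f at n + 1 evenly spaced knots."""
    xs = [lo + (hi - lo) * k / n for k in range(n + 1)]
    slopes = [(f(xs[k + 1]) - f(xs[k])) / (xs[k + 1] - xs[k]) for k in range(n)]
    # Each hidden unit changes the slope by the difference between neighbours.
    coeffs = [slopes[0]] + [slopes[k] - slopes[k - 1] for k in range(1, n)]
    return xs[:n], coeffs, f(lo)

def evaluate(net, x):
    knots, coeffs, offset = net
    return offset + sum(c * max(0.0, x - t) for c, t in zip(coeffs, knots))

def sup_error(f, net, lo, hi, samples=2000):
    """Estimate sup |f - h| on a grid (an estimate, not a certified bound)."""
    grid = [lo + (hi - lo) * i / samples for i in range(samples + 1)]
    return max(abs(f(x) - evaluate(net, x)) for x in grid)

errors = {}
for n in (4, 16, 64):  # n = hidden width
    net = build_relu_interpolant(math.sin, 0.0, math.pi, n)
    errors[n] = sup_error(math.sin, net, 0.0, math.pi)
    print(n, errors[n])
```

The error shrinks as the hidden width grows, which is the existence statement in miniature. The sketch says nothing about how fast width must grow in higher dimensions, or about whether training would find these weights.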

— think, then check —

(1) Width vs depth tradeoff. UA says a 2-layer MLP CAN approximate any continuous function, but for some functions (certain radial or highly oscillatory ones) the required hidden width is exponentially large. Deeper networks can represent the same functions with far fewer units. Eldan & Shamir 2016 exhibit a function on ℝᵈ that a 3-layer network of width polynomial in d approximates well but that no 2-layer network can approximate unless its width is exponential in d. Telgarsky 2016 (“Benefits of depth in neural networks”) gives a matching result in terms of depth: functions computed by Θ(k³)-layer networks of constant width need Ω(2ᵏ) units at depth O(k).

(2) Optimisation. UA is an existence result: it doesn’t say SGD can find the weights. In practice, deep networks (with the batchnorm/residual paths from Ch.14 / Ch.15) often optimise better than shallow ones with the same parameter count.

(3) Structural inductive bias. The right architecture for a problem encodes useful inductive biases: convolutional networks for translation invariance, transformers for content-dependent interactions between positions in a sequence. These structural choices live IN the architecture, not in a flat MLP. So we use deep, specialised architectures because they encode prior knowledge about the task. UA gets you existence, but the bias of the model shape gets you sample efficiency.

This is why post-2012 deep learning is ‘deep’ and not ‘wide’ — depth empirically buys both expressivity per parameter and trainability via well-designed gradient flow (residual connections, layer norm, careful init).
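The expressivity-per-parameter point can be seen in a toy sawtooth construction of the kind used in depth-separation arguments. In the sketch below (names and grid resolution are assumptions of this example), each layer is a "tent" map built from two ReLU units; composing k such layers uses 2k units but produces 2^(k−1) peaks. A one-hidden-layer ReLU network on a 1D input with n units has at most n kinks, so matching the same sawtooth exactly would take a number of units that grows exponentially in k.

```python
def relu(t):
    return max(0.0, t)

def tent(x):
    """One layer with two ReLU units: rises 0 -> 1 on [0, 0.5], falls 1 -> 0 on [0.5, 1]."""
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)

def deep_sawtooth(x, k):
    """Compose the tent layer k times (a depth-k network with 2k ReLU units)."""
    for _ in range(k):
        x = tent(x)
    return x

def count_peaks(k, resolution=1024):
    """Count grid points in [0, 1] where the depth-k sawtooth reaches 1.
    The grid is dyadic, so the arithmetic is exact for k <= 10."""
    grid = [i / resolution for i in range(resolution + 1)]
    return sum(1 for x in grid if deep_sawtooth(x, k) == 1.0)

for k in range(1, 6):
    print(k, 2 * k, count_peaks(k))  # depth, ReLU units, peaks: 1, 2, 4, 8, 16
```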

Counting parameters

A useful exercise — what does a “small MLP” cost?

MLP: d_in → h₁ → h₂ → … → h_L → d_out

Parameter count:
  layer 1:    d_in · h₁ + h₁       (W and b)
  layer 2:    h₁ · h₂ + h₂
  ...
  layer L+1:  h_L · d_out + d_out

For an MNIST classifier (28² → 128 → 10):
  784·128 + 128 + 128·10 + 10 = 101,770 params

For a transformer block's MLP sublayer (4096 → 16384 → 4096):
  4096·16384 + 16384 + 16384·4096 + 4096 = 134,238,208 params ≈ 134M

For a 32-layer model with that-shape MLP sublayers: ~4.3B params just in MLPs.
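The counting rule is easy to turn into a helper. The sketch below assumes every layer has a weight matrix and a bias vector, as in the formula above; the function name `mlp_param_count` is illustrative.

```python
def mlp_param_count(sizes):
    """sizes = [d_in, h1, ..., d_out]; each layer contributes
    a weight matrix (n_in * n_out) and a bias vector (n_out)."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))

mnist = mlp_param_count([784, 128, 10])
ffn = mlp_param_count([4096, 16384, 4096])

print(mnist)     # 101770
print(ffn)       # 134238208
print(32 * ffn)  # 4295622656, i.e. about 4.3B across 32 such sublayers
```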

The MLP sublayer is where most of a transformer’s parameters live, typically about 2/3 of the per-block total. Attention is a smaller fraction; embeddings depend on vocabulary size. Whenever you hear “the LLM is 70B parameters,” roughly two-thirds of those parameters are inside MLP weight matrices.
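A back-of-envelope check of the MLP share, under stated assumptions that are not spelled out in the text: attention uses four d × d projections (query, key, value, output), the MLP uses an up- and a down-projection with inner width 4d, and biases, norms and embeddings are ignored. Real models deviate from this (gated MLPs, grouped-query attention), so treat the result as a rough guide.

```python
def mlp_share_of_block(d_model, ffn_mult=4):
    """Fraction of a block's weight-matrix parameters that sit in the MLP,
    assuming 4 attention projections of size d x d and a 2-matrix MLP."""
    attention = 4 * d_model * d_model
    mlp = 2 * d_model * (ffn_mult * d_model)
    return mlp / (attention + mlp)

print(mlp_share_of_block(4096))  # about 0.667
```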

The thing perceptron and MLP have in common with everything later

Stand back. Perceptron: one linear unit. MLP: stack of (linear + nonlinearity). Transformer: stack of (linear + attention + linear + nonlinearity), i.e. the same building blocks plus the attention mechanism. Modern frontier models are at heart the perceptron architecture repeated tens of times with nonlinearities and attention between layers. The structural innovation between 1958 and 2025 lies mostly in what gets composed, not in the fact that composition is the core operation.

This is why “matrices are functions” (Ch.2 §1) was the load-bearing claim of Part I: every neural network architecture is a composition of matrix-functions with nonlinearities. The architecture catalogue (CNN, RNN, Transformer, Mamba) is a catalogue of which matrix-functions and how they’re composed, not a fundamentally different mathematical object.

— think, then check —

The MLP sublayer is where the model does per-token transformations of representations. Each token’s hidden vector (4096+ dims) gets pushed through a dense 4096 → 16384 → 4096 stack — two big matmuls and a nonlinearity in between.

From Ch.6 §3: in a 4096-dim space, you can pack exp(c · 4096) near-orthogonal directions, so astronomically many ‘features’ could be encoded as superposed near-orthogonal vectors. Anthropic’s superposition work (Elhage et al. 2022) showed in toy models that networks DO exploit this capacity, representing more features than they have neurons when the features are sparse.
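The near-orthogonality behind that packing argument is easy to observe. The sketch below draws a few random unit vectors and reports the largest absolute cosine similarity between any pair; it illustrates that random directions become nearly orthogonal as the dimension grows, not the exp(c · d) count itself. The seed, vector count and helper names are arbitrary choices for this example.

```python
import math
import random

def random_unit_vector(d, rng):
    """Sample a direction uniformly by normalising a Gaussian vector."""
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(t * t for t in v))
    return [t / norm for t in v]

def max_abs_cosine(d, count, seed=0):
    """Largest |cosine similarity| among `count` random unit vectors in d dims."""
    rng = random.Random(seed)
    vs = [random_unit_vector(d, rng) for _ in range(count)]
    return max(abs(sum(a * b for a, b in zip(vs[i], vs[j])))
               for i in range(count) for j in range(i + 1, count))

for d in (16, 256, 4096):
    # Typical pairwise cosine is on the order of 1/sqrt(d).
    print(d, round(max_abs_cosine(d, 20), 3))
```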

From §10.1: a 2-layer MLP can represent any continuous function on a compact domain (UA theorem). For the per-token transformation ‘take this representation and produce the next-layer representation,’ a 2-layer MLP is the minimum-viable universal function approximator.

Combining: the MLP sublayer’s job is to compute, per token, a near-arbitrary function of its 4096-dim input to produce its 4096-dim output. The exponential-packing capacity of 4096-dim space plus the UA theorem suggests this calls for a large dense MLP. The ~134M parameters per MLP sublayer aren’t waste: they are the capacity budget for a near-universal function on 4096 → 4096 that can handle all the features the layer’s input might carry. Stack 32 such MLPs, plus attention between them (Ch.13), and you have a transformer.

The architecture choice ‘most params in MLPs’ is therefore a natural allocation given (a) representational capacity from high-D geometry and (b) universal approximation. Attention is where the model decides which tokens interact; MLPs are where it transforms each token’s representation. Both are needed; MLPs are the more parameter-heavy because per-token universal function approximation needs more than per-token attention routing does.

END OF CH.10 §1 — From perceptron to MLP.
Built: DecisionBoundary viz (three classifiers on the same two-moons problem; click between them to see the decision boundary qualitatively change). Three recall items: easy (why depth matters), medium (three reasons modern networks are deep despite UA), hard (parameter allocation argument combining UA + high-D capacity).
Coming next: §10.2 — Activation functions. ReLU was the breakthrough; GELU, SiLU, and SwiGLU are the modern transformer choices.