From perceptron to MLP

Section 10.1

A neural network is a stack of two operations applied in alternation: a linear function (Ch.2 §1’s matrix = function) and a fixed pointwise nonlinearity. That’s the entire architecture. Everything from Rosenblatt’s 1958 perceptron through GPT-5 is a variation on “stack more of these, give them better nonlinearities, sometimes connect them with residual paths.” This section walks the lineage from the single perceptron — a linear classifier that famously can’t solve XOR — to the multilayer perceptron (MLP) and its universal-approximation property. The pattern that emerges in this section is the pattern of every later architecture: choose an interesting nonlinearity, choose how to stack, choose how the gradients flow. The transformer in Ch.13 is a particular highly-engineered version; the MLP here is its bare skeleton.

The perceptron — one neuron, one decision

Perceptron (historical model). A single-neuron linear classifier introduced by Rosenblatt in 1958. It computes y = σ(w·x + b) (Rosenblatt's original form used a step function instead of σ). It is trained by an online weight-update rule that converges if and only if the data is linearly separable. It was the first 'neural network' anyone called by that name, and the model whose limitations (no XOR; Minsky & Papert 1969) helped end the first AI boom. Then → now: 'perceptron' meant Rosenblatt's specific 1958 model. Today it usually means 'one linear unit' in a network; modern usage drops the historical baggage. The 'multilayer perceptron' (MLP) name survives, but you'll equally see 'feedforward network' or 'dense network.'

The perceptron, in modern notation, is just a single linear classifier:

y = σ( w · x + b )

w, b are parameters (a weight vector and a scalar bias); σ is the sigmoid (Rosenblatt used a step function; same idea).

Geometrically, w · x + b = 0 is a hyperplane in ℝᵈ (a line in 2D, a plane in 3D). The classifier outputs class 1 on one side and class 0 on the other. A perceptron is a linear decision boundary.
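As a minimal sketch of the geometry just described, the snippet below evaluates a perceptron on a few 2D points. The weights, bias, and sample points are made-up values chosen for illustration; the only facts it relies on are that the score w·x + b is signed by which side of the hyperplane a point lies on, and that σ(z) ≥ 0.5 exactly when z ≥ 0.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def perceptron_predict(X, w, b):
    """Return (probabilities, hard labels) for the rows of X."""
    z = X @ w + b                      # signed score: its sign says which side of the hyperplane
    p = sigmoid(z)
    return p, (z >= 0).astype(int)     # sigmoid(z) >= 0.5 exactly when z >= 0

# Hypothetical 2D example: the boundary is the line x1 + x2 - 1 = 0.
w = np.array([1.0, 1.0])
b = -1.0
X = np.array([[0.0, 0.0], [2.0, 2.0], [0.5, 0.5]])

probs, labels = perceptron_predict(X, w, b)
print(labels)   # [0 1 1]
```

The third point sits exactly on the boundary, so its probability is 0.5; which class a tie goes to is a convention (here, class 1).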

This is enough to classify many things — linearly-separable problems abound. But not XOR (Ch.9 §3), and not any problem where the right decision boundary is curved. The Minsky & Papert 1969 “Perceptrons” book proved this rigorously and argued NNs were a dead end. They were right about the perceptron and wrong about NNs in general — but it took 20 years for the field to figure out what was missing. The answer turned out to be: stack more linear classifiers with nonlinearities between them.
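The separable-versus-XOR contrast can be checked directly with a small sketch of the online perceptron update rule, using a step activation as in Rosenblatt's form. The function name, the zero initialisation, and the epoch cap are illustrative assumptions, not a reconstruction of any particular historical implementation.

```python
import numpy as np

def train_perceptron(X, y, epochs=100):
    """Online perceptron updates with a step activation; labels y are 0 or 1.
    Returns (w, b, converged), where converged means one full pass with no mistakes."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            pred = 1 if xi @ w + b > 0 else 0
            if pred != yi:
                # Nudge the hyperplane toward classifying this point correctly.
                w += (yi - pred) * xi
                b += (yi - pred)
                mistakes += 1
        if mistakes == 0:
            return w, b, True
    return w, b, False

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_and = np.array([0, 0, 0, 1])   # linearly separable
y_xor = np.array([0, 1, 1, 0])   # not linearly separable

_, _, and_ok = train_perceptron(X, y_and)
_, _, xor_ok = train_perceptron(X, y_xor)
print(and_ok, xor_ok)   # True False
```

On AND the rule settles after a handful of passes. On XOR it can never complete a mistake-free pass, because that would require a line separating the four points.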

The MLP — same building blocks, stacked

Multilayer perceptron (core architecture). An MLP is a stack of (linear layer → fixed elementwise nonlinearity) units, with an output layer at the end. It is the dense building block of nearly every neural network architecture: transformers have MLPs inside each block (the 'FFN' or 'MLP' sublayer); convnets have them in their final classifier heads; image-classification ResNets are MLP-shaped with convolutional weight matrices. The universal-approximation theorem (Cybenko 1989, Hornik 1991) proved that even a one-hidden-layer MLP with enough units can approximate any continuous function on a compact domain to arbitrary precision.

A multilayer perceptron (MLP) is exactly what it sounds like, multiple perceptron-like units stacked:

h₁ = φ( W₁ · x + b₁ )        ← layer 1: linear, then nonlinearity φ
h₂ = φ( W₂ · h₁ + b₂ )       ← layer 2
…
y = W_L · h_{L-1} + b_L      ← output layer (often no final nonlinearity for regression, softmax for classification)

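Each W_i is a learned matrix; each b_i is a learned bias vector; φ is the elementwise nonlinearity (the focus of §10.2). The depth L is the number of layers; the width of each layer is W_i’s number of rows. Modern large transformers stack from a few dozen to over a hundred blocks; with embedding dims in the thousands, their MLP sublayers have inner widths in the tens of thousands (4× the embedding dim is the classic “FFN inner” rule, e.g. 4096 → 16384; gated variants often use a somewhat smaller multiple, e.g. 4096 → 14336).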
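The layer equations translate almost line for line into NumPy. The sketch below is one possible rendering: the helper names, the He-style random initialisation, and the layer sizes are assumptions made for the example, not something the section prescribes.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def init_mlp(sizes, rng):
    """sizes = [d_in, h1, ..., d_out]; returns a list of (W, b) pairs.
    Each W has shape (out, in), so its row count is that layer's width."""
    return [
        (rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in)), np.zeros(n_out))
        for n_in, n_out in zip(sizes[:-1], sizes[1:])
    ]

def mlp_forward(params, x, phi=relu):
    h = x
    for W, b in params[:-1]:
        h = phi(W @ h + b)        # hidden layer: linear, then nonlinearity
    W_out, b_out = params[-1]
    return W_out @ h + b_out      # output layer: linear only

rng = np.random.default_rng(0)
params = init_mlp([2, 16, 16, 1], rng)   # two hidden layers of width 16
y = mlp_forward(params, np.array([0.3, -0.7]))
print(y.shape)   # (1,)
```

Swapping `phi` is all it takes to try a different nonlinearity, which is why the activation can be treated as a separate design choice from the stacking.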

The viz below shows three classifiers on the same 2D “two moons” problem — a curving non-linearly-separable layout that’s a step harder than XOR:

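[Interactive figure: decision boundaries of three classifiers on the two-moons data. Readout for the linear classifier: model y = σ(w·x + b), 3 parameters, accuracy 85.0%.]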

The single linear classifier can only carve the space with a line — fundamentally incapable of separating the two interleaved moons. Adding a ReLU hidden layer gives piecewise-linear boundaries (each hidden unit contributes a half-plane that can bend the overall decision). Two hidden layers compose these bends into the curves needed for the moons. Depth is what lets the same family of building blocks represent qualitatively richer decision boundaries.

The two-moons problem on three classifiers. Class 0 is the teal moon (bottom-left), class 1 the ochre moon (top-right). The single linear classifier (perceptron) is a one-line decision rule — it can't separate them. Adding nonlinear hidden layers does. The qualitative jump from 'line' to 'curving boundary' is what makes MLPs interesting; the principle scales all the way up to a 405B-parameter transformer.

Click between the three. The linear classifier is hopeless: it can only draw a line, and there’s no line that separates the moons. The 1-hidden-layer MLP can carve a piecewise-linear boundary that fits the curve roughly. The 2-hidden-layer MLP gets it essentially perfect. That jump from “line” to “curving boundary” is what depth + nonlinearity buys you.

— think, then check —

The single perceptron computes y = σ(w·x + b). The decision boundary is where σ(z) = 0.5, i.e. where w·x + b = 0 — a straight line.

An MLP computes y = W₂·φ(W₁·x + b₁) + b₂. The first linear layer projects x into a hidden space; the ReLU (or whatever φ is) introduces a kink at each unit’s zero-crossing, and that’s where the curving comes from. The second linear layer combines the bent intermediate outputs into the final score. The composition is no longer a single linear function: it’s a piecewise-linear function whose ‘pieces’ are the input regions on which the set of active ReLUs stays fixed, with each unit’s zero-crossing hyperplane forming a boundary between pieces.
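To make the "kink, then recombine" mechanism concrete, here is a hand-built one-hidden-layer ReLU network that computes XOR exactly on the four binary inputs. The weights are chosen by hand for illustration (they are not learned, and they are not the only solution): the network computes relu(x₁ + x₂) − 2·relu(x₁ + x₂ − 1).

```python
import numpy as np

# y = W2 · relu(W1 · x + b1) + b2 with two hidden units.
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])
b1 = np.array([0.0, -1.0])     # second unit only switches on once x1 + x2 > 1
W2 = np.array([[1.0, -2.0]])   # subtract twice the second unit to bend the output back down
b2 = np.array([0.0])

def xor_mlp(x):
    h = np.maximum(W1 @ x + b1, 0.0)
    return W2 @ h + b2

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
outputs = [float(xor_mlp(np.array(p, dtype=float))[0]) for p in inputs]
print(outputs)   # [0.0, 1.0, 1.0, 0.0]
```

The second hidden unit's zero-crossing is the kink: below it the output rises with x₁ + x₂, above it the output falls, which no single linear unit can do.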

So MLPs gain expressivity through composition of linear maps + nonlinearities. Without the nonlinearity (e.g. φ = identity), MLPs collapse: W₂·(W₁·x + b₁) + b₂ = (W₂W₁)·x + (W₂b₁ + b₂) — still a single linear function. The nonlinearity is what prevents the layers from collapsing into one.
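The collapse identity is easy to confirm numerically. The sketch below uses arbitrary random matrices (the shapes and the seed are illustrative assumptions) and checks that two stacked linear layers with φ = identity agree with the single merged layer (W₂W₁, W₂b₁ + b₂).

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(5, 3)), rng.normal(size=5)
W2, b2 = rng.normal(size=(2, 5)), rng.normal(size=2)
x = rng.normal(size=3)

# Two layers with phi = identity ...
two_layers = W2 @ (W1 @ x + b1) + b2

# ... equal one layer with merged weights and bias.
W_merged = W2 @ W1
b_merged = W2 @ b1 + b2
one_layer = W_merged @ x + b_merged

print(np.allclose(two_layers, one_layer))   # True
```

No analogous merge exists once a ReLU sits between the layers, because which rows of W₁ contribute then depends on x.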

↳ §10.1 depth

The universal approximation theorem

Once you have stacked-linear-and-nonlinear units, an extremely strong result holds:

Theorem (Cybenko 1989; Hornik, Stinchcombe, White 1989; Hornik 1991; Leshno et al. 1993): For any continuous function f: K → ℝᵐ on a compact domain K ⊆ ℝᵈ, and any tolerance ε > 0, there exists a 2-layer MLP ĥ(x) = W₂ · σ(W₁ · x + b₁) + b₂ with finite (but possibly very large) hidden width, such that sup_{x ∈ K} ‖f(x) − ĥ(x)‖ < ε. (σ can be any continuous non-polynomial activation, such as sigmoid, ReLU, or GELU; the theorem is robust to the choice.)

Universal approximation theorem (theory result). A 2-layer MLP (one hidden layer) with enough hidden units can approximate any continuous function on a compact domain to arbitrary precision. Cybenko 1989 proved it for sigmoid activations; Hornik 1991 for any bounded, nonconstant continuous activation; Leshno et al. 1993 for any continuous non-polynomial activation. It justifies the existence of NN solutions for arbitrary tasks. It does NOT say how to find them, or how many units are needed (sometimes exponentially many when width has to compensate for missing depth).

This is the universal approximation theorem. It is an existence statement: for any continuous function you can name, some MLP with one hidden layer represents it to whatever accuracy you ask for. The theorem does not say:

How many units are needed. Sometimes exponentially many in the input dimension; sometimes just a few. The bound is rarely useful in practice.
How to find the weights. Convergence of training to a good approximation is a separate problem (Ch.8) and is not guaranteed.
Whether the resulting model generalises. UA is about approximating a known f everywhere on K; it says nothing about learning f from finitely many samples. Whether the learned MLP performs well on points outside the training data is the generalisation question (Ch.8 §1).

What UA does give: existence of a solution is never the bottleneck. If your problem has a continuous functional form, some MLP can approximate it. The interesting questions are (a) how big an MLP, (b) can SGD find it, (c) how big a training set is needed.
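In one dimension the existence claim can be made constructive. The sketch below builds, with no training at all, a one-hidden-layer ReLU network that linearly interpolates a target function at evenly spaced knots, then measures the sup error on a grid as the hidden width grows. The target (sin on [0, 2π]), the knot placement, and the function names are illustrative assumptions; this is not the proof technique of the cited papers, only a demonstration that "enough units" drives the error down.

```python
import numpy as np

def relu_interpolant(f, a, b, n_hidden):
    """Weights of a one-hidden-layer ReLU net, y = W2 · relu(W1 · x + b1) + b2,
    that linearly interpolates f at n_hidden + 1 evenly spaced knots on [a, b]."""
    knots = np.linspace(a, b, n_hidden + 1)
    vals = f(knots)
    slopes = np.diff(vals) / np.diff(knots)
    W1 = np.ones((n_hidden, 1))
    b1 = -knots[:-1]                               # unit k switches on at knot k
    W2 = np.diff(slopes, prepend=0.0)[None, :]     # change of slope at each knot
    b2 = vals[0]
    return W1, b1, W2, b2

def net(x, W1, b1, W2, b2):
    h = np.maximum(W1 @ x[None, :] + b1[:, None], 0.0)   # shape (n_hidden, n_points)
    return (W2 @ h)[0] + b2

xs = np.linspace(0.0, 2 * np.pi, 2001)
errors = {}
for n in (4, 16, 64, 256):
    params = relu_interpolant(np.sin, 0.0, 2 * np.pi, n)
    errors[n] = float(np.max(np.abs(np.sin(xs) - net(xs, *params))))

for n, err in errors.items():
    print(n, err)   # the sup error shrinks as the hidden width grows
```

The same recipe in d dimensions needs a grid of knots whose size grows exponentially with d, which is one way to see why the width bound is rarely useful in practice.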

— think, then check —

(1) Width vs depth tradeoff. UA says a 2-layer MLP CAN represent any continuous function, but for some functions (certain radial and compositional functions) the required hidden width is exponentially large in the input dimension. Deeper networks can represent the same functions with width polynomial in the input dimension. Eldan & Shamir 2016 (“The Power of Depth for Feedforward Neural Networks”) formalise this: there exist functions on ℝᵈ where 2-layer MLPs need exp(d) units but 3-layer MLPs need only poly(d). Telgarsky 2016 (“Benefits of depth in neural networks”) proves a companion separation in depth: some functions computed by deep, narrow networks require exponentially many units in much shallower ones.

(2) Optimisation. UA is an existence result; it doesn’t say SGD can find the weights. Deep networks often optimise better in practice (with batchnorm/residual paths from Ch.14 / Ch.15) than shallow ones with the same parameter count.

(3) Structural inductive bias. The right architecture for a problem encodes useful inductive biases: convolutional networks for translation equivariance, transformers for content-dependent interactions between sequence positions. These structural choices live IN the architecture, not in a flat MLP. So we use deep, specialised architectures because they encode prior knowledge about the task. UA gets you existence, but the bias of the model shape gets you sample efficiency.

This is why post-2012 deep learning is ‘deep’ and not ‘wide’ — depth empirically buys both expressivity per parameter and trainability via well-designed gradient flow (residual connections, layer norm, careful init).
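Point (1) above can be illustrated with a sawtooth construction in the spirit of Telgarsky's argument. A "tent" function needs only two ReLU units; composing it with itself k times (k layers, 2 units each) yields 2ᵏ linear pieces, whereas a one-hidden-layer ReLU network with n units on a 1D input has at most n + 1 pieces. The grid size and the piece-counting helper below are assumptions made for this sketch.

```python
import numpy as np

def tent(x):
    # A 2-unit ReLU layer: rises 0 -> 1 on [0, 0.5], then falls 1 -> 0 on [0.5, 1].
    return 2.0 * np.maximum(x, 0.0) - 4.0 * np.maximum(x - 0.5, 0.0)

def count_linear_pieces(y):
    """Count maximal runs of constant slope sign in the samples y."""
    signs = np.sign(np.diff(y))
    return 1 + int(np.sum(signs[1:] != signs[:-1]))

xs = np.linspace(0.0, 1.0, 2**12 + 1)   # dyadic grid, so every kink lands on a sample
pieces = []
y = xs
for depth in range(1, 7):
    y = tent(y)                          # one more layer of the same 2-unit block
    pieces.append(count_linear_pieces(y))

print(pieces)   # [2, 4, 8, 16, 32, 64]
```

Six layers use 12 hidden units in total; matching 64 pieces with a single hidden layer would take at least 63.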

↳ §10.1 UA theorem

Counting parameters

A useful exercise — what does a “small MLP” cost?

MLP: d_in → h₁ → h₂ → … → h_L → d_out

Parameter count:
  layer 1:   d_in · h₁ + h₁   (W and b)
  layer 2:   h₁ · h₂ + h₂
  ...
  layer L+1: h_L · d_out + d_out

For an MNIST classifier (28² → 128 → 10):
  784·128 + 128 + 128·10 + 10 = 101,770 params

For a transformer block's MLP sublayer (4096 → 16384 → 4096):
  4096·16384 + 16384 + 16384·4096 + 4096 = 134,238,208 params ≈ 134M

For a 32-layer model with that-shape MLP sublayers: ~4.3B params just in MLPs.
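The per-layer rule (weights plus biases) fits in a one-line helper, which is a convenient way to re-derive the totals above. The function name is an illustrative choice; it assumes every layer is fully connected and has a bias vector.

```python
def mlp_param_count(sizes):
    """sizes = [d_in, h1, ..., d_out]; sums (weights + biases) over consecutive layer pairs."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes[:-1], sizes[1:]))

mnist = mlp_param_count([784, 128, 10])
ffn = mlp_param_count([4096, 16384, 4096])

print(mnist)       # 101770
print(ffn)         # 134238208
print(32 * ffn)    # 4295622656
```

Many transformer implementations drop the bias vectors in the MLP sublayer; that changes the count only by the 20,480 bias terms per sublayer.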

The MLP sublayer is where most of a transformer’s parameters live, typically about 2/3 of the per-block total under the 4× rule. Attention is a smaller fraction; embeddings depend on vocabulary size. Whenever you hear “the LLM is 70B parameters,” well over half of those parameters (roughly two-thirds) are inside MLP weight matrices.
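The two-thirds figure follows from a back-of-envelope split of one block. The sketch below assumes a plain block with four d × d attention projections (query, key, value, output) and a two-matrix MLP with inner width 4d, and it ignores biases, normalisation parameters, and embeddings; real models with grouped-query attention or gated MLPs will shift the ratio.

```python
def block_param_split(d_model, ffn_mult=4):
    """Weight counts for one transformer block under the simplifying assumptions above."""
    attn = 4 * d_model * d_model                 # Q, K, V and output projections
    mlp = 2 * d_model * (ffn_mult * d_model)     # up-projection and down-projection
    return attn, mlp, mlp / (attn + mlp)

attn, mlp, frac = block_param_split(4096)
print(attn, mlp, round(frac, 3))   # 67108864 134217728 0.667
```

The ratio 8d² / (4d² + 8d²) = 2/3 does not depend on d, so under these assumptions it holds at every model scale.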

The thing perceptron and MLP have in common with everything later

Stand back. Perceptron: one linear unit. MLP: stack of (linear + nonlinearity). Transformer: stack of (linear + attention + linear + nonlinearity), the same building blocks plus the attention mechanism. Modern frontier models are at heart the perceptron architecture repeated tens of times with nonlinearities and attention between layers; the structural innovation between 1958 and 2025 is mostly in what gets composed, not in the fact that composition is the core operation.

This is why “matrices are functions” (Ch.2 §1) was the load-bearing claim of Part I: every neural network architecture is a composition of matrix-functions with nonlinearities. The architecture catalogue (CNN, RNN, Transformer, Mamba) is a catalogue of which matrix-functions and how they’re composed, not a fundamentally different mathematical object.

— think, then check —

The MLP sublayer is where the model does per-token transformations of representations. Each token’s hidden vector (4096+ dims) gets pushed through a dense 4096 → 16384 → 4096 stack — two big matmuls and a nonlinearity in between.

From Ch.6 §3: in a 4096-dim space, you can pack exp(c · 4096) near-orthogonal directions, so astronomically many ‘features’ could be encoded as superposed near-orthogonal vectors. Anthropic’s superposition work (Elhage et al. 2022) showed in toy models that NN hidden layers DO exploit this capacity: a layer can represent more features than it has neurons, especially when the features are sparse.
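The near-orthogonality claim can be probed with random directions. The sketch below draws four times more unit vectors than there are dimensions and looks at their pairwise cosines. It uses d = 256 rather than 4096 purely to keep the example fast, and the seed and counts are arbitrary choices; it illustrates the geometry only and says nothing about what a trained network actually stores.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 256, 1024                                   # 4x more vectors than dimensions
V = rng.normal(size=(n, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)      # normalise rows to unit length

cos = V @ V.T                                      # pairwise cosine similarities
off_diag = np.abs(cos[~np.eye(n, dtype=bool)])     # drop each vector's similarity with itself

typical = float(off_diag.mean())
max_abs_cos = float(off_diag.max())
print(typical, max_abs_cos)   # both well below 1: the directions interfere only mildly
```

Pairwise cosines of random unit vectors have standard deviation 1/√d, so the interference shrinks as the dimension grows, which is what leaves room for many more directions at d = 4096.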

From §10.1: a 2-layer MLP can represent any continuous function on a compact domain (UA theorem). For the per-token transformation ‘take this representation and produce the next-layer representation,’ a 2-layer MLP is the minimum-viable universal function approximator.

Combining: the MLP sublayer’s job is to compute, per token, a near-arbitrary function of its 4096-dim input to produce its 4096-dim output. The exponential-packing capacity of 4096-dim space plus the UA theorem suggest why this calls for a large dense MLP. The 134M parameters per MLP sublayer aren’t waste: they buy a near-universal function on 4096 → 4096 with enough capacity to encode all the features the layer’s input might carry. Stack 32 such MLPs, plus attention between them (Ch.13), and you have a transformer.

The architecture choice ‘most params in MLPs’ is therefore a well-motivated allocation given (a) representational capacity from high-D geometry and (b) universal approximation. Attention is where the model decides which tokens interact; MLPs are where it transforms each token’s representation. Both are needed; MLPs are the more parameter-heavy because per-token universal function approximation needs more than per-token attention routing does.

↳ §10.1 parameter counting + Ch.5 §3 capacity argument

END OF CH.10 §1 — From perceptron to MLP.
Built: DecisionBoundary viz (three classifiers on the same two-moons problem; click between them to see the decision boundary qualitatively change). Three recall items: easy (why depth matters), medium (three reasons modern networks are deep despite UA), hard (parameter allocation argument combining UA + high-D capacity).
Coming next: §10.2 — Activation functions. ReLU was the breakthrough; GELU, SiLU, and SwiGLU are the modern transformer choices.