LayerNorm — derivation and what it actually does
Every transformer wraps each attention block and FFN block with a normalisation operation. Without it, training is unstable: activations drift to large magnitudes layer by layer, the softmax in attention saturates, and gradients vanish or explode. Ba et al. (2016), “Layer Normalization”, introduced LayerNorm as the fix that finally worked for sequence models. Unlike BatchNorm (Ioffe & Szegedy 2015), which normalises each feature across the batch dimension, LayerNorm normalises across the feature dimension of each token independently. The operational consequence: LayerNorm works at batch size 1, behaves identically during training and inference, and handles variable-length sequences. These properties are why transformers adopted it.
The formula
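For a single token’s activation vector x ∈ ℝ^d (one row of the activations matrix), LayerNorm computes y = γ ⊙ (x − μ) / √(σ² + ε) + β, where μ and σ² are the mean and variance of the d entries of x, ε is a small constant, and γ, β ∈ ℝ^d are learned.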
Two-stage operation: normalise (parameter-free, per-token), then affine (learned, per-feature). The normalise stage is what stabilises the activations; the affine stage is what gives the layer expressivity back — without γ, β the network would lose the ability to represent any non-unit-variance, non-zero-mean intermediate distribution.
LayerNorm’s scope is one token’s activation vector. The statistics μ and σ² come from that one vector’s d feature values; they have nothing to do with other tokens in the same batch or sequence.
Why per-feature γ and β
Both γ and β are vectors of length d (one entry per feature dimension), and they’re learned. Why?
Without the affine stage, every LayerNorm output has zero mean and unit variance per token — that’s a hard structural constraint. But the network might need certain features to be larger than others (e.g., positional features dominating semantic ones, or one head’s contribution being weighted differently), or to have non-zero mean (e.g., the residual stream encoding a sentence-level bias). The γ, β stage gives the network back this freedom: it can choose to keep the unit-variance constraint (set γ = 1, β = 0) or scale specific features up/down (γ_j ≠ 1) or shift them (β_j ≠ 0).
The expressivity argument: without γ, β, two LayerNorms with the same input always produce the same output regardless of the surrounding network. With γ, β, the layer can adapt its output distribution to what the next layer needs. In practice γ tends to stay within roughly [0.1, 10] and β within roughly [−0.5, 0.5]: adjustments to the parameter-free normalisation, not a replacement for it.
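The effect of the affine stage on a row's statistics can be checked directly. The sketch below is illustrative only: `affine_row` and `row_mean_var` are hypothetical helper names, not part of the chapter's kernel, and the input is assumed to be an already-normalised row.

```c
#include <assert.h>

/* Affine stage only: y_j = g_j * xhat_j + b_j for an already-normalised row.
 * (Hypothetical helper, named here for illustration.) */
void affine_row(const float *xhat, const float *g, const float *b,
                float *y, int d)
{
    assert(d > 0);
    for (int j = 0; j < d; j++) y[j] = g[j] * xhat[j] + b[j];
}

/* Mean and population variance of one row, used to inspect the result. */
void row_mean_var(const float *x, int d, float *mean, float *var)
{
    assert(d > 0);
    float m = 0.0f, v = 0.0f;
    for (int j = 0; j < d; j++) m += x[j];
    m /= (float)d;
    for (int j = 0; j < d; j++) v += (x[j] - m) * (x[j] - m);
    *mean = m;
    *var = v / (float)d;
}
```

For the normalised row x̂ = (−1, 1, −1, 1), the identity setting γ = 1, β = 0 leaves the mean at 0 and the variance at 1. Choosing γ = (1, 1, 3, 3) and β = (0, 0, 0.5, 0.5) gives y = (−1, 1, −2.5, 3.5), whose mean is 0.25 and whose variance is 5.0625: the zero-mean, unit-variance constraint is gone.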
LayerNorm vs BatchNorm — which axis are we normalising?
The two differ in the axis over which the statistics are taken:
BatchNorm: per-feature statistics, taken across the batch. For each feature j, compute μ_j and σ²_j across all B examples in the batch.
LayerNorm: per-token statistics, taken across the features. For each token, compute μ and σ² across its d feature values.
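The two choices of axis can be put side by side on a small matrix. This is a minimal sketch, assuming a row-major layout with one token (or example) per row; `stats_per_row` and `stats_per_column` are hypothetical names chosen for illustration, and the BatchNorm side shows only the batch statistics, not running averages.

```c
#include <assert.h>
#include <stddef.h>

/* X is row-major: n rows (tokens/examples), d columns (features). */

/* LayerNorm-style statistics: one (mean, variance) pair per row. */
void stats_per_row(const float *X, int n, int d, float *mu, float *var)
{
    assert(n > 0 && d > 0);
    for (int i = 0; i < n; i++) {
        const float *x = X + (size_t)i * d;
        float m = 0.0f, v = 0.0f;
        for (int j = 0; j < d; j++) m += x[j];
        m /= (float)d;
        for (int j = 0; j < d; j++) v += (x[j] - m) * (x[j] - m);
        mu[i] = m;
        var[i] = v / (float)d;
    }
}

/* BatchNorm-style statistics: one (mean, variance) pair per column. */
void stats_per_column(const float *X, int n, int d, float *mu, float *var)
{
    assert(n > 0 && d > 0);
    for (int j = 0; j < d; j++) {
        float m = 0.0f, v = 0.0f;
        for (int i = 0; i < n; i++) m += X[(size_t)i * d + j];
        m /= (float)n;
        for (int i = 0; i < n; i++) {
            float dx = X[(size_t)i * d + j] - m;
            v += dx * dx;
        }
        mu[j] = m;
        var[j] = v / (float)n;
    }
}
```

For the 2 × 3 matrix with rows (1, 2, 3) and (4, 5, 6), the per-row means are 2 and 5, while the per-column means are 2.5, 3.5 and 4.5. Replacing the second row changes every per-column statistic but leaves the first row's statistics untouched, which is the batch coupling that LayerNorm avoids.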
Why this matters for transformers:
- BatchNorm needs running statistics for inference (since you can’t compute batch stats with batch size 1). LayerNorm doesn’t — the per-token statistics are computed on-the-fly from the input itself, identically at training and inference.
- BatchNorm couples examples in a batch. Two sequences of different lengths can’t share the same statistics cleanly. LayerNorm treats each token independently.
- BatchNorm is unstable at small batch sizes. Modern LLM training often uses gradient accumulation with effective batch sizes of millions of tokens but micro-batch sizes of 1-4 sequences — BatchNorm would fail here.
- Decoder-only models generate token-by-token at inference (batch of 1 sequence, generating 1 token). LayerNorm just works; BatchNorm would either need stored statistics (frozen) or fail.
The combination — batch-independent, length-independent, train-equals-inference — is why LayerNorm displaced BatchNorm in sequence models around 2017 and is now universal in transformers.
Now make it run
The C kernel below implements LayerNorm. Run with γ = 1 and β = 0, its output should have zero mean and unit variance per row:
    #include <math.h>    /* sqrtf */
    #include <stddef.h>  /* size_t */

    /* X, Y: row-major N x D matrices (one token per row).
     * g, b: length-D scale (γ) and shift (β) vectors. */
    static void layernorm(const float* X, const float* g, const float* b,
                          float* Y, int N, int D)
    {
        const float eps = 1e-5f;
        for (int i = 0; i < N; i++) {
            const float* x = X + (size_t)i * D;
            float* y = Y + (size_t)i * D;
            /* mean over the D features of this row */
            float mu = 0;
            for (int j = 0; j < D; j++) mu += x[j];
            mu /= D;
            /* (population) variance over the same D features */
            float var = 0;
            for (int j = 0; j < D; j++) { float dx = x[j] - mu; var += dx * dx; }
            var /= D;
            float inv = 1.0f / sqrtf(var + eps);
            /* normalise + affine */
            for (int j = 0; j < D; j++) y[j] = g[j] * ((x[j] - mu) * inv) + b[j];
        }
    }
Output:
LayerNorm output (γ=1, β=0): max |row mean| = 1.44e-06 (should be ~0)
LayerNorm output (γ=1, β=0): max |row var − 1| = 3.28e-06 (should be ~0)
LayerNorm produces zero-mean, unit-variance rows to roundoff. The ~1e-6 errors are a small multiple of float32 machine epsilon (about 1.2e-7), which is what accumulating d = 512 terms in float32 produces; the ε = 1e-5 inside the square root also pulls the output variance very slightly below 1. Modern transformer implementations sometimes compute the LayerNorm statistics in fp32 even when the rest of the layer is in fp16/bf16, precisely to keep this error from compounding.
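The idea of accumulating statistics in a wider type than the data can be sketched in plain C. Standard C has no portable half-precision type, so this sketch assumes float storage with double accumulators as a stand-in for fp16/bf16 storage with fp32 accumulators; `row_stats_narrow` and `row_stats_wide` are hypothetical names.

```c
#include <assert.h>

/* Statistics accumulated in float, the same width as the stored data. */
void row_stats_narrow(const float *x, int d, float *mu, float *var)
{
    assert(d > 0);
    float m = 0.0f, v = 0.0f;
    for (int j = 0; j < d; j++) m += x[j];
    m /= (float)d;
    for (int j = 0; j < d; j++) { float dx = x[j] - m; v += dx * dx; }
    *mu = m;
    *var = v / (float)d;
}

/* Same statistics, but every sum is carried in double.
 * Only the accumulators are widened; the input stays float. */
void row_stats_wide(const float *x, int d, double *mu, double *var)
{
    assert(d > 0);
    double m = 0.0, v = 0.0;
    for (int j = 0; j < d; j++) m += (double)x[j];
    m /= (double)d;
    for (int j = 0; j < d; j++) { double dx = (double)x[j] - m; v += dx * dx; }
    *mu = m;
    *var = v / (double)d;
}
```

The design point is that the wide version costs nothing in memory traffic: the row is still read in its storage type, and only a handful of scalars per row live in the wider type.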
Written out step by step, the computation for one token is:
Step 1 (statistics across features):
    μ = (1/d) Σ_j x_j    (mean across the d features)
    σ² = (1/d) Σ_j (x_j − μ)²    (variance across the d features)
Step 2 (normalise):
    x̂_j = (x_j − μ) / √(σ² + ε)
    ε ≈ 1e-5 prevents division by zero when σ² ≈ 0 (e.g., when all entries are equal).
Step 3 (learned affine):
    y_j = γ_j · x̂_j + β_j
    γ ∈ ℝ^d and β ∈ ℝ^d are learned per-feature.
Key: normalisation is per-token (uses statistics of this row only); the affine γ, β are per-feature (same across all tokens). After step 3, feature j of every token carries the scale γ_j and offset β_j, so the network can choose how to use the normalised representation.
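A small worked example makes the three steps concrete. The input x and the values of γ and β below are chosen purely for illustration; results are rounded to four decimals, and ε = 1e-5 does not change any of the digits shown.

```latex
\begin{aligned}
x &= (1,\ 2,\ 3,\ 4), \qquad \gamma = (1,\ 1,\ 2,\ 2), \qquad \beta = (0,\ 0,\ 0,\ 1) \\
\mu &= \tfrac{1}{4}(1 + 2 + 3 + 4) = 2.5 \\
\sigma^2 &= \tfrac{1}{4}\left(1.5^2 + 0.5^2 + 0.5^2 + 1.5^2\right) = 1.25 \\
\hat{x} &= \frac{x - \mu}{\sqrt{\sigma^2 + \varepsilon}} \approx (-1.3416,\ -0.4472,\ 0.4472,\ 1.3416) \\
y &= \gamma \odot \hat{x} + \beta \approx (-1.3416,\ -0.4472,\ 0.8944,\ 3.6833)
\end{aligned}
```

The normalised vector x̂ has zero mean and (up to ε) unit variance; the output y does not, because the last two features were scaled by 2 and the last one shifted by 1.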
The gradient — why this helps training
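The gradient through LayerNorm is more interesting than it looks. Write z for the normalised vector x̂ and g = γ ⊙ ∂L/∂y for the gradient arriving at z. Working out ∂L/∂x from ∂L/∂y gives:
    ∂L/∂x = (1/√(σ² + ε)) · (g − mean(g) − z · mean(g ⊙ z))
where mean(·) averages over the d features.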
The two correction terms are what makes LayerNorm a gradient whitener. The mean correction removes the component of the upstream gradient that affects only the mean (the network can’t change the input mean — LayerNorm just subtracts it back out). The variance correction removes the component that affects only the variance. What’s left is the gradient signal that actually changes the direction of x.
The empirical effect: without normalisation, transformers deeper than roughly 6 layers tend to diverge during training as activations explode at intermediate layers. With LayerNorm, the same architecture trains stably at 96+ layers. The “gradient whitening” is what enables the depth that makes large transformers work.
Without the corrections: if you backprop through the normalisation while pretending μ and σ are constants, you get ∂L/∂x_j ≈ (γ_j/σ) · ∂L/∂y_j. This is the naive chain rule, treating the normalisation as a fixed linear map.
What’s wrong: μ and σ are NOT constants; they depend on every entry of x. If you change x_j, you change BOTH the (x_j − μ) numerator AND the σ denominator. The “naive” gradient ignores this, so it is not the true gradient of the loss: part of it is an artifact of the mean and variance shifting along with x, and a step along it is not guaranteed to decrease the loss.
What the two correction terms do:
- − mean(g): removes the component of the gradient that would shift the row’s mean. The network can’t change the input mean — LayerNorm will just re-subtract it. So that gradient component is wasted “force.” Subtracting it concentrates the gradient on directions that actually change the output.
- − z · mean(g ⊙ z): removes the component along the variance direction. The network can’t change the row’s variance either — LayerNorm will rescale. So this component is also wasted.
What’s left is the gradient projection onto the (d − 2)-dimensional subspace ORTHOGONAL to the constant-vector and the input-z direction — the gradient that actually corresponds to changes in the output the network can produce.
Why naive whitening fails empirically: the un-corrected gradient has spurious components that push features in directions LayerNorm will immediately undo. These spurious gradients can be much larger than the “real” gradient (especially when σ is small). Without correction: training is noisy and slow because most of each step is “fixing what LayerNorm already fixed.” With correction: each step moves the loss in a productive direction.
This is why LayerNorm matters more than it looks: it’s not just normalisation in the forward pass; the BACKWARD pass through it is doing free gradient-whitening every step. The same per-token coupling that’s a constraint in the forward pass is a gradient-shaping benefit in the backward pass.
This connects back to Ch.8’s preconditioning intuition: Adam estimates per-parameter scale; SGD with momentum estimates direction. LayerNorm achieves a related effect WITHIN the forward pass by ensuring no single dimension can dominate, and WITHIN the backward pass by removing the gradient components LayerNorm would undo. The three together (Adam + LayerNorm + residual connections [§14.3]) are what make 70B+ parameter transformers trainable end-to-end.
Next: §14.2 — RMSNorm and pre-norm placement. The “drop the mean” simplification that every modern open model uses, and the placement choice (pre-norm before the sublayer vs post-norm after) that affects gradient flow.