NORMALIZATION & RESIDUALS
Section 14.1

LayerNorm — derivation and what it actually does

Every transformer wraps each attention block and FFN block with a normalisation operation. Without it, training is unstable: activations drift to large magnitudes layer by layer, the softmax in attention saturates, and gradients vanish or explode. Ba et al. (2016), “Layer Normalization”, introduced LayerNorm as the fix that finally worked for sequence models. Unlike BatchNorm (Ioffe & Szegedy 2015), which normalises across the batch dimension, LayerNorm normalises across the feature dimension of each token independently. The operational consequence: LayerNorm works at batch size 1, behaves identically during training and inference, and handles variable-length sequences. These properties are why transformers adopted it.

The formula

For a single token’s activation vector x ∈ ℝ^d (one row of the activations matrix), LayerNorm computes:

Compute per-token statistics (across the d feature dimensions):

    μ  = (1/d) · Σ_j x_j              (mean)
    σ² = (1/d) · Σ_j (x_j − μ)²       (variance; population, not sample)

Normalise:

    x̂_j = (x_j − μ) / √(σ² + ε)      for j = 1..d

Apply learned affine:

    y_j = γ_j · x̂_j + β_j            γ, β ∈ ℝ^d (learned per-feature)

The ε term (typically 1e-5) guards against division by zero when variance is tiny.

Two-stage operation: normalise (parameter-free, per-token), then affine (learned, per-feature). The normalise stage is what stabilises the activations; the affine stage is what gives the layer expressivity back — without γ, β the network would lose the ability to represent any non-unit-variance, non-zero-mean intermediate distribution.

LayerNorm’s scope is one token’s activation vector. The statistics μ and σ² come from that one vector’s d feature values; they have nothing to do with other tokens in the same batch or sequence.

Why per-feature γ and β

Both γ and β are vectors of length d (one entry per feature dimension), and they’re learned. Why?

Without the affine stage, every LayerNorm output has zero mean and unit variance per token — that’s a hard structural constraint. But the network might need certain features to be larger than others (e.g., positional features dominating semantic ones, or one head’s contribution being weighted differently), or to have non-zero mean (e.g., the residual stream encoding a sentence-level bias). The γ, β stage gives the network back this freedom: it can choose to keep the unit-variance constraint (set γ = 1, β = 0) or scale specific features up/down (γ_j ≠ 1) or shift them (β_j ≠ 0).

The expressivity argument: without γ, β, two LayerNorms with the same input always produce the same output regardless of the surrounding network. With γ, β, the layer can adapt its output distribution to what the next layer needs. In practice γ tends to learn values in [0.1, 10] and β in [−0.5, 0.5] — small adjustments to the parameter-free normalisation, not radical changes.
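To see the freedom the affine stage restores, here is a small sketch with made-up numbers. `apply_affine`, `row_mean`, and `affine_demo_mean` are hypothetical helpers, and the γ, β values are chosen only for illustration, not taken from any trained model. It starts from an already-normalised row x̂ (zero mean) and shows that a non-trivial γ, β moves the row mean away from zero.

```c
#include <stddef.h>

/* Hypothetical helper: y_j = gamma_j * xhat_j + beta_j. */
static void apply_affine(const float *xhat, const float *gamma,
                         const float *beta, float *y, size_t d)
{
    for (size_t j = 0; j < d; j++) y[j] = gamma[j] * xhat[j] + beta[j];
}

/* Mean of one row of length d. */
static float row_mean(const float *v, size_t d)
{
    float s = 0.0f;
    for (size_t j = 0; j < d; j++) s += v[j];
    return s / (float)d;
}

/* Row mean after the affine stage, for illustrative gamma and beta. */
static float affine_demo_mean(void)
{
    /* x-hat for the row (1, 2, 3, 4), rounded to 4 decimals */
    const float xhat[4]  = { -1.3416f, -0.4472f, 0.4472f, 1.3416f };
    const float gamma[4] = { 2.0f, 1.0f, 1.0f, 0.5f };   /* illustrative */
    const float beta[4]  = { 0.0f, 0.0f, 1.0f, 0.0f };   /* illustrative */
    float y[4];
    apply_affine(xhat, gamma, beta, y, 4);
    return row_mean(y, 4);
}
```

With these values the row mean comes out near −0.25 rather than 0. Passing γ = 1, β = 0 to `apply_affine` instead makes the stage the identity, which recovers the zero-mean row.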

LayerNorm vs BatchNorm — which axis are we normalising?

Activations matrix shape: [batch=B, sequence=N, features=d]

BatchNorm normalises along the batch axis (B): for each (sequence position, feature) pair, take statistics across the B examples.
  ↳ μ, σ² have shape [N, d]

LayerNorm normalises along the feature axis (d): for each (batch element, sequence position) pair, take statistics across the d features.
  ↳ μ, σ² have shape [B, N]

Why feature-axis works for transformers:

  • identical formula at any batch size, even B = 1.
  • identical formula at any sequence length, no batch-statistic running averages.
  • identical formula at inference time (no train/eval mode difference).
  • each token's normalisation is independent, which is natural for variable-length inputs.

Why batch-axis breaks for transformers:

  • batches of variable-length sequences have different valid positions per example.
  • inference has to fall back on running statistics accumulated during training, so train and eval modes behave differently.
  • small batch sizes destabilise statistics (BatchNorm with B = 1 is meaningless).
  • sequence-to-sequence training has noisy targets where batch statistics shift mid-training.

Different axis:

BatchNorm: per-feature statistics, taken across the batch. For each feature j, compute μ_j and σ²_j across all B examples in the batch.

LayerNorm: per-token statistics, taken across the features. For each token, compute μ and σ² across its d feature values.

Why this matters for transformers:

  • BatchNorm needs running statistics for inference (since you can’t compute batch stats with batch size 1). LayerNorm doesn’t — the per-token statistics are computed on-the-fly from the input itself, identically at training and inference.
  • BatchNorm couples examples in a batch. Two sequences of different lengths can’t share the same statistics cleanly. LayerNorm treats each token independently.
  • BatchNorm is unstable at small batch sizes. Modern LLM training often uses gradient accumulation with effective batch sizes of millions of tokens but micro-batch sizes of 1-4 sequences — BatchNorm would fail here.
  • Decoder-only models generate token-by-token at inference (batch of 1 sequence, generating 1 token). LayerNorm just works; BatchNorm would either need stored statistics (frozen) or fail.

The combination — batch-independent, length-independent, train-equals-inference — is why LayerNorm displaced BatchNorm in sequence models around 2017 and is now universal in transformers.
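The axis difference can be made concrete with a small sketch. The row-major [B][N][d] layout, the function names `mean_over_features`, `mean_over_batch`, and `perturb_other_example`, and the toy sizes are all assumptions for illustration. It computes only the means (the variances follow the same indexing), then perturbs one value in example 1 to show which statistics seen by example 0 move.

```c
#include <stddef.h>

enum { B = 2, N = 3, D = 4 };   /* toy sizes, chosen for illustration */

/* LayerNorm-style: one mean per (batch element, position); out has B*N entries. */
static void mean_over_features(const float *X, float *out)
{
    for (int b = 0; b < B; b++)
        for (int n = 0; n < N; n++) {
            float s = 0.0f;
            for (int j = 0; j < D; j++) s += X[((size_t)b * N + n) * D + j];
            out[b * N + n] = s / (float)D;
        }
}

/* BatchNorm-style, per (position, feature) pair; out has N*D entries. */
static void mean_over_batch(const float *X, float *out)
{
    for (int n = 0; n < N; n++)
        for (int j = 0; j < D; j++) {
            float s = 0.0f;
            for (int b = 0; b < B; b++) s += X[((size_t)b * N + n) * D + j];
            out[n * D + j] = s / (float)B;
        }
}

/* Perturb example 1 only, then report whether the statistics used for
   example 0's first token changed under each scheme (1 = changed). */
static void perturb_other_example(int *layer_stat_changed, int *batch_stat_changed)
{
    float X[B * N * D];
    float ln0[B * N], ln1[B * N], bn0[N * D], bn1[N * D];
    for (int i = 0; i < B * N * D; i++) X[i] = (float)(i % 7) - 3.0f;

    mean_over_features(X, ln0);
    mean_over_batch(X, bn0);

    X[(size_t)1 * N * D] += 100.0f;   /* example 1, position 0, feature 0 */

    mean_over_features(X, ln1);
    mean_over_batch(X, bn1);

    *layer_stat_changed = (ln0[0] != ln1[0]);   /* mean for (b=0, n=0) */
    *batch_stat_changed = (bn0[0] != bn1[0]);   /* mean for (n=0, j=0) */
}
```

The feature-axis mean for example 0 is untouched by the change in example 1, while the batch-axis mean that example 0 would be normalised with shifts. That coupling between examples is what the bullets above describe.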

Now make it run

The C kernel implements LayerNorm and verifies the output has zero mean and unit variance per row when γ = 1 and β = 0:

norms.c — layernorm C · per-row layer normalisation
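#include <math.h>    /* sqrtf */
#include <stddef.h>  /* size_t */

/* LayerNorm over each of the N rows of X (row-major, N x D).
   g (gamma) and b (beta) are length-D vectors shared by all rows. */
static void layernorm(const float* X, const float* g, const float* b,
                      float* Y, int N, int D)
{
    const float eps = 1e-5f;
    for (int i = 0; i < N; i++) {
        const float* x = X + (size_t)i * D;
        float* y = Y + (size_t)i * D;
        /* mean over the D features of this row */
        float mu = 0;
        for (int j = 0; j < D; j++) mu += x[j];
        mu /= D;
        /* population variance (divide by D, not D - 1) */
        float var = 0;
        for (int j = 0; j < D; j++) { float dx = x[j] - mu; var += dx * dx; }
        var /= D;
        /* 1 / sqrt(var + eps), computed once per row */
        float inv = 1.0f / sqrtf(var + eps);
        /* normalise + affine */
        for (int j = 0; j < D; j++) y[j] = g[j] * ((x[j] - mu) * inv) + b[j];
    }
}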

Output:

LayerNorm  output (γ=1, β=0):  max |row mean|     = 1.44e-06   (should be ~0)
LayerNorm  output (γ=1, β=0):  max |row var − 1|  = 3.28e-06   (should be ~0)

LayerNorm produces zero-mean, unit-variance rows to roundoff. The ~1e-6 errors are a few units of float32 rounding (machine epsilon is about 1.2e-7) accumulated over sums of d = 512 terms. The ε in the denominator adds a small bias of its own: the output variance is σ²/(σ² + ε), slightly below 1. Modern transformer implementations sometimes compute the LayerNorm statistics in fp32 even when the rest of the layer is in fp16/bf16, precisely to keep this error from compounding.


Step 1 (statistics across features):

μ = (1/d) Σ_j x_j (mean across d features)

σ² = (1/d) Σ_j (x_j − μ)² (variance across d features)

Step 2 (normalise):

x̂_j = (x_j − μ) / √(σ² + ε)

ε ≈ 1e-5: prevents division by zero when σ² ≈ 0 (e.g., when all entries equal).

Step 3 (learned affine):

y_j = γ_j · x̂_j + β_j

γ ∈ ℝ^d, β ∈ ℝ^d are learned per-feature.

Key: normalisation is per-token (uses statistics of this row only); the affine γ, β are per-feature (same across all tokens). After step 3 the output is no longer pinned to zero mean and unit variance: γ and β set each feature's scale and offset, so the network can choose how to use the normalised representation.

The gradient — why this helps training

The gradient through LayerNorm is more interesting than it looks. Working out ∂L/∂x from ∂L/∂y:

Let z = (x − μ) / σ_ε where σ_ε = √(σ² + ε). Output y = γ ⊙ z + β. Loss derivative coming in: ∂L/∂y.

First the easy parts (chain rule directly):

    ∂L/∂γ_j = Σ_i ∂L/∂y_ij · z_ij     (sum across token positions)
    ∂L/∂β_j = Σ_i ∂L/∂y_ij            (sum across token positions)
    ∂L/∂z_j = γ_j · ∂L/∂y_j           (per token; γ is absorbed here)

For ∂L/∂x_j (the per-token gradient), we have to be careful because μ and σ both depend on x. Working through (each token row independently):

    ∂L/∂x_j = (1 / σ_ε) · [ ∂L/∂z_j − mean(∂L/∂z) − z_j · mean(∂L/∂z ⊙ z) ]

The three terms have intuitive meanings:

    ∂L/∂z_j                  direct contribution from feature j
    mean(∂L/∂z)              global mean correction (since μ depends on every x_j)
    z_j · mean(∂L/∂z ⊙ z)    per-feature variance correction (since σ also depends on every x_j)

The two correction terms are what makes LayerNorm a gradient whitener. The mean correction removes the component of the upstream gradient that affects only the mean (the network can’t change the input mean — LayerNorm just subtracts it back out). The variance correction removes the component that affects only the variance. What’s left is the gradient signal that actually changes the direction of x.

The empirical effect: without LayerNorm, transformers above ~6 layers diverge during training — activations explode at intermediate layers. With LayerNorm, the same architecture trains stably at 96+ layers. The “gradient whitening” is what enables the depth that makes large transformers work.


Without the corrections: if you backprop through the normalisation while pretending μ and σ are constants, you get ∂L/∂x_j ≈ (1/σ_ε) · ∂L/∂z_j = (γ_j/σ_ε) · ∂L/∂y_j. That is the naive chain rule, treating the normalisation as a fixed linear operation.

What’s wrong: μ and σ are NOT constants; they depend on every entry of x. If you change x_j, you change BOTH the (x_j − μ) numerator AND the σ denominator. The “naive” gradient ignores this and ends up gradient-descending in a direction that doesn’t actually decrease the loss — because part of the apparent gradient is just an artifact of the mean/variance shifting along with x.

What the two correction terms do (writing g = ∂L/∂z):

  • − mean(g): removes the component of the gradient that would shift the row’s mean. The network can’t change the input mean — LayerNorm will just re-subtract it. So that gradient component is wasted “force.” Subtracting it concentrates the gradient on directions that actually change the output.
  • − z · mean(g ⊙ z): removes the component along the variance direction. The network can’t change the row’s variance either — LayerNorm will rescale. So this component is also wasted.

What’s left is the gradient projection onto the (d − 2)-dimensional subspace ORTHOGONAL to the constant-vector and the input-z direction — the gradient that actually corresponds to changes in the output the network can produce.

Why naive whitening fails empirically: the un-corrected gradient has spurious components that push features in directions LayerNorm will immediately undo. These spurious gradients can be much larger than the “real” gradient (especially when σ is small). Without correction: training is noisy and slow because most of each step is “fixing what LayerNorm already fixed.” With correction: each step moves the loss in a productive direction.

This is why LayerNorm matters more than it looks: it’s not just normalisation in the forward pass; the BACKWARD pass through it is doing free gradient-whitening every step. The same per-token coupling that’s a constraint in the forward pass is a gradient-shaping benefit in the backward pass.

This connects back to Ch.8’s preconditioning intuition: Adam estimates per-parameter scale; SGD with momentum estimates direction. LayerNorm achieves a related effect WITHIN the forward pass by ensuring no single dimension can dominate, and WITHIN the backward pass by removing the gradient components LayerNorm would undo. The three together (Adam + LayerNorm + residual connections [§14.3]) are what make 70B+ parameter transformers trainable end-to-end.

Next: §14.2, RMSNorm and pre-norm placement: the “drop the mean” simplification that most modern open models use, and the placement choice (pre-norm before the sublayer vs post-norm after) that affects gradient flow.