NORMALIZATION & RESIDUALS
Section 14.2

RMSNorm and pre-norm placement

Zhang & Sennrich (2019), in “Root Mean Square Layer Normalization”, observed that most of LayerNorm’s benefit comes from the variance normalisation, not the mean centering. Drop the mean subtraction and you save the per-row mean computation (one pass) plus the per-row subtraction. The result, RMSNorm, is computationally cheaper, empirically just as good or marginally better, and is now the default in the major open model families (Llama, Mistral, Qwen, DeepSeek). This section covers the math, the wall-clock difference, and the closely related question of where in the block to put the norm: pre-norm (before the sublayer) vs post-norm (after).

Dropping the mean centering

LayerNorm: y = γ · (x − μ) / √(σ² + ε) + β.

RMSNorm: y = γ · x / √(mean(x²) + ε).

Two changes:

  1. No mean subtraction. The input is divided by the RMS (root-mean-square) of its features, not by the standard deviation of (x − mean).
  2. No β. Just γ. The bias term is dropped — empirically it learns near-zero anyway.

LayerNorm:                            RMSNorm:
  μ   = mean(x)                         RMS = √( mean(x²) + ε )
  σ²  = var(x)
  σ_ε = √(σ² + ε)
  y   = γ · (x − μ) / σ_ε + β           y   = γ · x / RMS

Operations per token (for d-dim input):
  LayerNorm: ~4d  (sum for μ, sum of (x−μ)² for σ², subtract μ, multiply by γ, add β)
  RMSNorm:   ~3d  (sum of x² for RMS, multiply by γ)

Roughly 25% fewer ops, no double-pass over x for mean-then-variance.

The key empirical observation in Zhang & Sennrich’s paper: removing the mean centering doesn’t hurt accuracy on the suite of benchmarks they tested. Subsequent work (Gopher, the Llama family, etc.) confirmed this at scale. The mean of the activation vector is either compensated for by the learned weights, or isn’t being meaningfully used by downstream layers.

RMSNorm is the default in modern transformers because: (1) ~25% fewer ops on the norm itself, which adds up across the many norm calls in each forward and backward pass; (2) one fewer parameter set (no β); (3) no measurable accuracy loss.

Wall-clock comparison

The kernel runs both on identical 64-token, 512-feature inputs and times them:

norms.c — rmsnorm C · norm comparison
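#include <math.h>    /* sqrtf */
#include <stddef.h>  /* size_t */

/* Row-wise RMSNorm: X and Y are N x D row-major, g holds the D scale weights.
   The signature is reconstructed from the names used in the body. */
static void rmsnorm(const float* X, const float* g, float* Y, int N, int D)
{
    const float eps = 1e-5f;
    for (int i = 0; i < N; i++) {
        const float* x = X + (size_t)i * D;
        float* y = Y + (size_t)i * D;
        /* RMS: a single reduction over the row */
        float sumsq = 0.0f;
        for (int j = 0; j < D; j++) sumsq += x[j] * x[j];
        float inv = 1.0f / sqrtf(sumsq / D + eps);
        /* affine: scale only, no mean subtraction and no bias */
        for (int j = 0; j < D; j++) y[j] = g[j] * (x[j] * inv);
    }
}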

Output:

LayerNorm  output (γ=1, β=0):  max |row mean|     = 1.44e-06   (should be ~0)
LayerNorm  output (γ=1, β=0):  max |row var − 1|  = 3.28e-06   (should be ~0)
RMSNorm    output (γ=1):       max |row RMS  − 1| = 8.94e-07   (should be ~0)

Wall-clock over 5000 reps, N=64 D=512:
  LayerNorm:  207.859 ms  (1.54 Mtokens/s)
  RMSNorm:     87.901 ms  (3.64 Mtokens/s)
  RMSNorm speedup over LN: 2.36x

2.36× faster end-to-end on this CPU benchmark. On GPU, the gap is smaller (norm ops are memory-bound, and both ops touch the same memory once) — typically 1.2-1.5×. But even at 1.2×, over hundreds of layers per forward pass, that adds up.

— think, then check —

RMSNorm: y = γ · x / √(mean(x²) + ε).

Dropped from LayerNorm:

  • The mean subtraction (x − μ): RMSNorm divides x by its RMS, not by σ of (x − μ).
  • The bias β: the affine becomes a pure scale γ, no shift.

Empirical justification: Zhang & Sennrich tested on transformer LM and machine translation benchmarks; RMSNorm matched LayerNorm accuracy across the board. Re-mean-centering inside the norm op turned out to be redundant: the mean is either compensated for by the learned weights, or is small enough at typical activation distributions that subtracting it doesn’t change downstream computations meaningfully. Llama (all versions), Mistral, Gopher, and Qwen all use RMSNorm; the empirical case is now overwhelming.

Cost win: ~25% fewer FLOPs per norm op (one fewer reduction over x, one fewer subtraction per element, one fewer add for β). On the norm op itself this translates to a speedup of roughly 1.2× when memory-bound (GPU) up to the 2.36× measured above when compute-bound (CPU).

Pre-norm vs post-norm placement

The other axis: WHERE in the residual block does the norm go?

Post-norm (original Vaswani 2017):
  x_out = LayerNorm( x_in + Sublayer(x_in) )

Pre-norm (every modern model):
  x_out = x_in + Sublayer( LayerNorm(x_in) )

The difference: post-norm normalises AFTER the residual addition. Pre-norm normalises BEFORE the sublayer (attention or FFN), and the residual addition happens at the end without re-normalising.

The choice matters for training stability at depth.

Post-norm applies the norm after the residual. Pre-norm applies it before, leaving the residual path unnormalised.

Why pre-norm trains better: look at the backward pass through L stacked blocks.

Post-norm gradient flow through L blocks:
  x_in → LN(x_in + S₁(x_in)) → LN(prev + S₂(prev)) → ...
  ∂loss/∂x_in includes a chain of L norm-derivative (γ/σ) factors.
  Each LN has a σ that depends on the activations. Small variance fluctuations compound through the chain, so gradients can vanish or explode.

Pre-norm gradient flow through L blocks:
  x_in → x_in + S₁(LN(x_in)) → prev + S₂(LN(prev)) → ...
  ∂loss/∂x_in has an IDENTITY path through every block (the residual).
  The norm is in the SUBLAYER PATH, not the residual path.
  Gradient = identity_path_grad + sublayer_path_grad = clean_signal + bounded_correction.
  No multiplicative chain through the norm; the network can scale gradients freely along the residual path.

The pre-norm placement makes the residual stream a “highway” — gradients flow through it without being modulated by per-layer norm derivatives. That highway is what makes 32+ layer transformers train at all. §14.3 covers the residual stream as a first-class object.
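
The identity-path argument can be written out with block Jacobians. The notation is an assumption introduced here, not the section's own: x_0 = x_in, x_l is the output of block l, J_f denotes the Jacobian of f, and matrix products are ordered with later blocks on the left.

```latex
% Pre-norm: x_l = x_{l-1} + S_l(\mathrm{LN}(x_{l-1}))
\frac{\partial x_l}{\partial x_{l-1}} = I + A_l,
\qquad
A_l = J_{S_l}\!\big(\mathrm{LN}(x_{l-1})\big)\, J_{\mathrm{LN}}(x_{l-1})

\frac{\partial x_L}{\partial x_0}
  = (I + A_L)\cdots(I + A_1)
  = I + \sum_{l=1}^{L} A_l + (\text{products of two or more } A_l)

% Post-norm: x_l = \mathrm{LN}(z_l), \quad z_l = x_{l-1} + S_l(x_{l-1})
\frac{\partial x_l}{\partial x_{l-1}}
  = J_{\mathrm{LN}}(z_l)\,\big(I + J_{S_l}(x_{l-1})\big)

\frac{\partial x_L}{\partial x_0}
  = J_{\mathrm{LN}}(z_L)\big(I + J_{S_L}\big)\cdots J_{\mathrm{LN}}(z_1)\big(I + J_{S_1}\big)
```

In the pre-norm product the leading term is the bare identity, so part of the gradient reaches x_0 without passing through any norm. In the post-norm product every term, including the one that uses only the residual branches, is multiplied by all L factors J_LN(z_l), each of which scales like γ/σ.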

— think, then check —

Placement:

Post-norm: x_out = LN(x_in + Sublayer(x_in)) — the norm is the LAST op, applied to the post-residual sum.

Pre-norm: x_out = x_in + Sublayer(LN(x_in)) — the norm is in the SUBLAYER PATH, and the residual addition happens at the end without being normalised.

What changes:

  • The residual path. In pre-norm, the residual carries x_in directly to the output of the block, unnormalised. In post-norm, the LN wraps both the residual and sublayer outputs together.
  • Gradient flow. Pre-norm has a clean identity gradient path through every block (∂x_out/∂x_in includes a 1 term from the residual). Post-norm has a γ/σ factor from the wrapping LN multiplying everything, including the residual contribution.

Why pre-norm trains better at depth:

At 24+ layers, post-norm’s chain of γ/σ factors causes gradients to either vanish (if σ tends to grow) or explode (if σ tends to shrink). Pre-norm preserves identity gradient flow through the residual, so the gradient along the depth direction always contains the identity term (1 from the residual plus a small bounded correction from the sublayer).

The empirical consequence: post-norm requires careful learning-rate warmup (linear from 0 → max over ~10K steps) to avoid divergence. Pre-norm can use simpler schedules and trains stably at 96+ layers (GPT-3 has 96 layers, PaLM 540B has 118).

The cost: pre-norm produces residual streams that grow in magnitude with depth (each block adds to x without re-normalising). This is mitigated by a final LayerNorm before the output layer, or by careful initialisation of the sublayers (e.g., scaling sublayer outputs by 1/√L for L layers).

What modern models actually use

A quick survey across well-known models:

Model               Norm        Placement
GPT-2 (2019)        LayerNorm   pre-norm
GPT-3 (2020)        LayerNorm   pre-norm
T5 (2020)           RMSNorm     pre-norm (one of the first to adopt RMS)
Llama 1 (2023)      RMSNorm     pre-norm
Llama 2 (2023)      RMSNorm     pre-norm
Llama 3 (2024)      RMSNorm     pre-norm
Mistral 7B (2023)   RMSNorm     pre-norm
Qwen 2/3            RMSNorm     pre-norm
DeepSeek V3         RMSNorm     pre-norm
Gemma               RMSNorm     pre-norm

The choice has converged. RMSNorm in pre-norm position is the default; every paper proposing changes has to justify why their alternative beats this baseline.

— think, then check —

Three independent reasons:

  1. Compute / memory efficiency. RMSNorm needs ~25% fewer ops than LayerNorm and has one fewer parameter set (no β). At 32+ layers × forward + backward × every training step, this compounds. Pre-norm placement costs about the same as post-norm: both run one norm per sublayer, and pre-norm adds a single final norm before the output.
  2. Training stability at depth. Pre-norm gives clean identity-gradient paths through the residual stream (§14.3). No γ/σ chain to amplify or vanish gradients across 32+ layers. Post-norm requires hand-tuned warmup; pre-norm trains stably from initialisation with standard cosine schedules.
  3. No empirical accuracy loss. RMSNorm matches LayerNorm on perplexity and downstream eval; pre-norm matches post-norm at the same depth (slightly worse at very small depths ≤ 6 but better past 12). For the depths real LLMs use (32-96 layers), RMSNorm with pre-norm dominates the alternatives.

Trade-offs accepted:

  • Residual stream magnitude grows with depth. Without re-normalisation, x_in + Sublayer(LN(x_in)) is typically larger in norm than x_in. Across L blocks, the magnitude can grow as O(√L). Mitigated by a final norm before the output projection, and by initialising sublayer outputs at small scale (often 1/√L).
  • Loss of mean centering (RMSNorm). If activations have a nonzero mean, RMSNorm doesn’t remove it — the network has to learn to deal with it. Empirically this doesn’t hurt; activations have ~zero mean by mid-training anyway.
  • Loss of β (RMSNorm). The affine has scale but no shift. Less expressive in principle; empirically this is fine because the linear layer that immediately follows the norm (in the FFN or attention) can absorb any shift into its own bias.

The trade-offs are minor; the wins (speed, stability, fewer params) are major. The convergence of every modern model on this combination is strong empirical evidence that the design space has been searched and this is the optimum for current architectures.

Next: §14.3 — The residual stream. Why deep networks train at all, and how the residual path makes the transformer a “highway with sublayers reading and writing to it.” This is the framing that powers mechanistic interpretability and the right way to think about what each layer is doing.