Normalization & residuals

§1 LayerNorm — derivation and what it actually does
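LayerNorm normalises across the feature dimension of each token: subtract the mean, divide by the standard deviation (with a small ε added to the variance for numerical stability), scale by γ, and shift by β. Unlike BatchNorm, its statistics do not depend on the other examples in the batch, so it behaves identically at batch size 1, during training, and at inference, which makes it the right choice for variable-length sequences. The gradient through LayerNorm has a clean form: the backward pass removes the components of the upstream gradient that lie along the all-ones direction and along the normalised activation itself, then rescales by 1/σ. The output is therefore invariant to shifting and rescaling of the input, which keeps activation scale under control from layer to layer and helps deep networks train.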
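A minimal sketch of the forward pass described above, for a single token's feature vector. The function name `layer_norm`, the default `eps`, and the toy input are assumptions made for illustration; real implementations operate on batched tensors.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalise one token's feature vector x (a list of floats)."""
    n = len(x)
    mean = sum(x) / n
    # Biased (population) variance over the feature dimension.
    var = sum((v - mean) ** 2 for v in x) / n
    inv_std = 1.0 / math.sqrt(var + eps)
    # Scale by gamma and shift by beta, feature by feature.
    return [g * (v - mean) * inv_std + b for v, g, b in zip(x, gamma, beta)]

x = [1.0, 2.0, 4.0, 7.0]          # toy input, chosen arbitrarily
ones, zeros = [1.0] * 4, [0.0] * 4
y = layer_norm(x, ones, zeros)

# Shifting and rescaling the input leaves the output almost unchanged
# (exactly unchanged only in the limit eps -> 0).
y_shifted = layer_norm([3.0 * v + 10.0 for v in x], ones, zeros)
```

Nothing here refers to a batch: the statistics come from the one vector being normalised, which is why the same code path serves training and inference.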
§2 RMSNorm and pre-norm placement
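RMSNorm drops the mean subtraction from LayerNorm (and, in its usual form, the shift β) and divides by the root mean square of the features instead of the standard deviation. It turns out the mean centering is doing very little work, and removing it saves ~25% of the operation cost. Pre-norm vs post-norm is a placement choice that matters more than it looks: post-norm normalises the sum x + f(x) after the residual addition, while pre-norm normalises only the sublayer input, computing x + f(norm(x)) and leaving the skip path untouched. The LLaMA models, from the first release onwards, use pre-RMSNorm; the kernel in this section quantifies the speedup.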
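To make the two ideas concrete, here is a small sketch of RMSNorm next to the two placements. The names `rms_norm`, `pre_norm_block`, `post_norm_block`, and `toy_sublayer` are hypothetical, and the sublayer is a stand-in for attention or an MLP. This is not the kernel referred to above and says nothing about speed.

```python
import math

def rms_norm(x, gamma, eps=1e-6):
    """Divide x by its root mean square; no mean subtraction, no shift."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [g * v * inv_rms for v, g in zip(x, gamma)]

def pre_norm_block(x, sublayer, gamma):
    # Pre-norm: normalise only the sublayer input; the skip path carries x unchanged.
    update = sublayer(rms_norm(x, gamma))
    return [a + b for a, b in zip(x, update)]

def post_norm_block(x, sublayer, gamma):
    # Post-norm: add first, then normalise the sum, so the skip path is rescaled too.
    summed = [a + b for a, b in zip(x, sublayer(x))]
    return rms_norm(summed, gamma)

def toy_sublayer(h):
    # Hypothetical stand-in for attention or an MLP.
    return [0.5 * v for v in h]

x = [1.0, 2.0, 4.0, 7.0]          # toy input, chosen arbitrarily
gamma = [1.0] * 4
normed = rms_norm(x, gamma)       # unit RMS, but the mean is not removed
pre = pre_norm_block(x, toy_sublayer, gamma)
post = post_norm_block(x, toy_sublayer, gamma)
```

In the pre-norm version the input `x` reaches the output unmodified, which is the property the residual-stream view relies on.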
§3 The residual stream — why deep networks train
Residual connections (He et al., 2015) added skip paths to deep networks: each block computes x + f(x), so the gradient has an identity term flowing all the way back, unblocked by individual layer transformations. In transformers, the residual stream becomes a first-class object: each block reads from it and writes back to it, and the network is best understood as a sequence of additive updates to a single vector per token. This framing is used by Anthropic for mechanistic interpretability, and the identity path behind it is also the reason 96-layer transformers train at all.
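Both claims can be checked on toy examples. The sketch below first treats the stream as a vector that blocks read from and add to, then compares the end-to-end derivative of a 96-step scalar chain with and without the skip path. The toy blocks, the choice f(h) = 0.5·tanh(h), and all function names are assumptions made for illustration, not a model of a real transformer.

```python
import math

def run_residual_stream(x0, blocks):
    """Apply blocks as additive updates; return the final stream and each update."""
    stream = list(x0)
    updates = []
    for block in blocks:
        delta = block(stream)                            # the block reads the stream
        updates.append(delta)
        stream = [s + d for s, d in zip(stream, delta)]  # and writes back by addition
    return stream, updates

# Hypothetical toy blocks standing in for attention and MLP sublayers.
blocks = [
    lambda h: [0.1 * v for v in h],
    lambda h: [0.01 * v * v for v in h],
    lambda h: [-0.05 * sum(h)] * len(h),
]
x0 = [1.0, -2.0, 3.0]
final, updates = run_residual_stream(x0, blocks)
# The final stream is the input plus the sum of every block's update.
reconstructed = [x + sum(u[i] for u in updates) for i, x in enumerate(x0)]

def chain_grad(x, depth, residual):
    """Derivative of the output w.r.t. x for a scalar chain with f(h) = 0.5 * tanh(h)."""
    h, grad = x, 1.0
    for _ in range(depth):
        t = math.tanh(h)
        local = 0.5 * (1.0 - t * t)        # f'(h)
        if residual:
            grad *= 1.0 + local            # identity term plus the layer's own term
            h = h + 0.5 * t
        else:
            grad *= local                  # no identity term: factors below 1 compound
            h = 0.5 * t
    return grad

plain_grad = chain_grad(1.0, 96, residual=False)
residual_grad = chain_grad(1.0, 96, residual=True)
```

With the skip path, every factor in the chain rule is at least 1, so the derivative cannot shrink below 1 in this toy setup; without it, 96 factors of at most 0.5 drive the derivative towards zero.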