Cross-entropy, KL divergence, the loss landscape

Section 12.2

Ch.9 §1 derived the clean p − y gradient of softmax + cross-entropy via the canonical-link-function argument. Ch.12 §1 gave us numerically stable softmax. This section connects the two through information theory. Cross-entropy isn’t an arbitrary choice of loss: it is the loss that says “minimise how surprised the model is by the true label, on average.” The KL divergence framing makes this precise, and unlocks three operational refinements: label smoothing (Szegedy et al. 2016), KL as an explicit regulariser (the KL penalty in RLHF, Ch.18), and perplexity as the standard language-model eval metric.

Cross-entropy as a measure of surprise

For a true label distribution y (typically one-hot for classification) and a model distribution p (typically softmax output):

    H(y, p) = − Σ_i y_i · log p_i

For one-hot y (true class c):

    H(y, p) = − log p_c    ← negative log of the probability the model assigned to the true class

The information-theoretic reading: the surprise of seeing an event with probability p is −log p. Confident correct predictions = low surprise = low loss. Confident wrong predictions = high surprise = high loss. The model is being trained to be unsurprised by the truth.
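To make the surprise reading concrete, here is a minimal sketch in plain Python. The three-class setup and the example probability vectors are illustrative assumptions, not values from this chapter.

```python
import math

def cross_entropy(y, p):
    """H(y, p) = -sum_i y_i * log(p_i), in nats. Terms with y_i == 0 contribute nothing."""
    return -sum(yi * math.log(pi) for yi, pi in zip(y, p) if yi > 0)

# One-hot target: the true class is index 1 (illustrative).
y = [0.0, 1.0, 0.0]

# Three hypothetical model outputs.
confident_right = [0.01, 0.98, 0.01]
unsure = [0.30, 0.40, 0.30]
confident_wrong = [0.98, 0.01, 0.01]

loss_right = cross_entropy(y, confident_right)   # -log(0.98): low surprise
loss_unsure = cross_entropy(y, unsure)           # -log(0.40)
loss_wrong = cross_entropy(y, confident_wrong)   # -log(0.01): high surprise

print(loss_right, loss_unsure, loss_wrong)
```

For a one-hot target only the true-class probability enters the loss, so the three losses are ordered purely by how much mass each model put on class 1.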

KL divergence — the natural distance between distributions

The KL divergence between two generic distributions p and q is KL(p || q) = Σ_i p_i · log(p_i / q_i), a measure of how much information is ‘lost’ when q is used to approximate p. It is always non-negative, and zero iff p = q. It is NOT symmetric (KL(p || q) ≠ KL(q || p) in general), so it is the natural ‘distance’ between probability distributions without being a true metric. It is foundational in machine learning, particularly variational inference, generative modelling (VAE, diffusion), and RLHF (where it regularises the policy toward the SFT baseline).

With the target y in the first slot and the model distribution p in the second:

    KL(y || p) = Σ_i y_i · log(y_i / p_i)

Decompose:

    KL(y || p) = Σ_i y_i · log y_i − Σ_i y_i · log p_i
               = −H(y) + H(y, p)
               = H(y, p) − H(y)

So:

    H(y, p) = KL(y || p) + H(y)

Cross-entropy = KL divergence + entropy of the target. For a fixed target (a fixed training label), H(y) does not depend on the model parameters, so it contributes nothing to the gradient. Minimising cross-entropy is therefore exactly minimising KL divergence to the target distribution. Cross-entropy is the model-dependent part of “make the model distribution match the target distribution as closely as possible.”

For one-hot y, H(y) = 0 (no entropy in a degenerate distribution), so cross-entropy and KL divergence are equal. For soft targets (label smoothing, next section), they differ by the target’s entropy.
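The identity H(y, p) = KL(y || p) + H(y) can be checked numerically. The sketch below uses plain Python and made-up three-class distributions (both the model output `p` and the soft target are assumptions for illustration).

```python
import math

def entropy(y):
    """H(y) = -sum_i y_i * log(y_i), in nats."""
    return -sum(yi * math.log(yi) for yi in y if yi > 0)

def cross_entropy(y, p):
    """H(y, p) = -sum_i y_i * log(p_i), in nats."""
    return -sum(yi * math.log(pi) for yi, pi in zip(y, p) if yi > 0)

def kl(y, p):
    """KL(y || p) = sum_i y_i * log(y_i / p_i), in nats."""
    return sum(yi * math.log(yi / pi) for yi, pi in zip(y, p) if yi > 0)

p = [0.7, 0.2, 0.1]          # hypothetical model distribution
one_hot = [1.0, 0.0, 0.0]    # hard target
soft = [0.9, 0.05, 0.05]     # soft target (illustrative)

for name, y in (("one-hot", one_hot), ("soft", soft)):
    print(name, cross_entropy(y, p), kl(y, p) + entropy(y))

# A perfect model (p == y) reaches KL = 0, but CE stops at H(y).
floor_ce = cross_entropy(soft, soft)
floor_kl = kl(soft, soft)
print(floor_ce, floor_kl)
```

For the one-hot target the two printed numbers coincide with the KL value itself, because H(y) = 0. For the soft target they differ from KL by the target's entropy, which is also the value cross-entropy reaches when the model matches the target exactly.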

— think, then check —

KL(y || p) = H(y, p) − H(y), where H(y, p) is the cross-entropy and H(y) is the entropy of the target distribution.

For one-hot y: H(y) = 0 → KL(y || p) = H(y, p). Cross-entropy and KL are equal; minimising either is the same.

For soft y (label smoothing, RLHF reference distributions): H(y) > 0 → KL(y || p) = H(y, p) − H(y), a constant offset. The gradient w.r.t. model parameters is the same as cross-entropy’s. But the VALUE of the loss has a non-zero floor — even with a perfect model (p = y), CE bottoms out at H(y) > 0.

Operational consequence: when reporting losses from label-smoothed training, subtract H(y) to get the KL value, which is comparable across experiments. Two models trained with different smoothing values have non-comparable raw cross-entropies (each has a different floor H(y)) but comparable KL values.

↳ §12.2 KL = CE − H

Label smoothing — softening the target

The default classification loss uses a hard one-hot target: probability 1 on the true class, 0 elsewhere. Szegedy et al. 2016 (“Rethinking the Inception Architecture for Computer Vision”) proposed softening it: replace the one-hot with a slightly smoothed distribution that takes a total mass ε off the true class and spreads it uniformly over all C classes, so the true class ends up with 1 − ε + ε/C and every other class with ε/C.

    Hard target:  y = one_hot(c) = [0, 0, …, 1, …, 0]    (1 at the true class)
    Smoothed:     ỹ = (1 − ε) · y + (ε / C) · 1
                    = [ε/C, ε/C, …, 1 − ε + ε/C, …, ε/C]

Typical ε: 0.1. For C = 1000 the true class gets 0.9 + 0.0001 = 0.9001; for C = 128K it gets 0.9 + 0.1/128,000 ≈ 0.9000008.

Label smoothing is a regularisation technique: replace the one-hot target distribution with a softened version that puts 1 − ε + ε/C on the true class and ε/C on every other class. This adds entropy to the target, which prevents the model from becoming overconfident (Szegedy et al. 2016). It is standard in image classification (ε ≈ 0.1) and some NLP recipes, and less universal in LLM pretraining, where the large vocabulary makes the per-class effect small. It helps generalisation by softly penalising large logit gaps.

Label smoothing’s effect: the model is penalised for being too confident. Near-certain predictions incur extra loss even on the correct class, because the target says “you should leave some probability mass for the other classes.” Empirically, this regularises the logit magnitudes, reduces overfitting, and slightly improves calibration. Müller et al. 2019 (“When Does Label Smoothing Help?”) analyse these effects, the calibration improvement in particular.

For LLM pretraining at vocab 128K, label smoothing’s per-class effect is small (the ε / C contribution to each non-true class is ~10⁻⁶ per token). It’s more impactful in image classification, where the label set is small (~1000 classes) and the smoothing meaningfully shifts the target. Modern LLM recipes generally skip it; vision recipes routinely use it.
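As a rough illustration of the scale difference, the sketch below builds the smoothed target ỹ = (1 − ε) · y + ε / C for two class counts. The helper name `smooth_labels` is hypothetical, and 128K is taken as 128,000 for the arithmetic.

```python
def smooth_labels(true_class, num_classes, eps=0.1):
    """Smoothed target: (1 - eps) * one_hot + eps / num_classes on every class."""
    off_value = eps / num_classes
    target = [off_value] * num_classes
    target[true_class] = 1.0 - eps + off_value
    return target

vision_target = smooth_labels(true_class=0, num_classes=1000)
llm_target = smooth_labels(true_class=0, num_classes=128_000)

# Per-class mass on a wrong class: eps / C.
print(vision_target[1])   # 0.1 / 1000
print(llm_target[1])      # 0.1 / 128000

# Mass on the true class: 1 - eps + eps / C.
print(vision_target[0], llm_target[0])
```

In both cases the target still sums to 1 and the true class keeps roughly 0.9; what changes with C is how thinly the remaining ε is spread over the other classes.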

Perplexity — the eval metric of language modelling

Cross-entropy averaged over a corpus is the standard training signal, but it’s reported in a more interpretable form for evaluation. Perplexity:

    PPL = exp( − (1/N) Σ_t log p(x_t | x_{<t}) ) = exp( H(test corpus, model) )

Interpretation: the “effective vocabulary size” of the model’s per-token predictions.

    PPL = 1            (perfect; model assigns probability 1 to every correct token)
    PPL = vocab_size   (uniform-guessing baseline; a confidently wrong model can score even higher)
    PPL ≈ 5-15         (typical for a 30B+ LLM on standard test sets)

Why exponentiate? Cross-entropy is in nats; “exp of nats” gives an interpretable count. PPL = 5 means the model is “as uncertain as if it had a fair 5-way die at each token” — much better than 128K-uniform but not perfectly confident.

A 70B-parameter model on standard test sets reports PPL ~5–10. The roughly 8× drop from PPL 40 (2018-era models) to PPL ~5 (2024) over six years is the headline scaling result. Gains in language modelling show up as lower perplexity, and many product capabilities correlate with it.
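A minimal sketch of the perplexity computation, assuming we already have the natural-log probability the model assigned to each observed token. The token sequences below are synthetic, chosen only to hit the reference points.

```python
import math

def perplexity(token_log_probs):
    """exp of the mean negative log-probability (natural log) of the observed tokens."""
    n = len(token_log_probs)
    return math.exp(-sum(token_log_probs) / n)

vocab_size = 128_000  # assumed value for "128K"

perfect = [math.log(1.0)] * 50                 # probability 1 on every correct token
five_way = [math.log(0.2)] * 50                # fair 5-way die at each token
uniform = [math.log(1.0 / vocab_size)] * 50    # uniform over the vocabulary
overconfident_wrong = [math.log(1e-9)] * 50    # almost no mass on the true token

print(perplexity(perfect))
print(perplexity(five_way))
print(perplexity(uniform))
print(perplexity(overconfident_wrong))
```

The last case illustrates that perplexity has no upper bound at the vocabulary size: a model that puts almost no mass on the observed tokens scores worse than uniform guessing.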

KL as a regulariser — the KL penalty in RLHF

A different use of KL divergence appears in modern alignment training. RLHF (Reinforcement Learning from Human Feedback; a forward reference, covered in Ch.18) is the training process that turns a pretrained base LLM into a chat assistant. It has three stages: (1) supervised fine-tuning (SFT) on demonstration data; (2) training a reward model on human preferences over completions; (3) optimising the SFT model via PPO (or DPO, GRPO) using the reward model as the objective, with a KL-divergence regulariser to the SFT baseline. The KL term prevents the policy from drifting too far from the SFT distribution.

In RLHF and its descendants, the model is fine-tuned with an extra loss term that penalises divergence from a reference model:

    L_RLHF(θ) = − E[ reward(π_θ output) ]        ← maximise reward
                + β · E[ KL(π_θ || π_ref) ]      ← stay close to the SFT baseline

(π_θ is the trained policy, π_ref is the SFT reference, β controls the strength of the KL constraint)

The KL term is what limits reward hacking: the failure mode in which a model trained against a reward signal finds outputs that score very high but don’t match the original intent, exploiting biases or quirks in the reward model. It is common in RLHF when the KL constraint is too weak: the policy diverges far from the SFT baseline and from the reward model’s training distribution, producing outputs that score well on the (now out-of-distribution) reward model but are nonsensical or harmful in context.

With the KL term, the model is encouraged to explore high-reward responses but not stray too far from the SFT baseline’s distribution. β is typically 0.05–0.2; lower values give more aggressive optimisation (and more reward hacking risk); higher values keep the model closer to SFT (less alignment gain).

We’ll cover the full algorithm in Ch.18; for now, the takeaway is that the KL divergence we use to define the loss for classification also appears as an explicit penalty in alignment fine-tuning. It is the same quantity in a different role; note that the model sits in the second slot of KL(y || p) but in the first slot of KL(π_θ || π_ref).
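The trade-off controlled by β can be seen in a toy setting. The sketch below is not the RLHF algorithm (that is Ch.18); it only evaluates the objective −E[reward] + β · KL(π_θ || π_ref) for a single categorical distribution over three candidate responses. The reference distribution, the reward scores, the two candidate policies, and the β values are all hypothetical and chosen for this toy reward scale.

```python
import math

def kl(p, q):
    """KL(p || q) = sum_i p_i * log(p_i / q_i), in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def kl_regularised_loss(policy, reference, rewards, beta):
    """-E_policy[reward] + beta * KL(policy || reference), for one categorical distribution."""
    expected_reward = sum(pi * r for pi, r in zip(policy, rewards))
    return -expected_reward + beta * kl(policy, reference)

reference = [0.5, 0.3, 0.2]   # hypothetical SFT distribution over 3 responses
rewards = [0.0, 1.0, 3.0]     # hypothetical reward-model scores

cautious = [0.4, 0.3, 0.3]    # stays near the reference
greedy = [0.01, 0.01, 0.98]   # piles mass on the top-scoring response

for beta in (0.05, 5.0):
    loss_cautious = kl_regularised_loss(cautious, reference, rewards, beta)
    loss_greedy = kl_regularised_loss(greedy, reference, rewards, beta)
    print(beta, loss_cautious, loss_greedy)
```

With the small β the greedy policy has the lower loss, since the reward term dominates. With the large β the KL penalty dominates and the cautious policy wins, which is the qualitative behaviour the text describes.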

— think, then check —

Why it prevents overconfidence: with a hard one-hot target, the optimal logit configuration is to push the true class’s logit toward +∞ and the wrong classes’ logits toward −∞. Cross-entropy keeps going down as the gap grows, with diminishing returns. The model ‘wants’ to be infinitely confident.

With label smoothing, the target says ‘be confident in the true class, but leave some probability mass elsewhere.’ There’s an OPTIMAL finite gap between true-class and other-class logits — pushing further actually increases the loss (because then the small ε/C target mass gets too small a probability). The model converges to a finite logit gap, which empirically generalises better.
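The finite optimum can be made explicit under a simplifying assumption: all wrong-class logits are equal, so the only free quantity is the gap g between the true-class logit and the rest. Cross-entropy is minimised when the softmax output equals the smoothed target, which gives g* = log((1 − ε + ε/C) / (ε/C)). The function names and the choice C = 10 below are illustrative.

```python
import math

def smoothed_ce_for_gap(gap, num_classes, eps):
    """CE against the smoothed target for logits [gap, 0, ..., 0] (true class first)."""
    z = math.exp(gap) + (num_classes - 1)   # softmax normaliser
    log_p_true = gap - math.log(z)
    log_p_other = -math.log(z)
    off_value = eps / num_classes
    true_value = 1.0 - eps + off_value
    return -(true_value * log_p_true + (num_classes - 1) * off_value * log_p_other)

def optimal_gap(num_classes, eps):
    """Gap at which softmax([gap, 0, ..., 0]) equals the smoothed target (eps > 0)."""
    off_value = eps / num_classes
    return math.log((1.0 - eps + off_value) / off_value)

C, eps = 10, 0.1
g_star = optimal_gap(C, eps)

# With smoothing, the loss rises on both sides of g_star.
for g in (g_star - 1, g_star, g_star + 1, g_star + 10):
    print(g, smoothed_ce_for_gap(g, C, eps))

# With a hard target (eps = 0), the loss keeps falling as the gap grows.
for g in (5, 10, 20):
    print(g, smoothed_ce_for_gap(g, C, 0.0))
```

The first loop shows the loss increasing again once the gap exceeds g*, while the second shows the hard-target loss shrinking toward zero without ever reaching a minimum.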

Why LLM pretraining mostly skips it: for vocab C = 128K, ε/C ≈ 10⁻⁶, which is negligible relative to actual token probabilities. The smoothing essentially adds a tiny constant entropy term to every target distribution but doesn’t meaningfully change the gradient structure. For C = 1000 (typical image-classification setting), ε/C = 10⁻⁴, which is meaningful. Label smoothing in vision is regularising; in LLM pretraining it’s essentially a no-op masked by other regularisation (weight decay, dropout, layer norm).

Modern LLM SFT (Ch.18) sometimes uses lightly-smoothed targets to prevent the model from collapsing to argmax-deterministic outputs, but this is recipe-specific rather than universal.

↳ §12.2 label smoothing

— think, then check —

PPL = 4.2 means the model is effectively choosing among ~4.2 tokens per position. The model’s per-token cross-entropy is log(4.2) ≈ 1.44 nats ≈ 2.07 bits.

Comparisons:

Uniform baseline (128K vocab): PPL = 128K. The model is ~30,000× better than guessing uniformly.
Character-level baseline (vocab ~256): a uniform model over 256 byte values costs 8 bits/character, i.e. PPL 256 per character. Our model’s 2.07 bits/token, at ~4 characters per BPE token, is ~0.52 bits/character, or a per-character PPL of ~1.4. Perplexities are only comparable in the same unit, so convert to bits per character before comparing against a character-level number.
GPT-2 small (124M, released 2019): PPL ~30–40 on WikiText-103. Our model is ~7–10× lower, with the caveat that perplexities measured on different test sets and tokenisers are not strictly comparable. Going from ~40 to ~4–5 over roughly six years is about a 1.5× improvement per year on average, slowing recently.
Theoretical entropy of English text: Shannon estimated ~1 bit/character, or roughly 4–5 bits/word. At ~4 characters per BPE token, that’s ~4 bits/token, i.e. PPL ~16 at the token level. At ~0.52 bits/character, the model is below that estimate. This is possible because Shannon’s figure came from human guessing experiments with limited context rather than a hard floor, and because conditioning on long context reduces the surprise of the next token.
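The unit conversions behind these comparisons are short enough to write out. The sketch below assumes ~4 characters per BPE token, a 128,000-entry vocabulary, and ~1 bit/character for the Shannon-style estimate; all three are round-number assumptions rather than measured values.

```python
import math

ppl_per_token = 4.2
chars_per_token = 4.0       # assumption: average BPE token length
vocab_size = 128_000        # assumption: "128K" taken as 128,000

nats_per_token = math.log(ppl_per_token)    # about 1.44 nats
bits_per_token = math.log2(ppl_per_token)   # about 2.07 bits
bits_per_char = bits_per_token / chars_per_token
ppl_per_char = 2 ** bits_per_char

# Ratio against uniform guessing over the vocabulary, in perplexity terms.
gain_over_uniform = vocab_size / ppl_per_token

# Token-level perplexity implied by ~1 bit/character at ~4 characters/token.
shannon_bits_per_char = 1.0
shannon_ppl_per_token = 2 ** (shannon_bits_per_char * chars_per_token)

print(nats_per_token, bits_per_token, bits_per_char, ppl_per_char)
print(gain_over_uniform, shannon_ppl_per_token)
```

Working in bits per character keeps the comparison independent of the tokeniser, which is why the character-level and Shannon baselines are converted to that unit before being set against the model.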

The decreasing trend in LLM perplexity is asymptotic — we’re approaching an irreducible ‘noise floor’ of human text (typos, contextual ambiguity, rare words). Most product capability improvements past PPL 4 come from longer context, better instruction following (alignment training), and tool use rather than raw next-token prediction accuracy. The PPL race is winding down; the deployment race is still wide open.

↳ §12.2 perplexity

END OF CH.12 §2 — Cross-entropy, KL divergence, the loss landscape.
Three recall items: easy (KL vs CE relationship), medium (label smoothing rationale), hard (interpret a 4.2 perplexity against multiple baselines).
Coming next: §12.3 — Online softmax. The block-streaming identity that powers FlashAttention and unlocks Ch.13.