SOFTMAX & THE EXPONENTIAL FAMILY
Section 12.2
02

Cross-entropy, KL divergence, the loss landscape

Ch.9 §1 derived softmax + cross-entropy’s clean p − y gradient via the canonical-link-function argument. Ch.12 §1 gave us numerically-stable softmax. This section connects the two through information theory. Cross-entropy isn’t an arbitrary choice of loss: it’s the unique loss that says “minimise how surprised the model is by the true label, on average.” The KL divergence framing makes this precise, and unlocks three operational refinements: label smoothing (Szegedy et al. 2016), KL as an explicit regulariser (the KL penalty in RLHF, Ch.18), and perplexity as the standard language-model eval metric.

Cross-entropy as a measure of surprise

For a true label distribution y (typically one-hot for classification) and a model distribution p (typically softmax output):

H(y, p) = − Σ_i y_i · log p_i

For one-hot y (true class c):
H(y, p) = − log p_c      ← negative log of the probability the model assigned to the true class

The information-theoretic reading: the surprise of seeing an event with probability p is −log p. Confident correct predictions = low surprise = low loss. Confident wrong predictions = high surprise = high loss. The model is being trained to be unsurprised by the truth.
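A minimal sketch of the surprise reading in plain Python. The function name `cross_entropy` and the three example predictions are illustrative assumptions, not code from this chapter.

```python
import math

def cross_entropy(y, p):
    """H(y, p) = -sum_i y_i * log p_i, in nats. Terms with y_i == 0 contribute nothing."""
    return -sum(yi * math.log(pi) for yi, pi in zip(y, p) if yi > 0)

y = [0.0, 1.0, 0.0]                      # one-hot target, true class c = 1

confident_right = [0.01, 0.98, 0.01]
unsure          = [0.30, 0.40, 0.30]
confident_wrong = [0.98, 0.01, 0.01]

loss_right  = cross_entropy(y, confident_right)   # -log 0.98, about 0.02
loss_unsure = cross_entropy(y, unsure)            # -log 0.40, about 0.92
loss_wrong  = cross_entropy(y, confident_wrong)   # -log 0.01, about 4.61
```

With a one-hot target only the true-class term survives, so each loss is just the negative log of the probability assigned to class 1.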

KL divergence — the natural distance between distributions

The KL divergence between two distributions:

KL(y || p) = Σ_i y_i · log(y_i / p_i)

Decompose:
KL(y || p) = Σ_i y_i · log y_i − Σ_i y_i · log p_i
           = −H(y) + H(y, p)
           = H(y, p) − H(y)

So: H(y, p) = KL(y || p) + H(y)

Cross-entropy = KL divergence + entropy of the target. For a fixed target (a fixed training label), H(y) does not depend on the model parameters, so it contributes nothing to the gradient. Minimising cross-entropy is therefore exactly minimising KL divergence to the target distribution. Cross-entropy is the gradient-computable version of “make the model distribution match the target distribution as closely as possible.”

For one-hot y, H(y) = 0 (no entropy in a degenerate distribution), so cross-entropy and KL divergence are equal. For soft targets (label smoothing, next section), they differ by the target’s entropy.
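The identity H(y, p) = KL(y || p) + H(y) can be checked numerically. The sketch below assumes small hand-picked distributions; the helper names `entropy`, `cross_entropy`, and `kl` are hypothetical.

```python
import math

def entropy(y):
    """H(y) in nats; zero-probability entries contribute nothing."""
    return -sum(yi * math.log(yi) for yi in y if yi > 0)

def cross_entropy(y, p):
    """H(y, p) in nats."""
    return -sum(yi * math.log(pi) for yi, pi in zip(y, p) if yi > 0)

def kl(y, p):
    """KL(y || p) in nats."""
    return sum(yi * math.log(yi / pi) for yi, pi in zip(y, p) if yi > 0)

p = [0.7, 0.2, 0.1]            # model distribution
one_hot = [1.0, 0.0, 0.0]      # hard target: H(y) = 0, so CE equals KL
soft = [0.8, 0.1, 0.1]         # soft target: CE exceeds KL by H(y) > 0

gap_hard = cross_entropy(one_hot, p) - kl(one_hot, p)   # equals H(one_hot) = 0
gap_soft = cross_entropy(soft, p) - kl(soft, p)         # equals H(soft) > 0
```

The gap between the two losses is the target entropy in both cases, which is why it vanishes only for the one-hot target.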

— think, then check —

KL(y || p) = H(y, p) − H(y), where H(y, p) is the cross-entropy and H(y) is the entropy of the target distribution.

For one-hot y: H(y) = 0 → KL(y || p) = H(y, p). Cross-entropy and KL are equal; minimising either is the same.

For soft y (label smoothing, RLHF reference distributions): H(y) > 0 → KL(y || p) = H(y, p) − H(y), a constant offset. The gradient w.r.t. model parameters is the same as cross-entropy’s. But the VALUE of the loss has a non-zero floor — even with a perfect model (p = y), CE bottoms out at H(y) > 0.

Operational consequence: when reporting losses from label-smoothed training, subtract H(y) from the raw cross-entropy to get the KL value, which is the quantity that can be compared across experiments. Two models trained with different smoothing values will have non-comparable raw cross-entropies but comparable KL values.
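One way to apply this when logging, sketched under assumptions: the target is the uniform-mixture smoothing (1 − ε) · one_hot + ε/C, whose entropy has a closed form, and the raw losses for the two runs are made-up numbers chosen for illustration. The names `smoothed_target_entropy` and `reported_kl` are hypothetical.

```python
import math

def smoothed_target_entropy(eps, num_classes):
    """Entropy (nats) of the target (1 - eps) * one_hot + eps / num_classes."""
    if eps == 0.0:
        return 0.0
    off = eps / num_classes            # mass on each non-true class
    on = 1.0 - eps + off               # mass on the true class
    return -(on * math.log(on) + (num_classes - 1) * off * math.log(off))

def reported_kl(raw_cross_entropy, eps, num_classes):
    """KL(y_smooth || p) = H(y_smooth, p) - H(y_smooth)."""
    return raw_cross_entropy - smoothed_target_entropy(eps, num_classes)

# Hypothetical raw losses from two runs with different smoothing strengths.
run_a = reported_kl(raw_cross_entropy=1.10, eps=0.1, num_classes=1000)
run_b = reported_kl(raw_cross_entropy=1.95, eps=0.2, num_classes=1000)
```

In this made-up example run A has the lower raw cross-entropy, yet run B sits closer to its own target once the entropy floor is removed, which is the kind of reversal the raw numbers can hide.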

Label smoothing — softening the target

The default classification loss uses a hard one-hot target: probability 1 on the true class, 0 elsewhere. Szegedy et al. 2016 (“Rethinking the Inception Architecture for Computer Vision”) proposed softening it: replace the one-hot with a mixture that puts weight 1 − ε on the one-hot and spreads the remaining ε uniformly across all C classes, the true class included. The true class ends up with probability 1 − ε + ε/C and every other class with ε/C.

Hard target:  y = one_hot(c) = [0, 0, …, 1, …, 0]      (the 1 marks the true class)
Smoothed:     ỹ = (1 − ε) · y + (ε / C) · 1
                = [ε/C, ε/C, …, 1 − ε + ε/C, …, ε/C]
Typical ε:    0.1 (so the true class gets 0.9 + 7.8 × 10⁻⁷ ≈ 0.900001 for C = 128K, and 0.9 + 0.0001 = 0.9001 for C = 1000)

Label smoothing’s effect: the model is penalised for being too confident. Pushing the true-class probability above its smoothed target raises the loss, because the target says “you should leave some probability mass for the other classes.” Empirically, this regularises the logit magnitudes, reduces overfitting, and slightly improves calibration. Müller et al. 2019 (“When Does Label Smoothing Help?”) found that it improves calibration and tightens the clusters of penultimate-layer representations, while making the smoothed model a weaker teacher for knowledge distillation.

For LLM pretraining at vocab 128K, label smoothing’s effect is small (the ε / C contribution to non-true classes is ~10⁻⁶ per token). It’s more impactful in image classification where vocab is small (~1000 classes) and the smoothing meaningfully shifts the target. Modern LLM recipes generally skip it; vision recipes routinely use it.
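A small sketch of the smoothed target and of the per-class mass at the two scales mentioned above. The function name `smooth_labels` is an assumption for illustration; frameworks expose their own equivalents.

```python
def smooth_labels(true_class, num_classes, eps=0.1):
    """Return (1 - eps) * one_hot(true_class) + eps / num_classes as a list."""
    off = eps / num_classes
    target = [off] * num_classes
    target[true_class] = 1.0 - eps + off
    return target

# Tiny example: 5 classes, true class 2.
small = smooth_labels(true_class=2, num_classes=5, eps=0.1)
# small is [0.02, 0.02, 0.92, 0.02, 0.02] up to float rounding

# Mass placed on each non-true class at the two scales discussed above.
off_vision = 0.1 / 1_000      # 1e-4 for a 1000-class image classifier
off_llm = 0.1 / 128_000       # about 7.8e-7 for a 128K-token vocabulary
```

The total mass moved off the true class is ε in both settings; only the share each individual class receives changes with C.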

Perplexity — the eval metric of language modelling

Cross-entropy averaged over a corpus is the standard training signal, but it’s reported in a more interpretable form for evaluation. Perplexity:

PPL = exp( − (1/N) Σ_t log p(x_t | x_{<t}) )
    = exp( H(test corpus, model) )

Interpretation: the "effective vocabulary size" of the model's per-token predictions.

PPL = 1            (perfect; model assigns probability 1 to every correct token)
PPL = vocab_size   (uninformed baseline; model is uniform over the vocabulary. A confidently wrong model can score worse than this)
PPL ≈ 5-15         (typical for a 30B+ LLM on standard test sets)

Why exponentiate? Cross-entropy is in nats; “exp of nats” gives an interpretable count. PPL = 5 means the model is “as uncertain as if it had a fair 5-way die at each token” — much better than 128K-uniform but not perfectly confident.
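The definition translates directly into code. This is a minimal sketch that assumes the per-token log-probabilities (natural log) of the correct tokens are already available; the function name `perplexity` and the toy inputs are illustrative.

```python
import math

def perplexity(token_log_probs):
    """exp of the mean negative log-probability (natural log) over tokens."""
    n = len(token_log_probs)
    return math.exp(-sum(token_log_probs) / n)

vocab_size = 128_000

perfect  = perplexity([math.log(1.0)] * 6)                # 1.0
uniform  = perplexity([math.log(1.0 / vocab_size)] * 6)   # vocab_size
five_way = perplexity([math.log(0.2)] * 6)                # 5.0, the fair 5-way die

# Uneven confidence across tokens: one badly predicted token dominates the mean.
mixed = perplexity([math.log(p) for p in (0.9, 0.5, 0.05, 0.7)])
```

Because the average is taken in log space, perplexity is the inverse geometric mean of the per-token probabilities, so a single very low probability pulls it up sharply.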

A 70B-parameter model on standard test sets reports PPL ~5–10. The roughly 8× drop from PPL 40 (2018-era models) to PPL ~5 (2024) over six years is the headline scaling result. Gains in language modelling cash out as lower perplexity, and many product capabilities correlate with it.

KL as a regulariser: the KL penalty in RLHF

A different use of KL divergence appears in modern alignment training. In RLHF and its descendants, the model is fine-tuned with an extra loss term that penalises divergence from a reference model:

L_RLHF(θ) = − E[ reward(π_θ output) ]        ← maximise reward
            + β · E[ KL(π_θ || π_ref) ]      ← stay close to the SFT baseline

(π_θ is the trained policy, π_ref is the SFT reference, β controls the strength of the KL constraint)

The KL term is what prevents reward hacking: the model is encouraged to explore high-reward responses but not stray too far from the SFT baseline’s distribution. β is typically 0.05–0.2; lower values give more aggressive optimisation (and more reward hacking risk); higher values keep the model closer to SFT (less alignment gain).
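A toy illustration of the trade-off, not the Ch.18 algorithm. It assumes a single prompt with three enumerated candidate responses, made-up reward scores, and β values picked for this toy reward scale instead of any real recipe. The function names are hypothetical.

```python
import math

def kl(p, q):
    """KL(p || q) in nats over a small discrete set."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def kl_regularised_loss(policy, reference, rewards, beta):
    """-E_policy[reward] + beta * KL(policy || reference) for one prompt."""
    expected_reward = sum(pi * r for pi, r in zip(policy, rewards))
    return -expected_reward + beta * kl(policy, reference)

reference = [0.5, 0.3, 0.2]        # pi_ref over three candidate responses
rewards   = [0.1, 0.4, 1.0]        # hypothetical reward-model scores

near_ref  = [0.4, 0.3, 0.3]        # modest shift toward the high-reward response
collapsed = [0.001, 0.001, 0.998]  # nearly all mass on the high-reward response

# Small beta: the collapsed policy has the lower loss (reward dominates).
low_beta = {name: kl_regularised_loss(pol, reference, rewards, beta=0.05)
            for name, pol in (("near_ref", near_ref), ("collapsed", collapsed))}

# Large beta: the KL penalty dominates and the near-reference policy wins.
high_beta = {name: kl_regularised_loss(pol, reference, rewards, beta=1.0)
             for name, pol in (("near_ref", near_ref), ("collapsed", collapsed))}
```

Which β counts as "small" depends entirely on the reward scale and on how the KL is aggregated over tokens, so the crossover point here says nothing about real training runs.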

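We’ll cover the full algorithm in Ch.18; for now, the takeaway is that the KL divergence we use to define the loss for classification also appears as an explicit penalty in alignment fine-tuning. It is the same quantity in a different role, with one difference in argument order: classification minimises KL(y || p) with the fixed target first, while the RLHF penalty is KL(π_θ || π_ref) with the trained policy first.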

— think, then check —

Why label smoothing prevents overconfidence: with a hard one-hot target, the optimal logit configuration is to push the true class’s logit toward +∞ and the wrong classes’ logits toward −∞. Cross-entropy keeps going down as the gap grows, with diminishing returns. The model ‘wants’ to be infinitely confident.

With label smoothing, the target says ‘be confident in the true class, but leave some probability mass elsewhere.’ There’s an OPTIMAL finite gap between true-class and other-class logits. Pushing further actually increases the loss, because the other classes, each with target mass ε/C, are then assigned too little probability and their −ỹ_i · log p_i terms grow. The model converges to a finite logit gap, which empirically generalises better.
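The finite optimum can be made concrete under a simplifying assumption: the true-class logit is `gap` and every other logit is 0, so the softmax is symmetric across wrong classes. The loss is then minimised where the softmax output equals the smoothed target, at gap = log((1 − ε + ε/C) / (ε/C)). Function names are hypothetical.

```python
import math

def smoothed_ce_vs_gap(gap, num_classes, eps):
    """Cross-entropy against a label-smoothed target when the true-class
    logit is `gap` and all other logits are 0."""
    log_z = math.log(math.exp(gap) + (num_classes - 1))
    log_p_true = gap - log_z
    log_p_other = -log_z
    off = eps / num_classes
    on = 1.0 - eps + off
    return -(on * log_p_true + (num_classes - 1) * off * log_p_other)

def optimal_gap(num_classes, eps):
    """Gap at which the softmax output equals the smoothed target."""
    off = eps / num_classes
    return math.log((1.0 - eps + off) / off)

C, eps = 1000, 0.1
g_star = optimal_gap(C, eps)                        # log(9001), about 9.1

at_optimum = smoothed_ce_vs_gap(g_star, C, eps)
too_small  = smoothed_ce_vs_gap(g_star - 2.0, C, eps)
too_large  = smoothed_ce_vs_gap(g_star + 2.0, C, eps)   # overshooting costs loss

# With eps = 0 there is no finite optimum: the loss keeps falling as the gap grows.
hard_10 = smoothed_ce_vs_gap(10.0, C, 0.0)
hard_20 = smoothed_ce_vs_gap(20.0, C, 0.0)
```

Both `too_small` and `too_large` exceed `at_optimum`, whereas in the hard-target case a larger gap always lowers the loss.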

Why LLM pretraining mostly skips it: for vocab C = 128K and ε = 0.1, ε/C ≈ 10⁻⁶ per token, which is negligible relative to the probabilities the model actually assigns to plausible tokens. The smoothed mass is spread so thinly that it carries no information about which alternatives are reasonable. For C = 1000 (typical image-classification setting), ε/C = 10⁻⁴, which is meaningful. Label smoothing in vision is regularising; in LLM pretraining it’s largely redundant with other regularisation (weight decay, dropout).

Modern LLM SFT (Ch.18) sometimes uses lightly-smoothed targets to prevent the model from collapsing to argmax-deterministic outputs, but this is recipe-specific rather than universal.

— think, then check —

PPL = 4.2 means the model is effectively choosing among ~4.2 tokens per position. The model’s per-token cross-entropy is log(4.2) ≈ 1.44 nats ≈ 2.07 bits.

Comparisons:

  • Uniform baseline (128K vocab): PPL = 128K. The model is ~30,000× better than guessing uniformly.
  • Uniform character-level baseline (vocab ~256): guessing uniformly over 256 byte values costs log2(256) = 8 bits per character, i.e. PPL 256 per character. Our model at PPL 4.2 is ~60× below that figure, and the comparison understates the gap, because 4.2 is measured per BPE token of ~4 characters: per character the model spends only about 2.07 / 4 ≈ 0.52 bits.
  • 2018-era GPT-2 (small, 124M): PPL ~30–40 on WikiText-103. Our model is ~7–10× better. That is roughly a 1.4× improvement per year compounded over 2018-2024, slowing recently.
  • Theoretical entropy of English text: Shannon estimated ~1 bit/character. At ~4 characters per BPE token, that’s ~4 bits/token, or PPL ~16 at the BPE level. We’ve passed the Shannon estimate: at 2.07 bits/token (≈ 0.5 bits/character), the model is more efficient than the original information-theoretic measurement, partly because next-token prediction over a long context reduces the surprise of each token.
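The unit conversions behind these comparisons fit in a few lines. The sketch assumes the figures quoted in this section (PPL 4.2, a 128K vocabulary, ~4 characters per BPE token, ~1 bit/character for Shannon's estimate); none of them is measured here.

```python
import math

ppl = 4.2
nats_per_token = math.log(ppl)         # about 1.44
bits_per_token = math.log2(ppl)        # about 2.07

vocab_size = 128_000
uniform_ratio = vocab_size / ppl       # about 30,000x better than uniform guessing

chars_per_token = 4                    # rough BPE average assumed in the text
bits_per_char = bits_per_token / chars_per_token    # about 0.52

shannon_bits_per_char = 1.0            # the ~1 bit/character estimate quoted above
shannon_token_ppl = 2 ** (shannon_bits_per_char * chars_per_token)   # 2 ** 4

# Annualised improvement implied by a total factor over six years.
total_factor = 8.0                     # e.g. PPL ~35 down to ~4.2
per_year = total_factor ** (1 / 6)     # about 1.4x per year
```

Converting everything to bits per character before comparing avoids mixing per-token and per-character perplexities, which are not on the same scale.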

The decreasing trend in LLM perplexity is asymptotic — we’re approaching an irreducible ‘noise floor’ of human text (typos, contextual ambiguity, rare words). Most product capability improvements past PPL 4 come from longer context, better instruction following (alignment training), and tool use rather than raw next-token prediction accuracy. The PPL race is winding down; the deployment race is still wide open.

END OF CH.12 §2 — Cross-entropy, KL divergence, the loss landscape.
Three recall items: easy (KL vs CE relationship), medium (label smoothing rationale), hard (interpret a 4.2 perplexity against multiple baselines).
Coming next: §12.3 — Online softmax. The block-streaming identity that powers FlashAttention and unlocks Ch.13.