Softmax & the exponential family

§1 Softmax — properties + numerical stability
Softmax turns a vector of real-valued logits into a probability distribution. It looks simple, but it hides two operational landmines (overflow and underflow of the exponentials) and has one elegant fix: subtract the maximum logit before exponentiating. Production softmax implementations rely on the same three-line numerical-stability trick.
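A minimal NumPy sketch of max-subtraction follows. The function name `softmax` and the example logits are illustrative choices, not taken from any particular library. Shifting by the maximum leaves the result unchanged, because the factor exp(-max) cancels between numerator and denominator.

```python
import numpy as np

def softmax(logits, axis=-1):
    """Softmax with max-subtraction (illustrative sketch)."""
    x = np.asarray(logits, dtype=np.float64)
    # Shift so the largest logit is 0; exp() can then no longer overflow.
    shifted = x - np.max(x, axis=axis, keepdims=True)
    # Every exponential now lies in [0, 1], and the largest is exactly 1,
    # so the denominator is at least 1 and cannot underflow to zero.
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=axis, keepdims=True)

# Logits this large would overflow a naive exp(x) / sum(exp(x)).
probs = softmax([1000.0, 1001.0, 1002.0])
small = softmax([0.0, 1.0, 2.0])  # same logits shifted by -1000
```

Because softmax is invariant to adding a constant to every logit, `probs` and `small` should agree up to floating-point rounding.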
§2 Cross-entropy, KL divergence, the loss landscape
Cross-entropy isn't arbitrary: it is the information-theoretic measure of how surprised the model is by the true label. For a target distribution p and a prediction q, the CE loss equals the KL divergence KL(p ‖ q) plus the entropy H(p), a constant that does not depend on the model and is zero for a one-hot target. Three modern refinements: label smoothing, KL as an explicit regulariser (as in RLHF), and perplexity as the standard LM evaluation metric.
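The relationship between cross-entropy and KL divergence can be checked numerically. The sketch below assumes natural-log (nats) units. The helper names, the example prediction `q`, the smoothing strength `eps`, and the token probabilities are all hypothetical values chosen for illustration.

```python
import numpy as np

def cross_entropy(p, q):
    """H(p, q) = -sum_i p_i log q_i, skipping terms where p_i = 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return -np.sum(p[mask] * np.log(q[mask]))

def entropy(p):
    """H(p) = H(p, p)."""
    return cross_entropy(p, p)

def kl_divergence(p, q):
    """KL(p || q) = sum_i p_i log(p_i / q_i), skipping terms where p_i = 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

q = np.array([0.7, 0.2, 0.1])        # hypothetical model prediction
one_hot = np.array([1.0, 0.0, 0.0])  # true class is index 0

# Label smoothing: mix the one-hot target with a uniform distribution.
eps = 0.1                            # illustrative smoothing strength
smooth = (1 - eps) * one_hot + eps / one_hot.size

# Perplexity = exp(mean per-token negative log-likelihood).
token_probs = np.full(5, 0.25)       # model gives each true token p = 1/4
perplexity = np.exp(np.mean(-np.log(token_probs)))
```

With the one-hot target, `entropy(one_hot)` is zero, so cross-entropy and KL coincide. With the smoothed target they differ by `entropy(smooth)`, which is fixed once `eps` is chosen. A model that assigns probability 1/4 to every true token has perplexity 4.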
§3 Online softmax — the FlashAttention key
Softmax over a long sequence can be computed by processing blocks of the input and carrying two running scalars (max m, sum-of-exps ℓ) per row. The result is mathematically identical to naïve full-batch softmax regardless of block size; in floating-point arithmetic the two agree up to rounding error. That identity, derived in this section and validated against ground truth in the kernel, is the algorithmic foundation of FlashAttention (Ch.13).
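The running-scalar recurrence can be sketched for a single row as follows. This is an illustrative NumPy version under simplifying assumptions, not the kernel referred to above. The names `online_softmax` and `reference_softmax` are hypothetical, and the final normalisation makes a second pass over the input for clarity.

```python
import numpy as np

def reference_softmax(x):
    """Full-batch softmax with max-subtraction, used as ground truth."""
    x = np.asarray(x, dtype=np.float64)
    e = np.exp(x - x.max())
    return e / e.sum()

def online_softmax(x, block_size):
    """Single-row softmax computed block by block (illustrative sketch)."""
    x = np.asarray(x, dtype=np.float64)
    m = -np.inf  # running max over the elements seen so far
    l = 0.0      # running sum of exp(x_i - m) over the elements seen so far
    for start in range(0, x.size, block_size):
        block = x[start:start + block_size]
        m_new = max(m, block.max())
        # Rescale the old sum to the new max, then add this block's terms.
        l = l * np.exp(m - m_new) + np.exp(block - m_new).sum()
        m = m_new
    # Second pass: normalise with the final (m, l).
    return np.exp(x - m) / l

rng = np.random.default_rng(0)
row = rng.normal(scale=10.0, size=257)  # length is not a multiple of the block sizes
expected = reference_softmax(row)
results = {bs: online_softmax(row, bs) for bs in (1, 3, 64, 1000)}
```

Each entry of `results` should match `expected` to within floating-point tolerance, whichever block size is used, because the rescaling factor exp(m - m_new) re-expresses the old partial sum relative to the new maximum.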