Sampling — from logits to tokens

Section 15.3

The forward pass ends with logits — a real-valued vector over the vocabulary. Softmax turns these into a probability distribution. But generation requires picking ONE token to append to the running context. How? The simplest answer (argmax) produces deterministic, repetitive, often boring output. Real systems use temperature-modified softmax, optionally truncated by top-k or top-p (nucleus) sampling, sometimes with beam search for “best of N” candidates, and increasingly with speculative decoding (Ch.S) for throughput. This section covers each strategy and the trade-offs.

Greedy: argmax

The simplest sampler:

Greedy: t_next = argmax_i logits_i

Always pick the most likely token. Deterministic. Reproducible. Useful for evaluation, code generation, and any setting where you want exactly one canonical output.

The cost: repetition and degenerate output. Once the model gets into a loop (“the cat the cat the cat…”), greedy can’t break out — the argmax keeps picking the same continuation. For long-form generation, greedy is widely considered worse than even simple stochastic sampling.
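As a minimal sketch of the greedy rule, assuming the logits arrive as a plain float array: the function name `sample_greedy` and the tie-breaking rule (lowest index wins) are illustrative choices, not taken from sampling.c.

```c
#include <assert.h>
#include <stddef.h>

/* Greedy decoding step: return the index of the largest logit.
   Ties go to the lowest index (an arbitrary choice for this sketch). */
static int sample_greedy(const float *logits, size_t n) {
    assert(n > 0);
    size_t best = 0;
    for (size_t i = 1; i < n; i++)
        if (logits[i] > logits[best]) best = i;
    return (int)best;
}
```

No softmax is needed here: softmax is monotonic, so the largest logit is also the largest probability.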

Temperature: sharpen or flatten the distribution

Temperature: t_next ∼ softmax(logits / T)

T = 1: the unmodified softmax distribution.
T < 1: divides logits by a small number → logit gaps grow → softmax sharpens. At T → 0+, softmax converges to argmax (greedy).
T > 1: divides logits by a large number → logit gaps shrink → softmax flattens. At T → ∞, softmax converges to uniform.

Temperature is the simplest control over diversity. Typical values in practice:

T = 0.1-0.3 for deterministic tasks (code, factual QA).
T = 0.7-1.0 for chat (the defaults for ChatGPT and Claude are in this range).
T = 1.2-1.5 for creative writing and brainstorming.

— think, then check —

T = 0.1: Logits are divided by 0.1, which multiplies every logit (and every gap between logits) by 10. Already-likely tokens become VERY likely: the probability ratio between two tokens, exp(gap / T), is raised to the 10th power relative to T = 1. The distribution becomes very concentrated, close to greedy but with a small chance of picking the runner-up.

T = 1: Identity — the unmodified softmax. The “natural” distribution the model learned to produce.

T = 2: Logits are halved, so every logit gap is halved and every probability ratio exp(gap / T) becomes the square root of its T = 1 value. The distribution is flatter: the top token might be 40% instead of 70%, and the tail gets meaningful probability mass.

T → 0+: All probability mass concentrates on the argmax; softmax converges to a one-hot vector at the highest-logit position. Greedy decoding is exactly this limit.

T → ∞: All logits become equal (logit / large_number → 0); softmax becomes uniform. Every token equally likely. Random sampling regardless of model.

The temperature is a SOFTNESS knob, not a quality knob. Higher T = more diversity, less coherence; lower T = more coherence, less diversity. The right value depends entirely on the task.

↳ §15.3 temperature

Top-k: throw away the long tail

Even at T = 1, the softmax distribution has a long thin tail of unlikely tokens. Sampling from the full distribution sometimes picks a token with 0.0001 probability — which can derail a generation.

Top-k sampling:

1. Keep only the k logits with the highest values.
2. Set all other logits to −∞.
3. Apply softmax to the remaining k logits.
4. Sample from this truncated distribution.

Typical k = 40-50.

Pros: bounded set of candidates; never picks from the long thin tail.
Cons: k is fixed regardless of distribution shape. If the model is very confident (one token at 95%), k=50 includes too many; if it is uncertain (say 100 tokens at about 1% each), k=50 cuts off half of the plausible candidates.

Top-k caps the candidate set at k. Simple, fast, effective. The main weakness: k is a constant, but the distribution shape varies wildly across timesteps — sometimes the model is very confident (k=2 would suffice); sometimes uncertain (k=100 might still cut too much).

Top-p (nucleus): adaptive truncation

Holtzman et al. (2019), “The Curious Case of Neural Text Degeneration”, proposed an adaptive alternative:

Top-p (nucleus) sampling:

1. Sort tokens by probability, descending.
2. Take the smallest set whose cumulative probability is ≥ p.
3. Set all other tokens' probability to 0.
4. Renormalise and sample from the nucleus.

Typical p = 0.9-0.95.

The "nucleus" size varies per timestep:
- If the model is very confident, the nucleus might be 1-3 tokens.
- If uncertain, the nucleus might be 50+ tokens.

The truncation cutoff adapts to the distribution's shape.

Top-p (also called nucleus sampling) usually outperforms top-k empirically because it adapts to the distribution. Modern chat models (Claude, GPT, Llama) typically use top-p in the 0.9-0.95 range.

Now make it run

The C kernel runs all five strategies on a realistic logit distribution (Zipfian shape with three peaks) over 100K samples each:

sampling.c — sample_topp (nucleus) C · five samplers, 100K trials each


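/* Strategy 4: top-p (nucleus) = softmax, take smallest set with cum prob >= p, sample */
static int sample_topp(const float* logits, float p_cutoff) {
    float p[V];
    memcpy(p, logits, V * sizeof(float));
    softmax(p, V);
    /* sort indices by p descending (O(V^2) exchange sort; fine for a small vocab) */
    int idx[V];
    for (int i = 0; i < V; i++) idx[i] = i;
    for (int i = 0; i < V; i++)
        for (int j = i + 1; j < V; j++)
            if (p[idx[j]] > p[idx[i]]) { int t = idx[i]; idx[i] = idx[j]; idx[j] = t; }
    /* nucleus: smallest set with cum >= p_cutoff */
    float cum = 0.0f;
    int nucleus_size = 0;
    for (int i = 0; i < V; i++) {
        cum += p[idx[i]];
        nucleus_size = i + 1;
        if (cum >= p_cutoff) break;
    }
    /* zero out non-nucleus */
    float zero_mask[V];
    for (int i = 0; i < V; i++) zero_mask[i] = 0.0f;
    for (int i = 0; i < nucleus_size; i++) zero_mask[idx[i]] = p[idx[i]];
    /* --- the original excerpt ends here; what follows is an assumed completion --- */
    /* renormalise implicitly: draw u in [0, cum) instead of dividing every entry by cum */
    float u = (float)(rand() / ((double)RAND_MAX + 1.0)) * cum;  /* rand() needs <stdlib.h> */
    float acc = 0.0f;
    for (int i = 0; i < nucleus_size; i++) {
        acc += zero_mask[idx[i]];
        if (u < acc) return idx[i];
    }
    return idx[nucleus_size - 1];  /* guard against floating-point rounding */
}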

Output (truncated):

Sampling from a 32-token vocab, 100000 trials each.
Logit peaks: tok 2 (6.0), tok 5 (4.5), tok 11 (3.5).

Raw softmax probs (top 8): tok 2:0.79  tok 5:0.18  tok 11:0.02  tok 17:0.00  ...

Greedy (argmax) — top 8 token frequencies:
   tok=2    100.00%  ██████████████████████████████████████████████████

Temperature T=0.5 (sharper):
   tok=2     94.62%
   tok=5      4.72%
   tok=11     0.62%

Temperature T=1.5 (flatter):
   tok=2     61.38%
   tok=5     22.43%
   tok=11    11.59%
   tok=17     4.20%

Top-k (k=5):
   tok=2     75.61%
   tok=5     16.87%
   tok=11     6.19%
   tok=17     1.32%

Top-p (p=0.9):
   tok=2     81.72%
   tok=5     18.28%        ← nucleus contained just 2 tokens here (high confidence)

Notice the differences:

Greedy always picks tok 2; no diversity at all.
T = 0.5 is close to greedy (95% on tok 2) but allows tok 5 occasionally.
T = 1.5 flattens enough that the long-tail tokens (17, 0, 1, 3) get meaningful mass.
Top-k=5 keeps the five highest-logit tokens, but the bottom three of those are still rare, just as they were in the original distribution.
Top-p=0.9 adapted to a 2-token nucleus here because the top two tokens already cover 97% of mass; everything else got 0%.

— think, then check —

Top-p (nucleus) is generally preferred for chat-quality output. Reason: it adapts to the distribution shape.

How they differ:

Top-k=50 always allows the top 50 tokens to be candidates. If the distribution is sharply peaked (one token at 95% probability), it still allows 49 other tokens, most of which are unlikely. If the distribution is flat (say 100 tokens at about 1% each), it still cuts at 50 and throws away half of the plausible candidates.

Top-p=0.95 takes the smallest set with cumulative probability ≥ 0.95. Sharp distribution: the nucleus has 1-3 tokens. Flat distribution: the nucleus has many tokens. It adapts.
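The contrast can be made concrete by counting candidates. The sketch below is a scaled-down illustration with assumed names (`nucleus_size_sorted`, `topk_size`), and it expects the probabilities to be sorted in descending order already.

```c
#include <assert.h>
#include <stddef.h>

/* Size of the top-p nucleus for a probability vector that is ALREADY sorted
   in descending order: the smallest prefix whose cumulative mass is >= p. */
static size_t nucleus_size_sorted(const float *sorted_probs, size_t n, float p) {
    assert(n > 0);
    float cum = 0.0f;
    for (size_t i = 0; i < n; i++) {
        cum += sorted_probs[i];
        if (cum >= p) return i + 1;
    }
    return n;  /* p exceeds the accumulated mass (rounding): keep everything */
}

/* The top-k candidate count is min(k, n), whatever the distribution looks like. */
static size_t topk_size(size_t n, size_t k) {
    return k < n ? k : n;
}
```

On an 8-token toy vocabulary with p = 0.9, a peaked distribution such as {0.9375, 0.03125, 0.03125, 0, ...} yields a nucleus of 1 token, while a uniform one (0.125 each) yields all 8. With k = 4, `topk_size` returns 4 in both cases.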

Failure modes:

Top-k fails when k is too large for confident distributions: lets in unlikely tokens, causing “derailment” where the generation goes off-topic.
Top-k fails when k is too small for uncertain distributions: cuts off legitimate alternatives, forcing repetition.
Top-p fails when p is too high (close to 1): includes too much tail, allowing rare tokens.
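Top-p fails when the distribution is “bimodal” in the wrong way: one high-confidence token plus a long run of tokens with similar, low probabilities. If the top token alone falls short of p, the nucleus has to reach into that flat tail to make up the difference, and which tail tokens land inside the cutoff is close to arbitrary.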

Hybrid approach (often used in production): apply temperature first, then BOTH top-k and top-p. Take the intersection of “in top-k AND in top-p nucleus.” This catches the failure modes of both: top-k caps the candidate set at a reasonable max; top-p adapts within that cap.

The typical chat-LLM setting: T = 0.7, top-p = 0.9, top-k = 50. Conservative diversity with adaptive cutoff and a max-candidates safety bound.

↳ §15.3 sampling strategies

Beam search: explore multiple continuations

Beam search keeps the K most likely partial sequences at each step: it expands all of them, then keeps the K most likely of the resulting K · V candidates.

Beam search with beam width K:

Initialise: K candidate sequences, each = the input prompt (in practice only one copy is expanded at the first step; otherwise the top K would be K copies of the same continuation).
At each step:
For each candidate, run forward pass → logits.
Generate K · V candidate next-sequences (each candidate × each vocab token).
Score each by Σ_i log p(token_i | tokens_<i).
Keep top K of K · V.
Stop when all K candidates have ended with </s>.

Beam search produces higher-likelihood sequences than greedy or stochastic sampling. The trade-off: it’s K × slower (K parallel forward passes per step), produces SHORTER sequences by default (longer sequences have lower joint probability), and typically produces less diverse and more “vanilla” output. Used heavily in translation and summarization (where you want the best-likelihood translation, not a creative one); rarely used in chat (where “best likelihood” doesn’t equal “best chat”).

Speculative decoding (preview of Ch.S)

The newest entry: speculative decoding doesn’t change WHAT you sample, only HOW FAST. A small draft model proposes K tokens; the big target model verifies all K in parallel. For typical models, expected accepted tokens per verify ≈ 2-3, so the same target model can generate 2-3× faster wall-clock at no quality cost. Covered in detail in Ch.S; here just noting that it’s orthogonal to the temperature/top-p choice — speculative decoding works with any sampling strategy.

— think, then check —

(a) Coding assistant: Temperature 0.1-0.3 + top-p 0.9 (or top-k 20).

Reason: code has very low tolerance for “creative” variation. The right answer is usually the highest-probability one (modulo a few stylistic alternatives). Low temperature keeps the output close to argmax, while the small remaining randomness lets the sampler break out of a repetition loop or choose between two equally good completions; top-p keeps that randomness away from the tail. Failure modes accepted: occasional “stuck in a loop” if the temperature is too low; lack of variety if the user wanted alternative implementations.

(b) Creative writing: Temperature 1.0-1.2 + top-p 0.95.

Reason: creative output values diversity. A temperature at or slightly above 1 keeps the model's full spread or flattens it a little, allowing surprising word choices. Top-p = 0.95 cuts off the least likely tail (at most 5% of the probability mass) to prevent total derailment. Failure modes: occasional gibberish if the temperature is too high; occasional bland output if too low.

(c) Factual QA: Temperature 0 (greedy) or 0.1-0.2.

Reason: factual answers should be deterministic — the same question should yield the same answer. Temperature 0 makes the output reproducible. Failure modes: if the model’s argmax is wrong, you’ll always get the wrong answer (no random chance of picking the right answer); the model might also pick the “most common but slightly wrong” answer over the “less common but correct” one.

Why beam search is wrong for all three:

Beam search produces high-likelihood sequences, but high-likelihood ≠ high-quality for chat. The “most likely” continuation of “The answer to your question is…” is often a vague hedge. Empirically, beam search produces text that scores higher on n-gram metrics (BLEU, ROUGE) but lower in human preference. It is used in translation (a high-precision target) and has been abandoned in chat.

The wider lesson: sampling is a workload-specific design choice. There’s no single “best” sampler. Production systems often expose temperature/top-p as user-facing parameters, with documentation showing the trade-offs.

↳ §15.3 + production deployment

END OF CH.15 — The GPT architecture, end-to-end.
§1 (decoder-only stack: Llama 2 7B forward pass, parameter count, SwiGLU) · §2 (encoder vs decoder vs encoder-decoder: training signal density, why decoder-only won) · §3 (sampling: greedy / temperature / top-k / top-p / beam, with kernel comparing all five).

Part IV opens. Next chapter: Ch.16 — Pretraining. The “predict next token” objective at trillion-token scale. Chinchilla scaling laws. The data pipeline. The token budget.