INFERENCE AT SCALE
Section 23.3

Speculative decoding — production deep dive

In §23.2 we established that decode is memory-bandwidth-bound. The H100’s ~1 PFLOPs of compute sits 90%+ idle during decode because HBM can’t feed weights fast enough. Speculative decoding (Leviathan 2022 “Fast Inference from Transformers via Speculative Decoding”; Chen 2023 “Accelerating Large Language Model Decoding with Speculative Sampling”) exploits this gap: a small DRAFT model proposes K tokens quickly; the target model VERIFIES all K in parallel in a single forward pass. That pass streams the weights from HBM once, the same traffic as decoding one token, but yields between 1 and K+1 tokens. Combined with continuous batching and PagedAttention, this is the third pillar of modern LLM serving. The supplementary Ch.S worked through the math; this section frames it for production deployment.

Why speculative decoding works at all

The structural insight:

Decode bottleneck recap (Ch.21):
- For Llama 3 70B in fp16: 140 GB of weights.
- H100 bandwidth: 3.35 TB/s.
- Per-token bandwidth: 140 GB / token → ~42 ms per token at peak BW.
- Per-token compute: ~150 GFLOPs / 1 PFLOP/s = 0.15 ms at peak compute.
- So compute is ~280× underutilised. Time is in bandwidth, not FLOPs.

Speculative decoding insight:
- If we could VERIFY K candidate tokens in ONE forward pass (using ONE weight load), we'd get K tokens for the bandwidth cost of one.
- The verification IS just one forward pass! Feed K input tokens, get K output distributions, check each against the candidate.
- If the draft is right K/K times: K× speedup.
- If the draft is right K-1 / K times on average: still significant gains.
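The arithmetic in the recap can be reproduced with a few lines of Python. This is only a back-of-envelope sketch: the constants are the figures quoted above, `pass_times` is a hypothetical helper name, and the model assumes weights are streamed exactly once per forward pass and ignores KV-cache traffic.

```python
# Back-of-envelope decode cost for one forward pass, using the figures quoted above.
WEIGHT_BYTES = 140e9       # Llama 3 70B in fp16
HBM_BANDWIDTH = 3.35e12    # bytes/s, H100 peak
FLOPS_PER_TOKEN = 150e9    # FLOPs per token, as quoted
PEAK_FLOPS = 1e15          # FLOP/s, ~1 PFLOP/s


def pass_times(tokens_in_pass: int) -> tuple[float, float]:
    """Return (weight-streaming time, compute time) in seconds for one pass."""
    bandwidth_time = WEIGHT_BYTES / HBM_BANDWIDTH   # weights are read once per pass
    compute_time = tokens_in_pass * FLOPS_PER_TOKEN / PEAK_FLOPS
    return bandwidth_time, compute_time


# 1 token = plain decode; 5 tokens = verifying K = 4 drafts plus the bonus position.
for tokens in (1, 5):
    bw, fl = pass_times(tokens)
    print(f"{tokens} token(s): bandwidth {bw * 1e3:.1f} ms, compute {fl * 1e3:.2f} ms")
```

Under these assumptions the five-token verification pass still spends almost all of its time streaming weights, which is why the extra positions are close to free.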

The trick is the “parallel verification.” Normally autoregressive generation is sequential: you can’t compute token N+2 without first knowing token N+1 (which depends on the model’s actual output). Speculative decoding bypasses this by GUESSING N+1 (and N+2, ...), computing the model’s true distributions at all guessed positions in parallel, and keeping each guess only while it passes an acceptance test against those distributions. Under greedy decoding that test reduces to “the guess matches the target’s top token.”

The algorithm

Speculative decoding for one round:

Setup:
- Target model M (large): the one we want to deploy.
- Draft model M' (small, cheap): a fast approximator, often a distilled M.
- K = number of draft tokens per round (typically 4-8).

Step 1: DRAFT
- Run M' autoregressively for K steps from the current context.
- This generates K candidate tokens t'_1, t'_2, ..., t'_K and the distributions q_1(·), q_2(·), ..., q_K(·) they were sampled from.

Step 2: VERIFY
- Run M ONCE on the input [context, t'_1, t'_2, ..., t'_K].
- This returns M's distribution at each position: p_1(·), p_2(·), ..., p_K(·), plus p_{K+1}(·) for the position after t'_K.
- This is ONE forward pass (parallel over the positions).

Step 3: REJECTION SAMPLING
- For i = 1, 2, ..., K:
  - Sample u ~ Uniform[0, 1].
  - Compute acceptance ratio: r = p_i(t'_i) / q_i(t'_i).
  - If u < min(1, r): ACCEPT t'_i; continue to i+1.
  - Else: REJECT t'_i. Sample t_i from the corrected distribution max(0, p_i(·) - q_i(·)) / Z_i, not from p_i itself. Break (don't process tokens i+1..K).

Step 4: BONUS
- If all K tokens accepted: sample one more token t_{K+1} ~ p_{K+1}.

Result: 1 to K+1 tokens emitted per verification round (the accepted draft tokens plus one token from the target).
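The verification half of the round can be sketched in plain Python. This is a toy illustration, not a serving implementation: distributions are lists of probabilities over a three-token vocabulary, the same p and q are reused at every position (a real target's distributions depend on the prefix), and `sample`, `residual`, and `speculative_round` are hypothetical helper names.

```python
import random


def sample(dist, rng):
    """Draw a token id from a list of probabilities."""
    return rng.choices(range(len(dist)), weights=dist, k=1)[0]


def residual(p, q):
    """Normalised max(0, p - q): the distribution to resample from after a rejection."""
    diff = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    z = sum(diff)
    return [d / z for d in diff] if z > 0 else list(p)


def speculative_round(draft_tokens, q_dists, p_dists, rng):
    """Verify K drafted tokens against the target's distributions.

    draft_tokens: K token ids proposed by the draft
    q_dists:      K draft distributions the tokens were sampled from
    p_dists:      K+1 target distributions from the single verification pass
    Returns the tokens emitted this round (between 1 and K+1).
    """
    emitted = []
    for i, tok in enumerate(draft_tokens):
        p, q = p_dists[i], q_dists[i]
        if rng.random() < min(1.0, p[tok] / q[tok]):
            emitted.append(tok)                     # accept the draft token
        else:
            emitted.append(sample(residual(p, q), rng))  # reject: resample, stop
            return emitted
    emitted.append(sample(p_dists[len(draft_tokens)], rng))  # bonus token
    return emitted


rng = random.Random(0)
p = [0.6, 0.3, 0.1]   # toy target distribution (assumed)
q = [0.3, 0.5, 0.2]   # toy draft distribution (assumed)
K = 4

draft = [sample(q, rng) for _ in range(K)]
emitted = speculative_round(draft, [q] * K, [p] * (K + 1), rng)
print("draft:", draft, "emitted:", emitted)

# Empirical check: the first emitted token should follow p, not q.
counts = [0, 0, 0]
for _ in range(20000):
    d = [sample(q, rng) for _ in range(K)]
    counts[speculative_round(d, [q] * K, [p] * (K + 1), rng)[0]] += 1
freqs = [c / 20000 for c in counts]
print("first-token frequencies:", freqs)
```

The frequency check at the end is the practical meaning of "exact": even though every proposal comes from q, the emitted token frequencies track p.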

Speculative decoding is provably correct: the accepted tokens are drawn from M’s exact distribution, not an approximation. The math (worked through in Ch.S) shows the rejection sampling preserves M’s distribution because it accounts for the discrepancy between q and p.

Expected acceptance

How many tokens do you expect to accept per round? That depends on how well M’ approximates M:

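Expected tokens per round (mathematical):
- Let α_i = acceptance rate at position i, and assume α_i = α at every position, independently.
- The first i draft tokens are all accepted with probability α^i, and every round emits one extra token from the target (the correction after a rejection, or the bonus after K acceptances).
- E[accepted draft tokens] = α + α² + α³ + ... + α^K (geometric series) = α · (1 - α^K) / (1 - α)
- E[tokens per round] = 1 + α + α² + ... + α^K = (1 - α^(K+1)) / (1 - α)

For typical draft models:
- α ≈ 0.7-0.85 (draft agrees with target ~75% of the time per token)
- With K = 4, α = 0.8: E[tokens per round] = (1 - 0.8^5) / (1 - 0.8) ≈ 3.36 (≈ 2.36 accepted draft tokens + 1)
- With K = 8, α = 0.8: E[tokens per round] ≈ 4.33 (diminishing returns past K=4-6)

Effective speedup vs no speculation:
- Per round we do: 1 target forward pass + K draft forward passes.
- The K draft forward passes are CHEAP (small model, maybe 10× faster).
- So per round wall-clock at K = 4: 1 + 4/10 = 1.4 target-pass equivalents.
- Ideal speedup: 3.36 / 1.4 ≈ 2.4×; across α = 0.7-0.85 this is roughly 2.0× to 2.6× before overheads.
- At K = 8 the round costs 1.8 target-pass equivalents, so 4.33 / 1.8 ≈ 2.4× again: the extra draft passes cancel the extra tokens.

For really good drafts (α ≥ 0.9, e.g., MTP):
- Speedup can be 3-5×.

The kernel from Ch.S (the speculative decoding simulator) lets you check this empirically: at α = 0.8 and K = 4, the simulated average should converge to (1 - α^(K+1)) / (1 - α) ≈ 3.4 tokens emitted per round, i.e. about 2.4 accepted draft tokens plus the one token the target always contributes.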
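As a stand-in for that kind of simulator (this is not the Ch.S kernel), the closed form can be compared against a short Monte Carlo run. The sketch assumes a constant, independent per-token acceptance rate α and counts every token emitted per round, including the correction or bonus token from the target. `expected_tokens`, `ideal_speedup`, and `simulate` are hypothetical helper names, and `draft_cost` is the assumed time of one draft pass as a fraction of one target pass.

```python
import random


def expected_tokens(alpha: float, k: int) -> float:
    """Tokens emitted per round: 1 + alpha + ... + alpha**k."""
    return sum(alpha ** i for i in range(k + 1))


def ideal_speedup(alpha: float, k: int, draft_cost: float) -> float:
    """Tokens per round divided by round cost in target-pass equivalents."""
    return expected_tokens(alpha, k) / (1.0 + k * draft_cost)


def simulate(alpha: float, k: int, rounds: int, seed: int = 0) -> float:
    """Monte Carlo estimate of tokens emitted per round."""
    rng = random.Random(seed)
    total = 0
    for _ in range(rounds):
        accepted = 0
        while accepted < k and rng.random() < alpha:
            accepted += 1            # draft token accepted, move to next position
        total += accepted + 1        # plus the correction or bonus token
    return total / rounds


for k in (2, 4, 6, 8):
    print(
        f"K={k}: closed form {expected_tokens(0.8, k):.2f}, "
        f"simulated {simulate(0.8, k, 200_000):.2f}, "
        f"ideal speedup {ideal_speedup(0.8, k, 0.1):.2f}x"
    )
```

Sweeping K this way makes the diminishing returns visible: tokens per round keep rising with K, while the ideal speedup flattens once the added draft passes cost as much as the tokens they buy.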

Production variants

Three flavors of speculative decoding ship in production systems:

1. Standard speculative (Leviathan 2022):
- Separate small draft model.
- Pros: any small model from the same family (sharing the tokenizer) works (e.g., Llama-1B drafts for Llama-70B).
- Cons: two model copies in memory; the draft model is "free" only if you have a smaller member of the same family already on disk.

2. MTP (Multi-Token Prediction; DeepSeek V3, Qwen 3.6):
- Train the target model with EXTRA OUTPUT HEADS that predict tokens at offset +1, +2, ..., +K.
- At inference: the K heads produce the K candidates in ONE forward pass (no separate draft model).
- Pros: little training overhead (the auxiliary loss can also help generalisation); heads ship with the model.
- Cons: requires retraining/redistributing the model.

3. EAGLE (Li 2024):
- Small autoregressive head over the target's penultimate-layer FEATURES.
- Higher acceptance than a separate draft model (~0.85+) because features are smoother than discrete tokens.
- Pros: highest acceptance rate of any non-MTP scheme.
- Cons: most complex; requires careful uncertainty handling; trained head needs to learn the feature space.

What's shipping where (2024-2025):
- vLLM: standard speculative + Medusa + EAGLE (configurable).
- llama.cpp: standard speculative + MTP (PR #22673 added MTP support in 2024).
- TensorRT-LLM: standard speculative + custom variants.
- DeepSeek V3: MTP built-in (no separate draft model needed).
- Qwen 3.6: MTP at inference.

Setup:

Target distribution p(t). Draft proposes t’ with distribution q(t’). We want to ACCEPT t’ such that the resulting accepted samples follow p (not q).

Acceptance rule:

r(t’) = p(t’) / q(t’).

Sample u ~ Uniform[0, 1].

Accept if u < min(1, r(t’)).

Why this gives samples from p:

The probability that a SAMPLED t’ (from q) is accepted is:

P(accept | t’) = min(1, p(t’) / q(t’)).

So the probability of a specific token t’ ending up in the output is:

P(t’ in output) = q(t’) · P(accept | t’) = q(t’) · min(1, p(t’)/q(t’)) = min(q(t’), p(t’)).

If p(t’) < q(t’): we keep p(t’) worth of mass (= min = p(t’)). The “extra” q(t’) - p(t’) is rejected.

If p(t’) > q(t’): we keep ALL q(t’) worth (= min = q(t’)). But we need more mass at this token (since p > q). That missing mass is supplied by the RESAMPLING step on the rejection branch.

The rejection branch:

If t’ is rejected, we sample from a CORRECTED distribution:

p’(t) = max(0, p(t) - q(t)) / Z, where Z = Σ_t max(0, p(t) - q(t)) is the normaliser.

This puts extra mass on the tokens where p > q (which q under-sampled).

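The total marginal distribution of the emitted token:

P(t in output) = P(t from accepted path) + P(t from rejection path)

= q(t) · min(1, p(t)/q(t)) + P(rejection) · p’(t)

= min(p(t), q(t)) + P(rejection) · max(0, p(t) - q(t)) / Z, where Z = Σ_t max(0, p(t) - q(t)) is the normaliser of p’.

The rejection probability is P(rejection) = 1 - Σ_t min(p(t), q(t)) = Σ_t (p(t) - min(p(t), q(t))) = Σ_t max(0, p(t) - q(t)) = Z, so it cancels the normaliser:

= min(p(t), q(t)) + max(0, p(t) - q(t)) = p(t).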

The token’s marginal distribution in the output EXACTLY equals p(t). The accepted tokens are drawn from the target’s true distribution, despite using the draft for proposal.
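The identity can be checked exactly on a small example. The sketch below uses rational arithmetic so there is no floating-point slack; the two distributions are arbitrary toy values chosen for illustration, and `output_marginal` is a hypothetical helper name.

```python
from fractions import Fraction as F


def output_marginal(p, q):
    """Exact distribution of the emitted token under accept-or-resample."""
    accept_mass = [min(pi, qi) for pi, qi in zip(p, q)]   # q(t) * min(1, p/q)
    p_reject = 1 - sum(accept_mass)                       # total rejected mass
    resid = [max(F(0), pi - qi) for pi, qi in zip(p, q)]  # unnormalised p'
    z = sum(resid)
    if z == 0:                                            # p == q: nothing is rejected
        return accept_mass
    return [a + p_reject * r / z for a, r in zip(accept_mass, resid)]


p = [F(6, 10), F(3, 10), F(1, 10)]   # toy target distribution (assumed)
q = [F(3, 10), F(5, 10), F(2, 10)]   # toy draft distribution (assumed)

marginal = output_marginal(p, q)
print(marginal == p)   # True: the emitted token follows p exactly
```

Here the draft over-proposes tokens 1 and 2 and under-proposes token 0, so all of the rejected mass (3/10) is routed back to token 0, which is exactly the shortfall.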

The intuition:

For tokens where q approximates p well: most proposals accepted (~100%).

For tokens where q is too HIGH: some proposals rejected → resampled from corrected p’.

For tokens where q is too LOW: q under-proposes → corrected by the resampling step.

Net: the output is exactly p, but with most computation done by the cheap q.

Why this matters for LLM serving:

It means speculative decoding is “free” in output quality: same model behaviour, same probabilities, same temperature, just faster. The price is extra compute and memory for drafting, not a quality trade-off. That is a large part of why it has been adopted so widely.

When speculative decoding helps the least

The honest accounting: speculative decoding doesn’t help equally for all workloads.

Speedup as a function of workload:
- Batch size = 1, decode-only: Highest speedup (2.5-3×). Decode is most bandwidth-bound; speculation saves the most.
- Batch size = 8-16: Moderate speedup (1.5-2×). Decode is still memory-bound but less acutely.
- Batch size = 64+: Marginal speedup (1.1-1.3×). At large batch size, decode becomes compute-bound. Speculation's "save bandwidth" advantage diminishes.
- Prefill: NO speedup. Prefill is already compute-bound and parallel. Speculation doesn't apply.

GPU utilisation matters:
- If your serving infra has high batch size + high utilisation, speculative decoding is a minor win.
- If you serve mostly small batches (per-user low-latency), it's a major win.
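The batch-size trend can be illustrated with a crude roofline sketch. It is a rough model under loud assumptions, not a predictor: a forward pass takes the larger of the weight-streaming time and the compute time, the constants are the Llama 3 70B / H100 figures from earlier in this section, each draft pass is assumed to cost `draft_cost` times a plain target decode pass, and KV-cache reads, attention cost, and sub-peak kernel efficiency are all ignored. Absolute numbers therefore come out more optimistic than the measured ranges listed above; only the shape of the curve is the point. `pass_time` and `spec_speedup` are hypothetical helper names.

```python
WEIGHT_BYTES = 140e9       # Llama 3 70B in fp16
HBM_BANDWIDTH = 3.35e12    # bytes/s, H100 peak
FLOPS_PER_TOKEN = 150e9    # FLOPs per token
PEAK_FLOPS = 1e15          # FLOP/s


def pass_time(tokens_in_pass: int) -> float:
    """Roofline time for one forward pass: bandwidth-bound or compute-bound."""
    bandwidth_time = WEIGHT_BYTES / HBM_BANDWIDTH
    compute_time = tokens_in_pass * FLOPS_PER_TOKEN / PEAK_FLOPS
    return max(bandwidth_time, compute_time)


def spec_speedup(batch: int, alpha: float = 0.8, k: int = 4,
                 draft_cost: float = 0.1) -> float:
    """Per-sequence speedup of speculation over plain decode at a given batch size."""
    tokens_per_round = sum(alpha ** i for i in range(k + 1))
    baseline = pass_time(batch)             # plain decode: 1 position per sequence
    verify = pass_time(batch * (k + 1))     # verification: k+1 positions per sequence
    draft = k * draft_cost * baseline       # assumed draft cost (see lead-in)
    return tokens_per_round * baseline / (verify + draft)


for batch in (1, 16, 64, 128, 256):
    print(f"batch {batch:>3}: speedup {spec_speedup(batch):.2f}x")
```

In this model the crossover arrives when batch × (K+1) positions push the verification pass past the roofline ridge (about 280 tokens in flight with these constants); beyond that point every speculated position costs real compute, and the speedup can fall below 1.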

Expected: 2.5× speedup at α = 0.82, K = 4.

Expected tokens per round, counting the one token the target always emits: (1 - α^(K+1)) / (1 - α) = (1 - 0.82^5) / 0.18 ≈ 3.50 tokens.

If draft is 10× faster than target: per round work = 1 target + 4 drafts/10 = 1.4 target-equivalents.

Naive speedup: 3.50 / 1.4 ≈ 2.5×. (This is the “expected 2.5×”.)

Why measured speedup is 1.6× (lower than expected):

Several possible causes:

1. Draft model is not 10× faster. If draft is 3× faster: per round = 1 + 4/3 = 2.33 target-equivalents. Speedup: 3.50 / 2.33 ≈ 1.5×. Significantly lower than predicted, and enough on its own to account for a measured 1.6×.

This is THE most common cause. Draft models are often slower than expected because: smaller model means less efficient HBM utilisation; routing through PyTorch dispatch adds overhead; the draft has its own KV cache to manage.

2. Acceptance rate measured per-token but per-position varies. α=0.82 might be the AVERAGE, but specific positions (e.g., the first draft after a query) have α=0.6. The first rejection determines accepted-per-round; if it’s early, fewer tokens are accepted than the geometric formula predicts.

3. Verification overhead. Setting up the verification batch (concatenating draft tokens, managing KV cache during verification, handling batched sampling) adds overhead not captured in the theoretical formula. Particularly with PagedAttention, the block table management for the speculative tokens is non-trivial.

4. Continuous batching interaction. If you’re running speculative decoding within a continuous-batched system, the verification batch and other in-flight requests’ decodes share GPU resources. This adds contention.

5. Memory bandwidth saturation at the wrong point. If multiple speculative-decoding requests verify in parallel, they collectively saturate HBM bandwidth, eliminating the per-request benefit.

6. Draft mismatch. If the draft model is, e.g., trained on different data or with different alignment, α decreases over time as the conversation drifts from training distribution.

7. Sampling parameters mismatch. If draft uses a different temperature than the target, acceptance rate drops. The output stays exact for any q, provided the q in the acceptance ratio is the distribution the draft actually sampled from, but α depends on how much q and p overlap.

How to diagnose:

  • Profile: measure target forward pass time and draft forward pass time separately.
  • Check actual acceptance rate distribution per token position.
  • Measure overhead of speculative-decoding bookkeeping vs raw forward passes.
  • Test with different K values to find the actual optimum.
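The per-position check in the list above can be sketched as follows. It assumes you can log, for each round, how many draft tokens were accepted (0 to K); that log format and the helper names `per_position_acceptance` and `expected_tokens_from_rates` are hypothetical, not any serving framework's API.

```python
def per_position_acceptance(accepted_counts, k):
    """Conditional acceptance rate at each draft position.

    accepted_counts: per-round number of accepted draft tokens (0..k).
    Position i is only attempted if positions 0..i-1 were all accepted.
    """
    rates = []
    for pos in range(k):
        reached = sum(1 for n in accepted_counts if n >= pos)
        passed = sum(1 for n in accepted_counts if n > pos)
        rates.append(passed / reached if reached else float("nan"))
    return rates


def expected_tokens_from_rates(rates):
    """Tokens emitted per round given per-position rates (target token included)."""
    total, survive = 1.0, 1.0
    for r in rates:
        survive *= r          # probability that all positions so far were accepted
        total += survive
    return total


# Hypothetical log of accepted-draft counts from eight rounds with K = 4.
log = [4, 0, 2, 4, 1, 0, 3, 4]
rates = per_position_acceptance(log, 4)
print("per-position rates:", [round(r, 3) for r in rates])
print("tokens per round:", expected_tokens_from_rates(rates))

# Same average rate, different shape: a weak first position costs more than the mean suggests.
uniform = expected_tokens_from_rates([0.8125] * 4)
skewed = expected_tokens_from_rates([0.6, 0.85, 0.9, 0.9])
print(f"uniform 0.8125: {uniform:.2f}, weak first position: {skewed:.2f}")
```

The last comparison is cause 2 in miniature: both profiles average 0.8125 per position, but the one with a weak first position emits noticeably fewer tokens per round because an early rejection discards everything after it.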

Typical fixes that close the gap:

  • Use a TIGHTER draft model (closer to target architecture, distilled from target).
  • Increase K only if the draft is robustly accurate (don’t waste draft compute on tokens likely to reject).
  • Reduce verification overhead through better kernel implementations (CUDA Graphs, kernel fusion).
  • Use MTP or EAGLE for higher α; both also remove the separate draft model, leaving only the small cost of the extra heads.

Real production speculative decoding achieves 1.5-2× speedup consistently; getting 3× requires careful tuning and a well-matched draft model.


Standard speculative (separate draft):

  • Use a small model from the same family (e.g., Llama 3 1B as draft for Llama 3 70B).
  • Per round: K forward passes through draft + 1 through target.
  • Pros: no target retraining needed; can mix-and-match models.
  • Cons: draft consumes its own HBM + bandwidth; draft α ~ 0.7-0.85.

MTP (Multi-Token Prediction):

  • Train the target with K extra output heads that predict tokens at offsets +1, +2, …, +K.
  • Per round: 1 forward pass through target (which produces K candidates via the extra heads) + 1 verification.
  • Pros: no separate draft model; the heads ship with the target; α can be very high (~0.95) because the heads have access to the same internal representations.
  • Cons: requires retraining the target model with the extra-head auxiliary loss; existing models without MTP can’t use this directly.

When MTP wins:

1. You CONTROL the training: if you’re training a new model anyway, adding MTP heads costs little (1-2% slower training) and gives 2-3× inference speedup. Net win.

2. The α benefit is decisive: at α = 0.95 vs 0.8 with K = 4, expected tokens per round, (1 - α^(K+1)) / (1 - α), go from ~3.4 to ~4.5, a significant improvement.

3. Single-model serving: no separate draft to load, route, or manage. Simpler infra.

When standard speculative wins:

1. You’re serving EXISTING models: can’t retrain Llama 3 70B; need to add speculation post-hoc.

2. You have a family of model sizes: Llama 3 1B is “free” if you also have Llama 3 70B; can serve as draft without extra investment.

3. Flexibility: want to swap drafts based on workload (e.g., domain-specific draft for code, general draft for chat).

What’s actually shipping:

  • DeepSeek V3 / V4: MTP. Trained with multi-token heads; the DeepSeek-V3 technical report cites about 1.8× tokens per second from MTP-based speculation with one extra predicted token.
  • Qwen 3.6: MTP variant.
  • Llama 3 / 3.5 / 4: no MTP yet; standard speculative with Llama-3-1B or similar as draft.
  • llama.cpp: PR #22673 added MTP support late 2024, generic for any model with MTP heads.

The trajectory:

MTP is becoming standard for new model releases. Frontier labs increasingly bake MTP into the training pipeline because the inference savings compound. Backward-compatible inference (works with non-MTP models too) keeps standard speculative as a fallback.

In 2-3 years, expect most new LLM releases to ship with MTP heads, eliminating the need for separate draft models.

END OF CH.23 — Inference at scale.
§1 (KV cache + PagedAttention: paging fixes 80%+ internal fragmentation, 40× more concurrent requests) · §2 (continuous batching + prefill/decode disaggregation: GPU utilisation near 100%, hardware specialisation by phase) · §3 (speculative decoding: 2-3× speedup, MTP variants, the math of expected acceptance).

Part V continues: Ch.24 (training at scale) covers data/tensor/pipeline/expert parallelism, all-reduce, ZeRO/FSDP. Then Ch.25 (Quantization, done earlier) closes the systems half of the book.