Speculative decoding & multi-token prediction

How modern LLM runtimes (llama.cpp, vLLM, TensorRT-LLM) sustain a 2–3× throughput boost without retraining the base model. Covers the math of speculative decoding, the Medusa / EAGLE / MTP family of "extra heads," and a walk through llama.cpp PR #22673.


§1 The autoregressive bottleneck & speculative decoding
An LLM produces one token per forward pass. Speculative decoding turns that into one verifier pass per several tokens: a small draft model proposes K tokens, the large verifier scores all of them in a single pass and accepts a prefix, and the accept/reject rule guarantees that the output distribution is unchanged. This is the source of the 2–3× speedup in modern inference stacks.
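The accept/reject rule mentioned above can be sketched in a few lines. This is a minimal illustration of the standard speculative sampling rule, not code from any of the runtimes named here. The function name `speculative_step` and the list-of-lists representation of probabilities are assumptions made for the example.

```python
import random

def speculative_step(draft_probs, target_probs, draft_tokens, rng):
    """One verification step of speculative sampling (illustrative sketch).

    draft_probs[i]  : draft distribution q_i that draft_tokens[i] was sampled from
    target_probs[i] : verifier distribution p_i at the same position;
                      has len(draft_tokens) + 1 entries (the last one is for
                      the bonus token after a fully accepted draft)
    Returns the accepted prefix plus exactly one token chosen by the verifier.
    """
    out = []
    for i, tok in enumerate(draft_tokens):
        p, q = target_probs[i], draft_probs[i]
        # Accept the drafted token with probability min(1, p(tok) / q(tok)).
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)
            continue
        # Rejected: resample from the residual max(0, p - q).
        # rng.choices normalises the weights, so no explicit division is needed.
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        out.append(rng.choices(range(len(p)), weights=residual)[0])
        return out
    # Every drafted token was accepted: take a bonus token from the verifier.
    bonus = target_probs[len(draft_tokens)]
    out.append(rng.choices(range(len(bonus)), weights=bonus)[0])
    return out

if __name__ == "__main__":
    rng = random.Random(0)
    q = [0.9, 0.1]   # toy draft distribution over a 2-token vocabulary
    p = [0.5, 0.5]   # toy verifier distribution
    draft = rng.choices([0, 1], weights=q)[0]
    print(speculative_step([q], [p, p], [draft], rng))
```

Each call emits between 1 and K+1 tokens for a single verifier pass. The residual resampling on rejection is what keeps the emitted tokens distributed according to the verifier rather than the draft.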
§2 Medusa, EAGLE, MTP — self-speculative variants
The headache with classic speculative decoding is maintaining two models. The 2024–2025 wave (Medusa, EAGLE, MTP) builds the "draft" into the target itself as auxiliary heads, raising acceptance from roughly 70% to roughly 95% and eliminating the separate draft model entirely.
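To see why the acceptance rate matters so much, here is a rough sketch of the expected number of tokens emitted per verifier pass. It assumes each drafted token is accepted independently with the same probability `alpha`, which is a simplification, and the draft length `k = 4` is a hypothetical value chosen for illustration. The function name is likewise an assumption.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens emitted per verifier pass.

    Assumes every drafted token is accepted independently with probability
    alpha, and that the verifier always contributes one extra token
    (a correction on rejection, or a bonus after k acceptances).
    Closed form of 1 + alpha + alpha**2 + ... + alpha**k.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        return float(k + 1)  # geometric-series formula divides by zero here
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)

if __name__ == "__main__":
    for alpha in (0.70, 0.95):
        print(f"alpha={alpha:.2f}, k=4 -> {expected_tokens_per_pass(alpha, 4):.2f} tokens/pass")
```

Under these assumptions, moving from 70% to 95% acceptance with four drafted tokens raises the expected yield from about 2.77 to about 4.52 tokens per verifier pass. This counts tokens only; the wall-clock speedup also depends on how cheap the drafting step is.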
§3 Inside llama.cpp — and what PR
A tour of llama.cpp — ggml's build-then-execute graph pattern, what GGUF actually contains, and how "adding a head" lands in the codebase. Then a walk through PR
