S — Spotlight: production accelerators → Chapter 99
FROM SYSTEMS TO FRONTIER ML

Speculative decoding & multi-token prediction

How modern LLM runtimes (llama.cpp, vLLM, TensorRT-LLM) sustain a 2–3× throughput boost without retraining the model. The math of speculative decoding, the Medusa / EAGLE / MTP family of "extra heads," and a walk through llama.cpp PR #22673.

§1 The autoregressive bottleneck & speculative decoding §2 Medusa, EAGLE, MTP — self-speculative variants §3 Inside llama.cpp — and what PR

← ALL CHAPTERS