S — Spotlight: production accelerators → Chapter 99
FROM SYSTEMS TO FRONTIER ML
Speculative decoding & multi-token prediction
How modern LLM runtimes (llama.cpp, vLLM, TensorRT-LLM) sustain a 2–3× throughput boost without retraining the model. The math of speculative decoding, the Medusa / EAGLE / MTP family of "extra heads," and a walk through llama.cpp PR #22673.
§1 The autoregressive bottleneck & speculative decoding
§2 Medusa, EAGLE, MTP — self-speculative variants
§3 Inside llama.cpp — and what PR