Medusa, EAGLE, MTP — self-speculative variants
Classical speculative decoding (§S.1) needs two models: a small fast drafter and a large slow verifier. That’s operationally awkward: a second set of model files, extra cold-start time, and extra memory when both have to be resident. The 2024–2025 wave of work eliminates the second model entirely by adding lightweight auxiliary heads to the target. The trunk of the target model runs once per round, and its hidden states feed small heads that draft several future tokens. Three families dominate: Medusa (extra MLP heads on the trained-and-frozen target), EAGLE (an auxiliary autoregressive head that predicts features rather than tokens), and MTP (Multi-Token Prediction baked into the training objective from the start). Each is a different point in the design space, and each aims for a higher per-token acceptance probability than a separately trained drafter typically achieves. llama.cpp’s PR #22673, covered in §S.3, integrates the MTP variant.
What “head” means here, exactly
A head in a transformer is a small output module attached to the shared trunk. The standard language-modeling head is a single linear projection from the final hidden state (typically 4096-d) to the vocabulary logits (typically 32K-d or 128K-d):
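As a minimal sketch of that projection, the snippet below computes logits = W_vocab · h_t and turns them into a distribution. The names `lm_head`, `softmax`, `W_vocab`, and `h_t` are illustrative, and the sizes are toy values rather than the 4096-d and 32K/128K figures quoted above.

```python
import math
import random

def lm_head(hidden, weight):
    """Project a hidden state (length d) to vocabulary logits (length V).

    `weight` is a V x d matrix stored as a list of rows.
    """
    return [sum(w * h for w, h in zip(row, hidden)) for row in weight]

def softmax(logits):
    """Turn logits into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy sizes for illustration only.
d_model, vocab_size = 8, 16
rng = random.Random(0)
W_vocab = [[rng.gauss(0.0, 0.1) for _ in range(d_model)] for _ in range(vocab_size)]
h_t = [rng.gauss(0.0, 1.0) for _ in range(d_model)]

logits = lm_head(h_t, W_vocab)   # one logit per vocabulary entry
probs = softmax(logits)          # next-token distribution
next_token = max(range(vocab_size), key=lambda i: logits[i])  # greedy pick
```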
This is the head every plain decoding step uses. The whole point of the Medusa / EAGLE / MTP family is to add more of these — each predicting a token at a different future offset — so that one trunk forward pass produces multiple speculative tokens. The viz below contrasts the four approaches; they differ in what the additional heads predict and when they were trained.
Medusa — multi-head on top of a frozen trunk
Cai, Li, Geng, Peng, Lee, Chen, Dao, “Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads” (ICML 2024). The simplest of the self-speculative family. Take a fully trained LLM. Freeze the trunk. Add K small MLP heads on top.
Each Medusa head is a few-layer MLP plus the vocabulary projection. Training: fine-tune only the heads on the same next-token data, with each head’s loss computed against the token at that head’s future offset. Medusa-1 keeps the trunk frozen entirely, so it is a true drop-in addition and the target’s own outputs are unchanged. Medusa-2 fine-tunes the trunk together with the heads, which changes the target’s weights in exchange for a few more points of acceptance.
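A rough sketch of one such head and of the training targets each head sees. The single residual block with a SiLU activation, the function names, and the offset convention (the base LM head predicts `tokens[t + 1]`, head k predicts `tokens[t + 1 + k]`) are assumptions made for illustration; sizes are toy values.

```python
import math
import random

def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def medusa_head(hidden, w_res, w_vocab):
    """One head: residual block h + SiLU(W h), then a vocabulary projection."""
    inner = [silu(x) for x in matvec(w_res, hidden)]
    mixed = [h + r for h, r in zip(hidden, inner)]
    return matvec(w_vocab, mixed)

def head_targets(tokens, k):
    """(position, target token) pairs for head k under the assumed convention:
    the hidden state at position t is trained to predict tokens[t + 1 + k]."""
    return [(t, tokens[t + 1 + k]) for t in range(len(tokens) - 1 - k)]

rng = random.Random(0)
d, V, K = 4, 6, 3  # toy hidden size, vocabulary size, number of heads

def rand_matrix(rows, cols):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

heads = [(rand_matrix(d, d), rand_matrix(V, d)) for _ in range(K)]
h = [rng.gauss(0.0, 1.0) for _ in range(d)]

# All K heads read the same trunk hidden state; there is no dependency between them.
all_logits = [medusa_head(h, w_res, w_vocab) for (w_res, w_vocab) in heads]
pairs = head_targets([10, 11, 12, 13, 14], k=2)
```

In the frozen-trunk setting only `w_res` and `w_vocab` of each head would receive gradients.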
The speculative loop becomes:
- One trunk forward pass produces one hidden state.
- All K heads run in parallel on that hidden state, producing K candidate logits.
- Sample candidate tokens; build a tree of continuations. Medusa verifies multiple candidate sequences in one verifier pass using tree attention: an attention mask under which each candidate token sees only its own ancestors in the tree.
- Apply the rejection rule from §S.1.
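The loop above can be sketched in a few lines. This is a simplification under stated assumptions: candidates are the Cartesian product of each head’s top tokens, verification is greedy exact-match rather than the §S.1 rejection rule, and candidates are checked one by one instead of in a single tree-attention pass. `target_next` is a hypothetical stand-in for the target model’s greedy next token.

```python
from itertools import product

def top_k_indices(logits, k):
    """Indices of the k largest logits, best first."""
    return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]

def build_candidates(head_logits, widths):
    """Every path that takes one of the top-widths[i] tokens from head i."""
    per_head = [top_k_indices(l, w) for l, w in zip(head_logits, widths)]
    return [list(path) for path in product(*per_head)]

def accepted_length(candidate, context, target_next):
    """Number of leading candidate tokens the target would also have produced."""
    prefix = list(context)
    n = 0
    for tok in candidate:
        if target_next(prefix) != tok:
            break
        prefix.append(tok)
        n += 1
    return n

def best_candidate(candidates, context, target_next):
    """Candidate with the longest accepted prefix."""
    return max(candidates, key=lambda c: accepted_length(c, context, target_next))

# Toy example: two heads over a 4-token vocabulary.
head_logits = [[0.1, 2.0, 1.5, 0.0], [1.0, 0.2, 3.0, 0.1]]
candidates = build_candidates(head_logits, widths=[2, 2])

def target_next(prefix):
    """Hypothetical target: always continues with (last token + 1) mod 4."""
    return (prefix[-1] + 1) % 4

best = best_candidate(candidates, context=[0], target_next=target_next)
```

Wider `widths` raise the chance that some path is accepted, at the cost of more tokens in the verification pass.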
Reported speedup: over 2.2× for Medusa-1 and 2.3–3.6× for Medusa-2, on Vicuna-7B/13B/33B. The trunk-frozen variant is the operationally important one because you can ship Medusa heads as a separate small file alongside an existing model.
EAGLE — speculate at the feature level
Li, Wei, Zhang, Zhang, “EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty” (ICML 2024). EAGLE’s key insight: predicting features is easier than predicting tokens. The penultimate-layer hidden state (“features”) evolves smoothly across positions — much more so than the discrete token distribution. So you can train a small autoregressive head that takes features as input and produces predicted features as output:
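The head takes both the current feature and the token already sampled from it, that is, the token sequence shifted one step ahead. Knowing which token was actually sampled resolves the feature-uncertainty problem the title alludes to. Each predicted feature is passed through the target’s own LM head to get the next draft token, which is then fed back into the head for the following step.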
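A toy sketch of that drafting loop. The single linear layer `w_draft` stands in for EAGLE’s small autoregressive head, greedy decoding stands in for sampling, and all names and sizes are illustrative assumptions. In this sketch a token is decoded at every draft step because the next step needs it as input.

```python
import random

def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])

def eagle_draft(feature, token, steps, embed, w_draft, w_vocab):
    """Draft `steps` tokens by extrapolating features autoregressively.

    feature: the target's feature at the last verified position
    token:   the token already sampled from that feature
    """
    drafted = []
    for _ in range(steps):
        x = feature + embed[token]                # concatenate [feature ; token embedding]
        feature = matvec(w_draft, x)              # predicted next feature
        token = argmax(matvec(w_vocab, feature))  # decode with the target's LM head
        drafted.append(token)
    return drafted

rng = random.Random(0)
d, V = 4, 5  # toy feature size and vocabulary size

def rand_matrix(rows, cols):
    return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]

embed = rand_matrix(V, d)        # token embeddings (shared with the target)
w_draft = rand_matrix(d, 2 * d)  # toy draft head: maps [feature ; embedding] to a feature
w_vocab = rand_matrix(V, d)      # the target's LM head, reused unchanged
feature0 = [rng.gauss(0.0, 1.0) for _ in range(d)]

draft = eagle_draft(feature0, token=2, steps=3,
                    embed=embed, w_draft=w_draft, w_vocab=w_vocab)
```

The sequential dependency is visible in the loop: step i+1 cannot start until step i has produced both a feature and a token.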
Why this helps: features carry richer information than the argmax token. Predicting “what will the next feature be” is a smoother regression than “which of 128K tokens will be sampled next.” So acceptance probability rises: EAGLE typically reports ~85–90% per-token, vs Medusa’s ~80–82%. Li et al. measure 2.7×–3.5× speedup on LLaMA2-Chat 70B, with EAGLE-2 (their iteration) reaching 4× on some benchmarks.
The downside: EAGLE is structurally more complex. The head is autoregressive (you run it K times sequentially) rather than parallel like Medusa, so drafting costs more wall-clock per round, but the higher acceptance more than compensates. The follow-up EAGLE-2 also uses the draft head’s confidence to decide how far and how wide to draft.
Why feature-level prediction raises acceptance: two reasons.
(1) Features are smoother than tokens. The next token comes from an argmax over a 128K-vocab logit distribution, which is highly discrete and sensitive — a tiny perturbation in features can flip the argmax. The feature itself, on the other hand, evolves smoothly across positions. Predicting “the next feature vector” is a regression over a 4096-d real-valued space; predicting “the next token” is a classification over 128K classes that’s much harder to nail.
(2) Features carry more information than the discrete sampled token. A token captures only the argmax (or one sample) of the distribution; the feature carries the full state of “what the model is thinking.” Feeding features into the next prediction lets the speculative head use that richer signal.
Operationally: EAGLE pushes per-token p from ~0.82 (Medusa) to ~0.88, which at K = 4–6 is the difference between 2.5× and 3.5× wall-clock speedup. The cost is implementation complexity (the head is autoregressive instead of parallel).
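To see how p and K interact, the standard chain-draft estimate from the speculative decoding literature gives the expected number of tokens emitted per verifier pass as (1 − p^(K+1)) / (1 − p), assuming each draft is accepted independently with probability p. The function name below is illustrative, and the result is tokens per round, not wall-clock speedup: drafting and verification overhead pull the realised speedup below it.

```python
def expected_tokens_per_round(p, k):
    """Expected tokens per verifier pass with k chained drafts.

    Assumes each draft token is accepted independently with probability p;
    the round always yields at least one token from the verifier itself.
    """
    if p == 1.0:
        return float(k + 1)
    return (1.0 - p ** (k + 1)) / (1.0 - p)

medusa_like = expected_tokens_per_round(0.82, 5)  # roughly 3.9 tokens per round
eagle_like = expected_tokens_per_round(0.88, 5)   # roughly 4.5 tokens per round
```

Because the expression is a geometric sum in p, a few points of acceptance matter more as K grows.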
MTP — train for it from the start
Gloeckle, Idrissi, Rozière, Lopez-Paz, Synnaeve, “Better & Faster Large Language Models via Multi-token Prediction” (ICML 2024). Subsequently adopted in DeepSeek-V3 (arXiv:2412.19437, 2024) and Qwen3.6. The most invasive variant — and the one llama.cpp’s PR #22673 implements: train the LLM with N output heads from day one.
The model has N output heads, each predicting the token at offset k = 1, 2, …, N. All N losses are summed (with optional per-head weights λ_k) during training. At inference, the heads are used speculatively:
- One trunk pass produces the hidden state.
- The N MTP heads each emit a logit distribution for their target offset.
- Sample N candidate tokens.
- Verify them in one parallel forward pass over the trunk (or just the first layer + the verifier head — implementation detail).
- Accept/reject per §S.1’s Leviathan rule.
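The accept/reject step can be sketched as follows: accept draft token x with probability min(1, p(x)/q(x)), and on the first rejection resample from the normalised residual max(0, p − q). This is a minimal sketch assuming the draft distributions `q_dists` are the ones the drafts were actually sampled from and that the verifier pass has already produced the target distributions `p_dists`, with one extra entry for the bonus token. The function names are illustrative.

```python
import random

def sample(weights, rng):
    """Draw an index with probability proportional to `weights`."""
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]

def verify_drafts(draft_tokens, q_dists, p_dists, rng):
    """Return the tokens emitted this round.

    q_dists[i]: distribution draft_tokens[i] was sampled from
    p_dists[i]: target distribution at the same position
    p_dists has len(draft_tokens) + 1 entries (the last one is for the bonus token).
    """
    out = []
    for i, tok in enumerate(draft_tokens):
        p, q = p_dists[i], q_dists[i]
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)  # draft accepted
            continue
        # Rejected: resample from the residual and stop the round.
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        out.append(sample(residual, rng))
        return out
    # Every draft accepted: take one bonus token from the target.
    out.append(sample(p_dists[len(draft_tokens)], rng))
    return out

rng = random.Random(0)
dist = [0.2, 0.5, 0.3]
# Draft and target agree exactly, so both drafts are accepted and a bonus token follows.
agree = verify_drafts([1, 2], [dist, dist], [dist, dist, dist], rng)
```

Each round therefore emits between 1 and N + 1 tokens, and the emitted tokens follow the target distribution regardless of how good the heads are.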
MTP reaches high acceptance rates because the heads were trained jointly with the trunk, so their predictions are closely aligned with what the trunk would produce autoregressively. DeepSeek-V3 reports 85–90% acceptance for its additional predicted token; llama.cpp’s PR #22673 reports 72.18% on Qwen3.6-27B with 3 MTP heads, giving roughly 3× generation speedup (7.0 → 21.6 tokens/sec, a ratio of about 3.1). Acceptance figures measured on different models, prompts, and draft depths are not directly comparable.
The unique advantage: with MTP the speculative capability is “free” in the most literal sense. There is no post-hoc head training, no extra files to distribute, and no separate engineering effort to maintain head-trunk alignment. The model ships with the heads. As long as the runtime knows how to use them (which is what PR #22673 adds to llama.cpp), the speedup is automatic.
The training-time bonus. Gloeckle et al. observed that MTP training as an auxiliary objective improves base-model quality by ~12% on HumanEval and ~17% on MBPP for 13B models, with no increase in training time. The reason: predicting multiple future tokens forces the trunk’s representations to encode longer-range structure, which generalises beyond the speculative use case. This is the only inference accelerator in the speculative family that also makes the model better at its actual job.
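The joint objective described above (N heads, losses summed with optional weights λ_k) can be sketched as below. The indexing convention (head k at position t is scored against `tokens[t + k]`), the per-head averaging over positions, and the function names are assumptions made for illustration.

```python
import math

def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

def mtp_loss(head_logits, tokens, weights=None):
    """Weighted sum over heads of the mean cross-entropy at each head's offset.

    head_logits[k - 1][t] holds the logits head k emits at position t,
    scored against tokens[t + k] for k = 1..N.
    """
    n_heads = len(head_logits)
    weights = weights or [1.0] * n_heads
    total = 0.0
    for k, (lam, per_pos) in enumerate(zip(weights, head_logits), start=1):
        losses = [cross_entropy(per_pos[t], tokens[t + k])
                  for t in range(len(tokens) - k)]
        total += lam * sum(losses) / len(losses)
    return total

# Toy check: uniform logits over 4 tokens give log(4) per head.
tokens = [0, 1, 2, 3]
uniform = [[[0.0] * 4 for _ in range(len(tokens))] for _ in range(2)]
loss = mtp_loss(uniform, tokens)
```

Head 1 is the ordinary next-token loss, so setting every other λ_k to zero recovers standard training.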
Classical speculative decoding means maintaining two models. The drafter and verifier must be:
- Trained, distributed, and version-controlled as a pair.
- Both resident in memory (the small draft adds 0.5–10% to memory footprint depending on size ratio).
- Compatible — same tokenizer, same special-token IDs, same calibration. A model and its draft drift apart over fine-tuning iterations.
Self-speculative variants (Medusa/EAGLE/MTP) put the “draft” into the target itself as auxiliary heads. One model file. One memory footprint. No drift. No separate distribution. The only operational delta is “the runtime needs to know how to call the extra heads,” which is exactly what PR #22673 adds to llama.cpp.
Comparison at a glance
| Scheme | Draft model? | Heads trained how | Typical p | Speedup | Memory cost |
|---|---|---|---|---|---|
| Standard speculative | Yes, separate file | (whatever the draft is) | 0.7 | 2–3× | +draft size |
| Medusa-1 | No | Fine-tune heads on frozen target | 0.80 | 2.2× | +K·MLP |
| Medusa-2 | No | Fine-tune heads + unfreeze trunk | 0.82 | 2.3–3.6× | +K·MLP |
| EAGLE | No | Train autoregressive feature head | 0.88 | 2.7–3.5× | +autoregressive head |
| MTP | No | Trained jointly with main model | 0.72–0.90 (reported) | ~3× (Qwen3.6) | +N MTP heads |
The progression — standard → Medusa → EAGLE → MTP — is from “least training change” to “most training change.” Each level up requires more upfront work and gives higher per-token acceptance. MTP is the bleeding edge as of late 2025 because it’s the only scheme that can be a training objective from scratch (rather than a fine-tuning afterthought) — and that’s what gives it the alignment between trunk and heads that drives the acceptance rate up.
Architectural similarity: Medusa and MTP both have N small output heads on top of a shared trunk, each predicting a token at a different future offset. The heads themselves can be the same shape: a residual MLP plus a vocabulary projection.
Acceptance-rate gap: the difference is when the heads are trained.
Medusa trains the heads after the trunk is fully trained and frozen. The trunk’s hidden states were optimised for next-token prediction (one head, offset 1) — they don’t natively encode “what will the token 4 positions from now be.” The added heads are squeezing future predictions out of features that weren’t trained to carry that information. Acceptance is limited by the trunk’s lack of forward-awareness.
MTP trains the trunk and all heads jointly from the start. The trunk’s hidden states are explicitly optimised to support predictions at offsets 1, 2, …, N simultaneously. The features themselves encode multi-step forward structure. The result is heads whose predictions agree with what the trunk would autoregressively produce — much higher acceptance.
The Gloeckle paper also found this multi-head training objective improves the trunk itself (~12% HumanEval lift at 13B) because forcing the representations to be forward-aware regularises them. That’s the rare case where the inference accelerator is also a quality improvement; it’s what makes MTP the dominant approach for newly-trained models from late 2024 onward.
Coming next: §S.3 — Inside llama.cpp, and what PR #22673 actually changes. Computation graphs, GGUF tensor layout, how “adding a head” lands in code.