Medusa, EAGLE, MTP — self-speculative variants

Section S.2

Classical speculative decoding (§S.1) needs two models: a small fast drafter and a large slow verifier. That’s operationally awkward: a second set of model files, extra cold-start time, and extra memory when both have to be resident. The 2024–2025 wave of work eliminates the second model entirely by adding lightweight auxiliary heads to the target. The trunk of the target model runs once per round; its hidden states feed small heads that predict tokens at several future positions. Three families dominate: Medusa (extra MLP heads on the trained-and-frozen target), EAGLE (an auxiliary autoregressive head that predicts features rather than tokens), and MTP (Multi-Token Prediction baked into the training objective from the start). Each is a different point in the design space, each aims for a higher per-token acceptance probability than classical speculative decoding, and each is in active use by inference teams. llama.cpp’s PR #22673, covered in §S.3, integrates the MTP variant.

What “head” means here, exactly

A head in a transformer is a small output module attached to the shared trunk.

Glossary: head (architecture). A small task-specific neural sub-module attached to a shared backbone (the 'trunk'). In transformers, the language-modeling head is the final linear projection from hidden state to vocabulary logits; that's the head all standard generation uses. Adding additional heads (Medusa, MTP) means adding more output projections that share the trunk's representations and produce predictions at different future positions. Then → now: 'head' had been used since the early 2010s for the final classifier layer in image models. Transformers brought the term to NLP via 'attention heads' (Ch.13). Now (2024+) 'head' covers any task-specific output module that consumes shared backbone features: classification heads, language-modeling heads, speculative heads, value heads (RLHF), reward heads (RLHF).

The standard language-modeling head is a single linear projection from the final hidden state (typically 4096-d) to the vocabulary logits (typically 32K-d or 128K-d):

logits = hidden_state @ W_vocabᵀ // (1, 4096) · (4096, 128K) → (1, 128K)

This is the head every plain decoding step uses. The whole point of the Medusa / EAGLE / MTP family is to add more of these — each predicting a token at a different future offset — so that one trunk forward pass produces multiple speculative tokens. The viz below contrasts the four approaches; they differ in what the additional heads predict and when they were trained.
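
As a concrete picture of that projection, here is a minimal pure-Python sketch with toy dimensions (an 8-d hidden state and a 5-token vocabulary standing in for the 4096-d and 128K-d sizes above). The names `lm_head` and `argmax` are illustrative helpers, not taken from any particular library.

```python
import random

def lm_head(hidden_state, W_vocab):
    """Project one hidden state (length d) to vocabulary logits (length V).

    W_vocab has shape (V, d), so this computes hidden_state @ W_vocab^T.
    """
    return [sum(h * w for h, w in zip(hidden_state, row)) for row in W_vocab]

def argmax(values):
    """Index of the largest entry (first one wins on ties)."""
    return max(range(len(values)), key=values.__getitem__)

random.seed(0)
d, V = 8, 5  # toy sizes; the text's example uses d = 4096, V = 128K
hidden_state = [random.gauss(0, 1) for _ in range(d)]
W_vocab = [[random.gauss(0, 1) for _ in range(d)] for _ in range(V)]

logits = lm_head(hidden_state, W_vocab)  # one logit per vocabulary entry
next_token = argmax(logits)              # greedy decoding picks the largest logit
```

At the sizes quoted above (taking 128K as 128,000), the projection matrix alone holds 4096 × 128,000 ≈ 524 million weights.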

Figure (interactive HeadVariants viz): four panels showing standard speculative, Medusa, EAGLE, and MTP, each with its data flow and typical accept rate p. Panel shown: standard speculative (two-model), typical accept rate p ≈ 70%.

The classic Leviathan setup. A small *separate* model proposes K tokens; the big target model verifies all K in one forward pass via the rejection-sampling rule. Pros: any small model can draft. Cons: two model copies in memory; the draft model is "free" only if you have a smaller member of the same family already on disk.

Four approaches to "produce several tokens per verifier pass," ordered by how deeply the drafting is integrated into the target. The number that matters at the bottom is p, the per-token acceptance rate. The viz uses round illustrative figures: standard speculative decoding gets ~70% with a well-matched draft, Medusa pushes to ~80% by sharing the trunk, EAGLE goes to ~88% by predicting features instead of tokens, and MTP, when the model was trained for it, is shown at ~95%. Measured rates depend on model and workload (the MTP section below quotes 72–85%). High acceptance over a handful of drafted tokens is what yields the ~3× speedup llama.cpp's PR reports.
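
To see how p translates into tokens per verifier pass, the sketch below evaluates the standard expected-length formula from the speculative decoding literature, (1 − p^(K+1)) / (1 − p). It assumes each drafted token is accepted independently with the same probability p, which is a simplification; the function name is illustrative.

```python
def expected_tokens_per_round(p, K):
    """Expected tokens emitted per verifier pass when K tokens are drafted.

    Assumes each drafted token is accepted independently with probability p,
    and that the verifier always contributes one token of its own (the
    correction after a rejection, or a bonus token if all K are accepted).
    """
    if p >= 1.0:
        return float(K + 1)
    return (1 - p ** (K + 1)) / (1 - p)

for name, p in [("standard", 0.70), ("Medusa", 0.80), ("EAGLE", 0.88), ("MTP", 0.95)]:
    tokens = expected_tokens_per_round(p, K=4)
    print(f"{name:9s} p={p:.2f}  tokens/round at K=4: {tokens:.2f}")
```

This counts tokens per round, not wall-clock speedup: the cost of producing the draft has to be paid too, so the realised speedup is lower than these numbers.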

Medusa — multi-head on top of a frozen trunk

Cai, Li, Geng, Peng, Lee, Chen, Dao, “Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads” (ICML 2024). The simplest of the self-speculative family. Take a fully trained LLM. Freeze the trunk. Add K small MLP heads on top:

head_k(h) = ResMLP_k(h) @ W_vocabᵀ for k = 1..K // head k outputs logits for the token k positions in the future

Each Medusa head is a few-layer MLP plus the vocabulary projection. Training: fine-tune only the heads on the same next-token data, with each head’s loss computed against the token at the relevant future offset. Medusa-1 keeps the trunk frozen entirely (true drop-in addition). Medusa-2 unfreezes the trunk too, trading model purity for a few more points of acceptance.
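
A head of this shape can be sketched in a few lines of pure Python. This is a toy illustration under stated assumptions: one residual layer per head, a SiLU activation, and a single vocabulary projection shared by all heads (as in the formula above; a real implementation may give each head its own projection). All names and sizes are hypothetical.

```python
import math
import random

def silu(x):
    return x / (1.0 + math.exp(-x))

def matvec(W, v):
    """Multiply matrix W (rows x cols) by vector v (length cols)."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def res_mlp(h, W1):
    """One residual block: h + SiLU(W1 @ h). Output has the same width as h."""
    return [x + silu(z) for x, z in zip(h, matvec(W1, h))]

def medusa_logits(h, heads, W_vocab):
    """Run every head on the same trunk hidden state h.

    Returns one logit vector per head; entry k-1 is head k's guess at the
    token k positions ahead.
    """
    return [matvec(W_vocab, res_mlp(h, W1)) for W1 in heads]

def rand_matrix(rows, cols):
    return [[random.gauss(0, 0.1) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
d, V, K = 8, 5, 3                              # toy sizes
h = [random.gauss(0, 1) for _ in range(d)]     # stands in for the trunk output
W_vocab = rand_matrix(V, d)                    # shared projection (assumption)
heads = [rand_matrix(d, d) for _ in range(K)]  # the only weights trained with a frozen trunk

all_logits = medusa_logits(h, heads, W_vocab)  # K logit vectors from one hidden state
```

With an all-zero W1 the residual block returns h unchanged, so a head initialised that way starts out reproducing the base projection.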

The speculative loop becomes:

One trunk forward pass produces one hidden state.
All K heads run in parallel on that hidden state, producing K candidate logits.
Sample candidate tokens; build a tree of continuations (Medusa uses a clever tree-attention trick to verify multiple candidate sequences in one verifier pass).
Apply the rejection rule from §S.1.
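
The tree-attention idea in the third step can be sketched as a mask-building routine: each candidate token may attend to itself and its ancestors in the tree, but not to candidates on other branches. This is a minimal sketch assuming candidates are stored as a flat list with parent indices; that layout is illustrative and not necessarily how Medusa stores its tree.

```python
def tree_attention_mask(parents):
    """Build a boolean attention mask for a tree of candidate tokens.

    parents[i] is the index of node i's parent, or -1 for a root candidate.
    mask[i][j] is True when node i may attend to node j, i.e. when j is i
    itself or one of i's ancestors. Nodes on different branches never see
    each other, so several continuations can share one forward pass.
    """
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:          # walk from node i up to its root
            mask[i][j] = True
            j = parents[j]
    return mask

# Nodes 0 and 1 are two candidates for the next token; nodes 2 and 3 are
# candidates following node 0, and node 4 is a candidate following node 1.
parents = [-1, -1, 0, 0, 1]
mask = tree_attention_mask(parents)
```

In a real verifier pass every candidate would additionally attend to the full accepted prefix; only the candidate-to-candidate part of the mask is shown here.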

Reported speedup: 2.2× (Medusa-1) to 3.6× (Medusa-2) on Vicuna-7B/13B/33B. The trunk-frozen variant is the operationally important one because you can ship Medusa heads as a separate small file alongside an existing model.

EAGLE — speculate at the feature level

Li, Wei, Zhang, Zhang, “EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty” (ICML 2024). EAGLE’s key insight: predicting features is easier than predicting tokens. The penultimate-layer hidden state (“features”) evolves smoothly across positions — much more so than the discrete token distribution. So you can train a small autoregressive head that takes features as input and produces predicted features as output:

feature_head(h_t, t_t) → ĥ_{t+1} // predicted next feature
feature_head(ĥ_{t+1}, t̂_{t+1}) → ĥ_{t+2}
...
each step: ĥ_{t+k} → logits → t̂_{t+k} // decoded with the target's LM head

The head takes both a feature and a token as input; pairing the feature with an already-sampled token is the trick that resolves the feature-uncertainty problem the title alludes to. Each predicted feature is passed through the LM head to obtain a draft token, and that token feeds the next step together with the predicted feature.
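
The drafting loop can be sketched as follows. Both `feature_head` and `lm_head` are passed in as plain callables, and the toy versions below are hypothetical stand-ins chosen only so the example runs; greedy decoding is assumed for simplicity.

```python
def draft_with_feature_head(feature, token, feature_head, lm_head, K):
    """Draft K tokens by rolling a feature-level head forward.

    feature_head(feature, token) -> predicted next feature
    lm_head(feature)             -> logits over the vocabulary
    """
    drafted = []
    for _ in range(K):
        feature = feature_head(feature, token)  # predict the next feature
        logits = lm_head(feature)               # decode it with the LM head
        token = max(range(len(logits)), key=logits.__getitem__)
        drafted.append(token)                   # this token feeds the next step
    return drafted

# Toy stand-ins: a 3-d feature and a 3-token vocabulary.
def toy_feature_head(feature, token):
    rotated = feature[1:] + feature[:1]  # new list; the input is not mutated
    rotated[token] += 0.5                # nudge the slot named by the token
    return rotated

def toy_lm_head(feature):
    return list(feature)  # identity "projection": logits equal the feature

drafted = draft_with_feature_head(
    [1.0, 0.0, 0.0], 0, toy_feature_head, toy_lm_head, K=4
)
```

The loop is sequential by construction: step k cannot start until step k−1 has produced its feature and token, which is the structural difference from Medusa's parallel heads.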

Why this helps: features carry richer information than the argmax token. Predicting “what will the next feature be” is a smoother regression than “which of 128K tokens will be sampled next.” So acceptance probability rises: EAGLE typically reports ~85–90% per-token, vs Medusa’s ~80–82%. Li et al. measure a 2.7×–3.5× speedup on LLaMA2-Chat 70B, with EAGLE-2 (their follow-up) reaching 4× on some benchmarks.

The downside: EAGLE is structurally more complex. The head is autoregressive (you run it K times sequentially) rather than parallel like Medusa, so it costs more wall-clock per round — but the higher acceptance more than compensates. There’s also a careful uncertainty-quantification step EAGLE adds to know when to stop drafting.

Recall check (medium): EAGLE’s feature-prediction insight. Think, then check.

Two reasons.

(1) Features are smoother than tokens. The next token comes from an argmax over a 128K-vocab logit distribution, which is highly discrete and sensitive — a tiny perturbation in features can flip the argmax. The feature itself, on the other hand, evolves smoothly across positions. Predicting “the next feature vector” is a regression over a 4096-d real-valued space; predicting “the next token” is a classification over 128K classes that’s much harder to nail.

(2) Features carry more information than the discrete sampled token. A token captures only the argmax (or one sample) of the distribution; the feature carries the full state of “what the model is thinking.” Feeding features into the next prediction lets the speculative head use that richer signal.

Operationally: EAGLE pushes per-token p from ~0.82 (Medusa) to ~0.88, which at K = 4–6 is the difference between 2.5× and 3.5× wall-clock speedup. The cost is implementation complexity (the head is autoregressive instead of parallel).

↳ §S.2 EAGLE

MTP — train for it from the start

Gloeckle, Idrissi, Rozière, Lopez-Paz, Synnaeve, “Better & Faster Large Language Models via Multi-token Prediction” (ICML 2024). Subsequently adopted in DeepSeek-V3 (arXiv:2412.19437, 2024) and Qwen3.6. The most invasive variant — and the one llama.cpp’s PR #22673 implements: train the LLM with N output heads from day one.

Loss = Σ_{k=1..N} λ_k · CrossEntropy( head_k(trunk(x_{≤t})) , x_{t+k} )
where x_{≤t} is the prefix up to and including position t, and x_{t+k} is the token at offset k
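
As a sanity check on the indexing, here is a minimal pure-Python sketch of that weighted sum at a single position. The function names are illustrative, and the per-head logits are supplied directly rather than computed by a real trunk.

```python
import math

def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

def mtp_loss(head_logits, tokens, t, weights):
    """Weighted multi-token loss at one position t.

    head_logits[k-1] holds head k's logits, computed from the prefix that
    ends at position t (inclusive); its target is tokens[t + k]. Offsets
    that run past the end of the sequence are skipped.
    """
    total = 0.0
    for k, (logits, lam) in enumerate(zip(head_logits, weights), start=1):
        if t + k < len(tokens):
            total += lam * cross_entropy(logits, tokens[t + k])
    return total

tokens = [3, 1, 4, 1, 5]
uniform = [0.0] * 6  # a head with no preference among 6 vocabulary entries
loss = mtp_loss([uniform, uniform], tokens, t=2, weights=[1.0, 0.5])
```

With uniform logits each head contributes log 6 times its weight, so the example evaluates to 1.5 · log 6. A full training loss would average this quantity over all positions t.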

The model has N output heads, each predicting the token at offset k = 1, 2, …, N. All N losses are summed (with optional per-head weights λ_k) during training. At inference, the heads are used speculatively:

One trunk pass produces the hidden state.
The N MTP heads each emit a logit distribution for their target offset.
Sample N candidate tokens.
Verify them in one parallel forward pass over the full target model.
Accept/reject per §S.1’s Leviathan rule.
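
For greedy decoding the accept/reject step reduces to prefix matching: a drafted token is kept exactly when it equals the verifier's own argmax at that position. The sketch below covers only that greedy case (sampled decoding needs the probabilistic test from §S.1, not shown), and the function name is illustrative.

```python
def accept_greedy(draft_tokens, verifier_tokens):
    """Return the tokens to emit after one verification pass (greedy case).

    draft_tokens[i]    : token proposed for position i (length N)
    verifier_tokens[i] : the verifier's argmax for position i, given the
                         prefix plus draft_tokens[:i] (length N + 1; the
                         last entry is the bonus token that follows a
                         fully accepted draft)
    """
    emitted = []
    for draft, verified in zip(draft_tokens, verifier_tokens):
        if draft != verified:
            emitted.append(verified)  # first mismatch: keep the verifier's token and stop
            return emitted
        emitted.append(draft)
    emitted.append(verifier_tokens[len(draft_tokens)])  # all accepted: add the bonus token
    return emitted

partial = accept_greedy([5, 7, 9], [5, 7, 2, 4])  # third draft token rejected
full = accept_greedy([5, 7, 9], [5, 7, 9, 4])     # every draft token accepted
```

Every round emits at least one token, so the loop never does worse than plain decoding in tokens per verifier pass.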

MTP is positioned to reach the highest acceptance rates of any speculative scheme because the heads were trained jointly with the trunk, so their predictions are closely aligned with what the trunk would produce autoregressively. Reported figures vary: DeepSeek-V3 reports ~85% acceptance per head; llama.cpp’s PR #22673 reports 72.18% on Qwen3.6-27B with 3 MTP heads, giving a roughly 3× generation speedup (7.0 → 21.6 tokens/sec).

Glossary: MTP (core technique). Multi-Token Prediction. A training objective in which the model is trained with N parallel output heads, each predicting a token at a different future offset (1, 2, …, N) from the current position. Originally Gloeckle et al. (Meta, ICML 2024); adopted in DeepSeek-V3 (Dec 2024) and Qwen3.6. At inference, the same heads provide speculative draft tokens for free, with no separate draft model required. The auxiliary training objective also empirically improves the trunk's representations (the heads regularise toward forward-aware features), so MTP is one of the few inference accelerations that also makes the base model better. Then → now: in 2023 'multi-token prediction' meant n-gram-style heads added at inference (Medusa). In 2024 it became a training objective (Gloeckle), and by late 2024 / early 2025 the top-tier production models (DeepSeek-V3 671B, Qwen3.6) ship with MTP heads built in. llama.cpp's late-2025 integration (PR #22673) is the runtime catching up to the training innovation.

The unique advantage: MTP is the only scheme where the speculative capability is “free” in the most literal sense: no post-hoc head training, no extra files to distribute, no separate engineering effort to maintain head-trunk alignment. The model ships with the heads. As long as the runtime knows how to use them (which is what PR #22673 adds to llama.cpp), the speedup is automatic.

The training-time bonus. Gloeckle et al. observed that MTP training as an auxiliary objective improves base-model quality by ~12% on HumanEval and ~17% on MBPP for 13B models, with no increase in training time. The reason: predicting multiple future tokens forces the trunk’s representations to encode longer-range structure, which generalises beyond the speculative use case. This is the only inference accelerator in the speculative family that also makes the model better at its actual job.

Recall check (easy): why self-speculative matters operationally. Think, then check.

Maintaining two models. The drafter and verifier must be:

Trained, distributed, and version-controlled as a pair.
Both resident in memory (the small draft adds 0.5–10% to memory footprint depending on size ratio).
Compatible — same tokenizer, same special-token IDs, same calibration. A model and its draft drift apart over fine-tuning iterations.

Self-speculative variants (Medusa/EAGLE/MTP) put the “draft” into the target itself as auxiliary heads. One model file. One memory footprint. No drift. No separate distribution. The only operational delta is “the runtime needs to know how to call the extra heads,” which is exactly what PR #22673 adds to llama.cpp.

↳ §S.2 motivation

Comparison at a glance

Scheme	Draft model?	Heads trained how	Typical p	Speedup	Memory cost
Standard speculative	Yes, separate file	(whatever the draft is)	0.7	2–3×	+draft size
Medusa-1	No	Fine-tune heads on frozen target	0.80	2.2×	+K·MLP
Medusa-2	No	Fine-tune heads + unfreeze trunk	0.82	2.3–3.6×	+K·MLP
EAGLE	No	Train autoregressive feature head	0.88	2.7–3.5×	+autoregressive head
MTP	No	Trained jointly with main model	0.95	~3× (Qwen3.6)	+N·linear

The progression — standard → Medusa → EAGLE → MTP — is from “least training change” to “most training change.” Each level up requires more upfront work and gives higher per-token acceptance. MTP is the bleeding edge as of late 2025 because it’s the only scheme that can be a training objective from scratch (rather than a fine-tuning afterthought) — and that’s what gives it the alignment between trunk and heads that drives the acceptance rate up.

Recall check (hard): the structural reason MTP outperforms Medusa. Think, then check.

Architectural similarity: both have N small output heads on top of a shared trunk, each predicting a token at a different future offset. The heads themselves can be the same shape — a residual MLP plus a vocabulary projection.

Acceptance-rate gap (95% vs 82%): the difference is when the heads are trained.

Medusa trains the heads after the trunk is fully trained and frozen. The trunk’s hidden states were optimised for next-token prediction (one head, offset 1) — they don’t natively encode “what will the token 4 positions from now be.” The added heads are squeezing future predictions out of features that weren’t trained to carry that information. Acceptance is limited by the trunk’s lack of forward-awareness.

MTP trains the trunk and all heads jointly from the start. The trunk’s hidden states are explicitly optimised to support predictions at offsets 1, 2, …, N simultaneously. The features themselves encode multi-step forward structure. The result is heads whose predictions agree with what the trunk would autoregressively produce — much higher acceptance.

The Gloeckle paper also found this multi-head training objective improves the trunk itself (~12% HumanEval lift at 13B) because forcing the representations to be forward-aware regularises them. That’s the rare case where the inference accelerator is also a quality improvement; it’s what makes MTP the dominant approach for newly-trained models from late 2024 onward.

↳ §S.2 MTP vs Medusa

END OF §S.2 — Medusa, EAGLE, MTP.
Built: HeadVariants viz (click between standard / Medusa / EAGLE / MTP architectures; see the data flow and the acceptance rates compared). Three recall items: easy (why self-speculative matters operationally), medium (EAGLE’s feature-prediction insight), hard (the structural reason MTP outperforms Medusa).
Coming next: §S.3 — Inside llama.cpp, and what PR #22673 actually changes. Computation graphs, GGUF tensor layout, how “adding a head” lands in code.