Inside llama.cpp — and what PR #22673 changes
llama.cpp is one of the most widely deployed local LLM runtimes. It runs Qwen3, LLaMA, DeepSeek, Mistral, Gemma, and many other model families on hardware ranging from a Raspberry Pi to an H100, in a single C/C++ codebase with no Python dependency at inference time. To understand what “adding a new head” means operationally, and what PR #22673 actually changed, you have to understand its architecture: ggml (the underlying tensor library), GGUF (the model file format), and the build-then-execute pattern that lets the same model code run on CUDA, Metal, Vulkan, CPU AVX, NEON, and other backends with one source of truth. This section is a walking tour of that architecture, ending in the specific code shape PR #22673 takes to plug MTP into the per-microbatch inference loop.
The cast of characters
| Component | What it is | Lives in |
|---|---|---|
| ggml | Low-level tensor library: tensors, ops, computation graph, backends | ggml/ |
| GGUF | The binary model file: weights + metadata + architecture descriptor | one .gguf file |
| model loader | Reads GGUF metadata, maps tensors, picks the right architecture-specific code path | src/llama-model-loader.* |
| graph builder | Builds the computation graph for one forward pass, per model family | src/llama-model.cpp (build_qwen3, build_llama, …) |
| backend | CUDA / Metal / Vulkan / CPU — actually runs the ops | ggml/src/ggml-*/ |
| sampler | Picks one token from the final logits (temperature, top-p, etc.) | src/llama-sampling.cpp |
The flow is: GGUF file → loader → graph builder → backend executes graph → sampler → token. Then loop.
ggml’s build-then-execute pattern
The core abstraction. Most tensor libraries (NumPy, PyTorch in eager mode) execute operations immediately: write c = a @ b and the matmul runs on that line. ggml works differently. First you build a graph of operations; only at the very end do you ask the backend to compute it.
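// ggml lists the innermost (contiguous) dimension first: ne0, then ne1.
// ggml_mul_mat contracts over ne0, so both operands must share ne0 = K.
struct ggml_tensor * a = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, K, M);
struct ggml_tensor * b = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, K, N);
struct ggml_tensor * bias = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, M); // broadcast across N
struct ggml_tensor * c = ggml_mul_mat(ctx, a, b); // NO computation yet; c has ne = [M, N]
struct ggml_tensor * d = ggml_add(ctx, c, bias); // STILL no computation
struct ggml_cgraph * gf = ggml_new_graph(ctx);
ggml_build_forward_expand(gf, d); // graph is now built
ggml_backend_graph_compute(backend, gf); // NOW everything runs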
This separation is what lets the same code run on any hardware. The graph is a description of work; the backend chooses how to do it — fuse adjacent ops, allocate optimally for HBM layout, dispatch to cuBLAS or to a hand-tuned NEON kernel, etc.
For a transformer model, the build_* function constructs hundreds of these nodes — input embeddings, RoPE rotations, RMSNorms, attention QKV projections, attention softmax, MLP, residual adds, more RMSNorms, the lm_head — all wired together into one massive DAG. The backend then runs it. The viz below walks through the high-level shape:
[Interactive visualisation: the per-ubatch inference loop before and after PR #22673, showing where MTP plugs in.] The PR adds a hook in process_ubatch that grabs the trunk's hidden states and feeds them to the MTP heads (which already live in the GGUF for Qwen3.6). The K speculative tokens are then verified in one additional parallel trunk pass and accepted or rejected per the Leviathan rule from §S.1. The reported gain on Qwen3.6-27B: 7.0 → 21.6 tokens/sec.
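/* Excerpt from the mock-graph kernel. Node, new_input, new_matmul, new_add
 * and execute_node are defined earlier in the file and are not shown here. */

/* Evaluate a node lazily: compute its operands first (post-order), then the
 * node itself. A node whose data is already set is skipped. */
static void execute(Node* out) {
    if (!out->data) {
        if (out->a) execute(out->a);
        if (out->b) execute(out->b);
        execute_node(out);
    }
}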
static void print_tensor(const char* label, Node* n) {
    printf("%s [%dx%d]:\n", label, n->rows, n->cols);
    for (int i = 0; i < n->rows; i++) {
        printf(" ");
        for (int j = 0; j < n->cols; j++) printf("%7.3f ", n->data[i * n->cols + j]);
        printf("\n");
    }
}
int main(void) {
    /* INPUTS: these are tensors the graph reads from. */
    float x_data[] = {1.0f, 2.0f, -1.0f, 0.5f}; /* (1, 4) */
    float W_data[] = {0.1f, 0.2f, 0.3f,
                      0.4f, 0.5f, 0.6f,
                      0.7f, 0.8f, 0.9f,
                      1.0f, 1.1f, 1.2f}; /* (4, 3) */
    float b_data[] = {0.0f, 0.1f, -0.2f}; /* (1, 3) */
    Node* x = new_input("x", 1, 4, x_data);
    Node* W = new_input("W", 4, 3, W_data);
    Node* b = new_input("b", 1, 3, b_data);
    /* BUILD the graph for `out = matmul(x, W) + b` (the "lm_head") */
    Node* xW = new_matmul("xW", x, W);
    Node* out = new_add ("out", xW, b);
    printf("=== single-head graph (analogue of lm_head) ===\n");
    printf("graph: input(x) → matmul → add(bias) → output\n\n");
    execute(out);
    print_tensor("out (standard lm_head)", out);
    /* NOW: what does "adding a head" mean?
     * Add a SECOND output that consumes the same hidden state x (or
     * an intermediate node) but multiplies against a different weight
     * matrix. This is what an MTP head does: same trunk output, new
     * weight matrix per offset. */
    /* (the excerpt ends here; the rest of main, which adds the second head, is not shown) */
The kernel builds a single-head graph (analogue of lm_head), executes it, then extends the same graph with a second head consuming the same trunk output, which is exactly the operation pattern of adding an MTP head. The output:
=== single-head graph (analogue of lm_head) ===
out [1x3]: 0.700 1.050 1.000
=== two-head graph (analogue of trunk + MTP head 1) ===
out [1x3]: 0.700 1.050 1.000
mtp_head_1 [1x3]: 0.350 0.450 0.200
→ Both heads consumed the SAME trunk output x without re-running the trunk.
That last line is the entire engineering point. In a real transformer, the “trunk output” is the post-final-RMSNorm hidden state: one vector of hidden-size floats per token (4096 for many 7B-class models). Both lm_head and the MTP heads consume it; the trunk runs once.
ggml is build-then-execute: you first construct a graph of operations (no compute), then hand the whole graph to a backend that schedules and runs it. NumPy/eager-PyTorch are immediate — each operation computes the result as soon as you write it.
Portability: the graph is a hardware-independent description of work. The same graph builder runs on CUDA (which dispatches to cuBLAS or custom CUDA kernels), Metal (Apple’s GPU API), Vulkan, or CPU (with SIMD). The backend chooses the kernel implementation for each op, so model code is written once and the backend layer absorbs the hardware variation. This is why llama.cpp can support Qwen3, LLaMA, DeepSeek, etc. across all its supported platforms without per-platform model code, only per-platform backend code.
What’s actually in a GGUF
A GGUF file is a self-describing binary container. Three regions, in order:
(1) Header: magic bytes “GGUF”, version, count of tensors and metadata entries.
(2) Metadata key-value table: typed (string, int, float, bool, array) key-value pairs describing the model architecture, tokenizer, quantization scheme, etc. Self-describing: anyone reading the file can determine its architecture without external schemas.
(3) Tensor table + raw data: each tensor has a name, shape, dtype, and byte offset into the raw data region. The raw region is mmap-able: once the header and tables are read, loading the weights is mmap plus pointer arithmetic, with no per-tensor parsing or copying.
Among the tensor descriptors, notice the last block: mtp.0.head.weight, mtp.1.head.weight, etc. These are the MTP heads, shipped inside the same GGUF as the main model. A model trained with MTP (Qwen3.6, DeepSeek-V3) already has these tensors; they were written by the GGUF converter. Before PR #22673, llama.cpp’s graph builder ignored them, so they were unreferenced bytes in the file. The PR is, in essence, “consult these tensors when building the graph.”
vs PyTorch .pt: a .pt is a pickled Python object. Loading it requires (a) a Python runtime, (b) the model’s Python class definition available in scope, and (c) a deserialization pass, none of which is available in a plain C/C++ runtime or practical on memory-constrained systems. GGUF needs nothing but a few freads and an mmap, which is why llama.cpp can boot a 27B model in 2 seconds on consumer hardware.
“Adding a head,” concretely
What does it take to add an MTP head to llama.cpp’s graph for Qwen3? Roughly three things:
(1) Loader: declare the new tensor names so they’re loaded. The model loader’s tensor-name registry gets entries like mtp.0.head.weight, mtp.1.head.weight, etc. Without this, the GGUF’s MTP tensors would be visible (the file still parses) but unbound to any model-level handle. This is a registration change — a few lines.
(2) Graph builder: extend build_qwen3 to add the head ops. After the trunk’s final hidden state node is produced, append N new matmul ops, one per MTP head, each consuming the same hidden state and producing its head’s logits. This is the architectural change — typically 20–50 lines of code per head type.
(3) Inference loop: route the new heads’ outputs to the speculative-decoding logic. When process_ubatch runs the graph, it now has access to N additional logit tensors. The speculative loop samples K tokens from them, runs a verifier pass, applies the Leviathan rejection rule, and decides how many tokens to commit. This is the wiring change — the largest piece, since it touches the sampling and context-management code.
PR #22673 does all three. The “hook” the PR description mentions is in step (3): a callback that the per-ubatch processing function invokes after the trunk’s hidden state is ready, allowing the MTP code to consume it. This callback design is what lets MTP run as an optional path — if MTP isn’t enabled, the hook is a no-op and the runtime behaves exactly like before.
Why this is a hook, not just an inline call. The PR’s author chose a hook rather than inserting MTP code directly into the loop because llama.cpp has to support many model families (LLaMA, Qwen, Mistral, Gemma, DeepSeek, Phi, …) and only some of them ship with MTP heads. A hook makes the MTP path opt-in per architecture: if Qwen3.6 has MTP, register the hook; if LLaMA-3 doesn’t, don’t. The trunk code stays clean. This is good engineering: extension by composition rather than by branching, the same pattern you’ll see in vLLM’s “spec_decoder” pluggable component or TensorRT-LLM’s “speculative” engine plugins.
What PR #22673 reports
Real numbers from the PR description, on Qwen3.6-27B with 3 MTP heads:
| Metric | Without MTP | With MTP (3 heads) |
|---|---|---|
| Generation throughput | 7.0 tokens/sec | 21.6 tokens/sec |
| Acceptance rate | — | 72.18% |
| Wall-clock for benchmark suite | 201 s | 83.8 s |
| Speedup (generation throughput) | 1× | ~3× |
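Plug those numbers into the §S.1 formula: speedup = (p(1−pᴷ)/(1−p) + 1) / (1 + K · c_d). With p ≈ 0.72 and K = 3 the numerator is 1 + p + p² + p³ ≈ 2.61, so with the very low c_d that MTP enables (heads run almost-free on top of the trunk) the predicted speedup lands at roughly 2.5×–2.6×. The measurements bracket that prediction: generation throughput improves by 21.6 / 7.0 ≈ 3.1×, and benchmark wall-clock by 201 / 83.8 ≈ 2.4×.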
The PR also calls out the operational caveats — currently n_parallel = 1 (the speculative path doesn’t yet compose with batched serving), and there’s a small prompt-processing penalty from device-to-host hidden-state transfers when the MTP heads consume them. Both are listed for future optimisation. This is what shipping a real production-grade feature looks like: the algorithm works, the speedup is real, and there are a few rough edges to polish before it’s the default for everyone.
Loader change (~10 LoC): register the GGUF tensor names mtp.0.head.weight, etc. so the existing model-loader machinery picks them up and binds them to ggml tensor handles. Pure plumbing.
Graph builder change (~50 LoC per model family): in build_qwen3(), after the trunk’s final hidden state node is produced, attach N additional matmul ops — one per MTP head — that consume the same hidden state and emit per-head logits. The added graph nodes are no-ops if MTP isn’t enabled at runtime (their outputs are simply unread).
Inference loop change (the bulk of the PR): add a hook in process_ubatch that, after the trunk forward pass completes, optionally invokes the MTP path: sample K speculative tokens from the head outputs, run a verifier pass over them, apply the Leviathan rejection rule, commit the accepted prefix.
Why a hook rather than inline: llama.cpp supports many model families (LLaMA, Qwen, Mistral, Gemma, DeepSeek, Phi, Mixtral, Command-R, …) and only some of them ship with MTP heads. Inlining MTP code into the per-ubatch path would (a) clutter the hot loop with conditional checks, (b) require every model family’s graph builder to know about MTP-specific concepts even when they don’t have MTP, and (c) entangle the speculative-decoding logic with the normal sampling path in a way that would make future variants (Medusa, EAGLE, n-gram speculation) hard to add. A hook makes MTP one component the runtime composes in optionally. The same pattern is used in vLLM’s spec_decoder plugin interface and TensorRT-LLM’s speculative-engine plugins: production runtimes need composition, not entangled monoliths.
The Ch.2 §1 connection: just as a matrix is a function (not a 2-D array), an MTP head is a function that consumes the trunk’s hidden state. The hook is the function-application point. The whole picture — trunk → hidden state → multiple downstream heads each consuming that state — is the “matrix as composed functions” pattern from Ch.2, made concrete in code: every head is a linear function (matmul against its weight matrix), and they all compose with the same trunk output, in parallel.
END OF SPECIAL CHAPTER S — Speculative decoding & multi-token prediction.
§S.1 (the autoregressive bottleneck + Leviathan rejection rule) · §S.2 (Medusa / EAGLE / MTP variants) · §S.3 (llama.cpp internals + PR #22673 walk).
Three kernels (spec_sim.c, mock_ggml.c, and the §S.1 throughput sim) and three Svelte visualisations ground the whole picture. The chapter walks from “why is LLM inference slow?” to “what does PR #22673 actually change?” with both the math (acceptance probability, expected speedup, rejection-sampling correctness) and the engineering (ggml graph pattern, GGUF layout, hook composition).
This chapter sits in Part S — Spotlight: production accelerators, an out-of-sequence chapter inserted at the user’s request to track late-2025 production developments. We’ll fold this material back into Ch.22 (Inference at scale) when we get there in sequence, but the standalone version stays as a self-contained reference. Next: back to Part II, Ch.7 — Random projections and the Johnson-Lindenstrauss lemma.