Section S.3

Inside llama.cpp — and what PR #22673 changes

llama.cpp is the most-deployed local LLM runtime on Earth. It runs Qwen3, LLaMA, DeepSeek, Mistral, Gemma, and everything else on hardware ranging from a Raspberry Pi to an H100, in a single C++ codebase with no Python dependencies. To understand what “adding a new head” means operationally — and what PR #22673 actually changed — you have to understand its architecture: ggml (the underlying tensor library), GGUF (the model file format), and the build-then-execute pattern that lets the same model code run on CUDA, Metal, Vulkan, CPU AVX, NEON, and a dozen other backends with one source of truth. This section is a walking tour of that architecture, ending in the specific code shape PR #22673 takes to plug MTP into the per-microbatch inference loop.

The cast of characters

Component	What it is	Lives in
ggml	Low-level tensor library: tensors, ops, computation graph, backends	`ggml/`
GGUF	The binary model file: weights + metadata + architecture descriptor	one `.gguf` file
model loader	Reads GGUF metadata, maps tensors, picks the right architecture-specific code path	`src/llama-model-loader.*`
graph builder	Builds the computation graph for one forward pass, per model family	`src/llama-model.cpp` (`build_qwen3`, `build_llama`, …)
backend	CUDA / Metal / Vulkan / CPU: actually runs the ops	`ggml/src/ggml-*/`
sampler	Picks one token from the final logits (temperature, top-p, etc.)	`src/llama-sampling.cpp`

ggml library: A C tensor library (https://github.com/ggml-org/ggml) developed by Georgi Gerganov as part of the llama.cpp ecosystem. Defines tensors, operations on them, and a build-then-execute graph pattern. Has backends for CPU (with SIMD), CUDA, Metal, Vulkan, SYCL, HIP, and more. llama.cpp is the highest-profile consumer; ggml has spawned its own ecosystem (whisper.cpp, stable-diffusion.cpp, etc.).

GGUF format: GPT-Generated Unified Format, the binary file format that llama.cpp uses to load models. Contains: (1) a metadata key-value store describing architecture (qwen3 / llama / mistral / …), tokenizer config, quantization scheme, etc.; (2) a list of named tensors with their shape, dtype, and byte offset; (3) the raw tensor data (mmap-able). Self-describing: the same file works across hardware.

The flow is: GGUF file → loader → graph builder → backend executes graph → sampler → token. Then loop.

ggml’s build-then-execute pattern

The core abstraction. Most tensor libraries (NumPy, PyTorch in eager mode) execute operations immediately: write c = a @ b and the matmul runs as soon as that line executes. ggml works differently. First you build a graph of operations; only at the very end do you ask the backend to compute it.

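// ggml dims are (ne0, ne1) = (columns, rows); ggml_mul_mat needs both operands to share ne0 = K.
struct ggml_tensor * a    = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, K, M);
struct ggml_tensor * b    = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, K, N);
struct ggml_tensor * bias = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, M);   // broadcast across c
struct ggml_tensor * c = ggml_mul_mat(ctx, a, b);    // NO computation yet; c has ne = (M, N)
struct ggml_tensor * d = ggml_add(ctx, c, bias);     // STILL no computation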

struct ggml_cgraph * gf = ggml_new_graph(ctx);
ggml_build_forward_expand(gf, d);                    // graph is now built

ggml_backend_graph_compute(backend, gf);             // NOW everything runs

This separation is what lets the same code run on any hardware. The graph is a description of work; the backend chooses how to do it: which ops to fuse, how to lay out buffers in device memory, whether to dispatch to cuBLAS or to a hand-tuned NEON kernel, and so on.

For a transformer model, the build_* function constructs hundreds of these nodes — input embeddings, RoPE rotations, RMSNorms, attention QKV projections, attention softmax, MLP, residual adds, more RMSNorms, the lm_head — all wired together into one massive DAG. The backend then runs it. The viz below walks through the high-level shape:

After the PR: a hook is added inside process_ubatch that grabs the trunk's hidden states and feeds them to the MTP heads (which already live in the GGUF for Qwen3.6). The K speculative tokens are then verified in one additional parallel trunk pass and accepted or rejected per the Leviathan rule from §S.1. The generation-throughput gain on Qwen3.6-27B: 7.0 → 21.6 tokens/sec.

llama.cpp's standard inference pipeline (top three boxes are one-time setup; the bottom loop runs per token). PR #22673 adds the MTP hook inside the per-token loop — same trunk, same hidden states, plus K extra heads consulted before the next standard step.

mock_ggml.c (key part) C · 200-line mock-ggml graph + add-a-head

static void execute(Node* out) {
    if (!out->data) {
        if (out->a) execute(out->a);
        if (out->b) execute(out->b);
        execute_node(out);
    }
}

static void print_tensor(const char* label, Node* n) {
    printf("%s [%dx%d]:\n", label, n->rows, n->cols);
    for (int i = 0; i < n->rows; i++) {
        printf("  ");
        for (int j = 0; j < n->cols; j++) printf("%7.3f ", n->data[i * n->cols + j]);
        printf("\n");
    }
}

int main(void) {
    /* INPUTS — these are tensors the graph reads from. */
    float x_data[] = {1.0f, 2.0f, -1.0f, 0.5f};                            /* (1, 4) */
    float W_data[] = {0.1f, 0.2f, 0.3f,
                      0.4f, 0.5f, 0.6f,
                      0.7f, 0.8f, 0.9f,
                      1.0f, 1.1f, 1.2f};                                   /* (4, 3) */
    float b_data[] = {0.0f, 0.1f, -0.2f};                                  /* (1, 3) */

    Node* x = new_input("x", 1, 4, x_data);
    Node* W = new_input("W", 4, 3, W_data);
    Node* b = new_input("b", 1, 3, b_data);

    /* BUILD the graph for `out = matmul(x, W) + b` (the "lm_head") */
    Node* xW  = new_matmul("xW",  x, W);
    Node* out = new_add   ("out", xW, b);

    printf("=== single-head graph (analogue of lm_head) ===\n");
    printf("graph: input(x) → matmul → add(bias) → output\n\n");
    execute(out);
    print_tensor("out (standard lm_head)", out);

    /* NOW — what does "adding a head" mean?
     * Add a SECOND output that consumes the same hidden state x (or
     * an intermediate node) but multiplies against a different weight
     * matrix. This is what an MTP head does: same trunk output, new
     * weight matrix per offset. */

The kernel builds a single-head graph (analogue of lm_head), executes it, then extends the same graph with a second head consuming the same trunk output, which is exactly the operation pattern of adding an MTP head. The output (abridged):

=== single-head graph (analogue of lm_head) ===
out [1x3]:  0.700  1.050  1.000

=== two-head graph (analogue of trunk + MTP head 1) ===
out [1x3]:        0.700  1.050  1.000
mtp_head_1 [1x3]: 0.350  0.450  0.200

→ Both heads consumed the SAME trunk output x without re-running the trunk.

That last line is the entire engineering point. In a real transformer, the “trunk output” is the post-final-RMSNorm hidden state (~4096 floats per token). Both lm_head and the MTP heads consume it; the trunk runs once.

— think, then check —

ggml is build-then-execute: you first construct a graph of operations (no compute), then hand the whole graph to a backend that schedules and runs it. NumPy/eager-PyTorch are immediate — each operation computes the result as soon as you write it.

Portability: the graph is a hardware-independent description of work. The same graph builder runs on CUDA (which dispatches to cuBLAS / custom CUDA kernels), Metal (Apple’s GPU API, driven by hand-written Metal kernels), Vulkan, or CPU (with SIMD). The backend chooses the kernel implementation for each op. So model code is written once and the backend layer absorbs all hardware variation. This is why llama.cpp can support Qwen3, LLaMA, DeepSeek, etc. on every platform under the sun without per-platform model code, only per-platform backend code.

↳ §S.3 ggml model

What’s actually in a GGUF

A GGUF file is a self-describing binary container. Three regions, in order:

(1) GGUF header
    magic             = "GGUF"
    version           = 3
    tensor_count      = N
    metadata_kv_count = M

(2) Metadata key-value table (key → typed value)
    general.architecture         = "qwen3"
    general.name                 = "Qwen3.6"
    qwen3.embedding_length       = 4096
    qwen3.attention.head_count   = 32
    qwen3.block_count            = 80
    tokenizer.ggml.model         = "gpt2"
    tokenizer.ggml.tokens        = ["<|endoftext|>", "the", "of", ...]
    general.quantization_version = 2
    ... hundreds more keys ...

(3) Tensor descriptor table (name → shape + dtype + byte offset), followed by the tensor data
    token_embd.weight     (152064, 4096)  F16   offset=...
    blk.0.attn_q.weight   (4096, 4096)    Q4_K  offset=...
    blk.0.attn_k.weight   (4096, 1024)    Q4_K  offset=...
    ...
    output.weight         (4096, 152064)  Q6_K  offset=...
    mtp.0.head.weight     (4096, 152064)  Q6_K  offset=...   ← MTP heads here
    mtp.1.head.weight     (4096, 152064)  Q6_K  offset=...
    mtp.2.head.weight     (4096, 152064)  Q6_K  offset=...
    Tensor data: raw bytes for every tensor declared above, in offset order; mmap-able

Notice the last block of tensor descriptors: mtp.0.head.weight, mtp.1.head.weight, etc. These are the MTP heads, shipped inside the same GGUF as the main model. A model trained with MTP (Qwen3.6, DeepSeek-V3) already has these tensors; they were written by the GGUF converter. Before PR #22673, llama.cpp’s graph builder ignored them — they were unreferenced bytes in the file. The PR is, in essence, “consult these tensors when building the graph.”

Terminology drift · then → now

binary weight files (PyTorch .pt, TF SavedModel) → GGUF · self-describing tensor container

PyTorch’s .pt files are pickled Python objects — they need the original Python class definitions to load, which makes deployment across language ecosystems annoying. TensorFlow’s SavedModel is similar but with protobuf metadata. GGUF (introduced 2023, evolved through versions 1–3) is a pure binary format: a header, a typed key-value metadata table, and a flat list of named tensors with shape, dtype, and byte offset. No Python required to load it. Reading a GGUF is mmap + offset arithmetic; the tensor data can be read directly into the backend’s memory without copying. That zero-overhead loading is one of the reasons llama.cpp boots a 27B model in seconds rather than minutes.

— think, then check —

(1) Header: magic bytes “GGUF”, version, count of tensors and metadata entries.

(2) Metadata key-value table: typed (string, int, float, bool, array) key-value pairs describing the model architecture, tokenizer, quantization scheme, etc. Self-describing — anyone reading the file can determine its architecture without external schemas.

(3) Tensor table + raw data: each tensor has a name, shape, dtype, byte offset into the raw data region. The raw region is mmap-able — loading a model is mmap + pointer arithmetic, with zero parse overhead.

vs PyTorch .pt: a .pt is a pickled Python object. Loading requires (a) a Python runtime, (b) the model’s Python class definition available in scope, (c) a deserialization pass. None of these fits a C/C++ runtime or a memory-constrained system. GGUF needs nothing but a few freads and an mmap, which is why llama.cpp can boot a 27B model in seconds on consumer hardware.

↳ §S.3 GGUF

“Adding a head,” concretely

What does it take to add an MTP head to llama.cpp’s graph for Qwen3? Roughly three things:

(1) Loader: declare the new tensor names so they’re loaded. The model loader’s tensor-name registry gets entries like mtp.0.head.weight, mtp.1.head.weight, etc. Without this, the GGUF’s MTP tensors would be visible (the file still parses) but unbound to any model-level handle. This is a registration change — a few lines.

(2) Graph builder: extend build_qwen3 to add the head ops. After the trunk’s final hidden state node is produced, append N new matmul ops, one per MTP head, each consuming the same hidden state and producing its head’s logits. This is the architectural change — typically 20–50 lines of code per head type.

(3) Inference loop: route the new heads’ outputs to the speculative-decoding logic. When process_ubatch runs the graph, it now has access to N additional logit tensors. The speculative loop samples K tokens from them, runs a verifier pass, applies the Leviathan rejection rule, and decides how many tokens to commit. This is the wiring change — the largest piece, since it touches the sampling and context-management code.

PR #22673 does all three. The “hook” the PR description mentions is in step (3): a callback that the per-ubatch processing function invokes after the trunk’s hidden state is ready, allowing the MTP code to consume it. This callback design is what lets MTP run as an optional path — if MTP isn’t enabled, the hook is a no-op and the runtime behaves exactly like before.

Why this is a hook, not just an inline call. The PR’s author chose a hook rather than just inserting MTP code directly into the loop because llama.cpp has to support many model families (LLaMA, Qwen, Mistral, Gemma, DeepSeek, Phi, …) and only some of them ship with MTP heads. A hook lets the MTP path be opt-in per architecture: if Qwen3.6 has MTP, register the hook; if LLaMA-3 doesn’t, don’t. The trunk code stays clean. This is good engineering: extension by composition rather than by branching, the same pattern you’ll see in vLLM’s “spec_decoder” pluggable component or TensorRT-LLM’s “speculative” engine plugins.

What PR #22673 reports

Real numbers from the PR description, on Qwen3.6-27B with 3 MTP heads:

Metric	Without MTP	With MTP (3 heads)
Generation throughput	7.0 tokens/sec	21.6 tokens/sec
Acceptance rate	—	72.18%
Wall-clock for benchmark suite	201 s	83.8 s
Speedup	1×	~3×

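Plug those numbers into the §S.1 formula, speedup = (p(1−pᴷ)/(1−p) + 1) / (1 + K · c_d), with p ≈ 0.72, K = 3, and the very low c_d that MTP enables (heads run almost free on top of the trunk). The numerator is 1 + p + p² + p³ ≈ 2.61, so the predicted speedup tops out near 2.6× as c_d → 0. The measured generation speedup (21.6 / 7.0 ≈ 3.1×) comes in somewhat above that ceiling and the wall-clock speedup (201 / 83.8 ≈ 2.4×) somewhat below it, so the formula is best read as a ballpark check: it assumes every draft position is accepted independently with the same probability p, which a single aggregate acceptance rate does not guarantee.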

The PR also calls out the operational caveats — currently n_parallel = 1 (the speculative path doesn’t yet compose with batched serving), and there’s a small prompt-processing penalty from device-to-host hidden-state transfers when the MTP heads consume them. Both are listed for future optimisation. This is what shipping a real production-grade feature looks like: the algorithm works, the speedup is real, and there are a few rough edges to polish before it’s the default for everyone.

— think, then check —

Loader change (~10 LoC): register the GGUF tensor names mtp.0.head.weight, etc. so the existing model-loader machinery picks them up and binds them to ggml tensor handles. Pure plumbing.

Graph builder change (~50 LoC per model family): in build_qwen3(), after the trunk’s final hidden state node is produced, attach N additional matmul ops, one per MTP head, that consume the same hidden state and emit per-head logits. When MTP isn’t enabled at runtime, these nodes can simply be left out of the graph: ggml computes only the nodes reachable from the outputs passed to ggml_build_forward_expand, so heads that are never expanded cost nothing.

Inference loop change (the bulk of the PR): add a hook in process_ubatch that, after the trunk forward pass completes, optionally invokes the MTP path: sample K speculative tokens from the head outputs, run a verifier pass over them, apply the Leviathan rejection rule, commit the accepted prefix.

Why a hook rather than inline: llama.cpp supports dozens of model families (LLaMA, Qwen, Mistral, Gemma, DeepSeek, Phi, Mixtral, Command-R, …) and only some of them ship with MTP heads. Inlining MTP code into the per-ubatch path would (a) clutter the hot loop with conditional checks, (b) require every model family’s graph builder to know about MTP-specific concepts even when they don’t have MTP, and (c) entangle the speculative-decoding logic with the normal sampling path in a way that would make future variants (Medusa, EAGLE, n-gram speculation) hard to add. A hook makes MTP one component the runtime composes in optionally. The same pattern is used in vLLM’s spec_decoder plugin interface and TensorRT-LLM’s speculative-engine plugins: production runtimes need composition, not entangled monoliths.

The Ch.2 §1 connection: just as a matrix is a function (not a 2-D array), an MTP head is a function that consumes the trunk’s hidden state. The hook is the function-application point. The whole picture — trunk → hidden state → multiple downstream heads each consuming that state — is the “matrix as composed functions” pattern from Ch.2, made concrete in code: every head is a linear function (matmul against its weight matrix), and they all compose with the same trunk output, in parallel.

↳ §S.3 PR walk + Ch.2 §1

END OF SPECIAL CHAPTER S — Speculative decoding & multi-token prediction.

§S.1 (the autoregressive bottleneck + Leviathan rejection rule) · §S.2 (Medusa / EAGLE / MTP variants) · §S.3 (llama.cpp internals + PR #22673 walk).

Three kernels (spec_sim.c, mock_ggml.c, and the §S.1 throughput sim) and three Svelte visualisations grounding the whole picture. The chapter walks from “why is LLM inference slow?” to “what does PR #22673 actually change?” with both the math (acceptance probability, expected speedup, rejection-sampling correctness) and the engineering (ggml graph pattern, GGUF layout, hook composition).

This chapter sits in Part S — Spotlight: production accelerators, an out-of-sequence chapter inserted at the user’s request to track late-2025 production developments. We’ll fold this material back into Ch.22 (Inference at scale) when we get there in sequence, but the standalone version stays as a self-contained reference. Next: back to Part II, Ch.7 — Random projections and the Johnson-Lindenstrauss lemma.