The decoder-only stack — Llama 2 7B, end to end
Ch.11-14 assembled every piece of a transformer block: embeddings, attention with FlashAttention, normalisation, residual stream. Time to put them together. This section walks the forward pass of a real decoder-only LLM — Llama 2 7B, the canonical reference of the post-Chinchilla era — token by token, layer by layer. We’ll trace shapes, count parameters, and end with logits ready for sampling. Everything is shape-checked against the public weights.
Llama 2 7B at a glance
SwiGLU is the FFN activation Llama uses instead of GELU. It’s gated: one branch goes through Swish, the other is a pure linear projection, and the two are multiplied elementwise. Shazeer 2020, “GLU Variants Improve Transformer”, found that gated variants (SwiGLU and GEGLU in particular) outperform ungated ReLU and GELU FFNs at equal parameter count; many recent LLMs, including Llama, Mistral and PaLM, use SwiGLU.
The full forward pass
For a single sequence of tokens t_1, t_2, …, t_N, the model produces logits over the vocabulary for the next token at every position. Here’s the complete shape-traced forward pass:
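A minimal sketch of that shape trace in Python. It tracks tensor shapes only, not values. The dimensions (d_model = 4096, 32 layers, 32 heads, F = 11008, vocabulary 32000) are assumed from the released Llama 2 7B configuration, and the class and function names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LlamaShape:
    # Assumed Llama 2 7B hyperparameters; field names are illustrative.
    vocab: int = 32000
    d_model: int = 4096
    n_layers: int = 32
    n_heads: int = 32
    d_ffn: int = 11008


def trace_shapes(n_tokens: int, cfg: LlamaShape = LlamaShape()):
    """Return (step, shape) pairs for one forward pass over n_tokens tokens."""
    N, d = n_tokens, cfg.d_model
    d_head = d // cfg.n_heads
    trace = [
        ("token ids", (N,)),
        ("embedding lookup", (N, d)),
    ]
    for layer in range(cfg.n_layers):
        trace += [
            (f"block {layer}: RMSNorm", (N, d)),
            (f"block {layer}: Q, K, V per head (RoPE on Q, K)", (cfg.n_heads, N, d_head)),
            # Logical shape only: a fused kernel such as FlashAttention
            # never materialises the full score matrix.
            (f"block {layer}: causal attention scores", (cfg.n_heads, N, N)),
            (f"block {layer}: attention output + residual", (N, d)),
            (f"block {layer}: RMSNorm", (N, d)),
            (f"block {layer}: SwiGLU hidden (gate * up)", (N, cfg.d_ffn)),
            (f"block {layer}: FFN output + residual", (N, d)),
        ]
    trace += [
        ("final RMSNorm", (N, d)),
        ("logits", (N, cfg.vocab)),
    ]
    return trace


if __name__ == "__main__":
    # Print the embedding steps and the first block for a 5-token input.
    for step, shape in trace_shapes(n_tokens=5)[:9]:
        print(f"{step:55s} {shape}")
```

The residual stream keeps shape (N, d_model) from the embedding lookup to the final norm; only the attention scores and the FFN hidden activation take other shapes along the way.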
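That’s the whole architecture. Every component on the right-hand side of every line is in Ch.11-14. The transformer is exactly this: an embedding lookup, 32 blocks that each apply an attention sublayer and then an FFN sublayer (both with pre-RMSNorm and a residual addition), a final norm, and an unembedding projection to vocabulary logits. Note that Llama 2 7B does not tie this projection to the input embedding; the released checkpoint stores a separate V × d output matrix.
Tied embeddings: the (V × d) embedding table E is reused (transposed) for the output projection, so the vocab-projection layer adds no parameters of its own; logits = X · Eᵀ is just a matmul against the existing embedding. GPT-2, for example, does this. Llama 2 7B does not: it learns a separate output matrix W_out, so logits = X · W_outᵀ.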
Block input x. Block output z (passed to next block).
- x_norm = RMSNorm(x) — pre-norm before attention
- y = x + Attention(x_norm) — attention sublayer with residual add
- y_norm = RMSNorm(y) — pre-norm before FFN
- z = y + FFN(y_norm) — FFN sublayer with residual add
Two RMSNorms (one before each sublayer); two residual additions (one after each); no extra norm at the end of the block, because the next block’s pre-norm handles it (and a final RMSNorm follows the last block). The attention sublayer is multi-head with RoPE applied to Q and K, and is typically computed with a FlashAttention kernel. The FFN sublayer is SwiGLU.
Shape preserved throughout: x, x_norm, y, y_norm, z all have shape (N, d_model). The residual stream never changes shape; it just gets additive contributions from each sublayer.
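The four steps above can be sketched in plain Python. This is a toy illustration, not Llama’s implementation: `attention` and `ffn` are placeholder callables supplied by the caller, the RMSNorm epsilon of 1e-5 is an assumed value, and vectors are plain lists so the example runs without dependencies.

```python
import math
from typing import Callable, List

Vector = List[float]
Sequence = List[Vector]  # shape (N, d_model)


def rms_norm(x: Vector, weight: Vector, eps: float = 1e-5) -> Vector:
    """Divide x by its root-mean-square, then scale by a learned gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for v, w in zip(x, weight)]


def add(a: Sequence, b: Sequence) -> Sequence:
    """Elementwise residual addition of two (N, d_model) sequences."""
    return [[u + v for u, v in zip(ra, rb)] for ra, rb in zip(a, b)]


def decoder_block(
    x: Sequence,
    attn_gain: Vector,
    ffn_gain: Vector,
    attention: Callable[[Sequence], Sequence],  # hypothetical sublayer
    ffn: Callable[[Sequence], Sequence],        # hypothetical sublayer
) -> Sequence:
    x_norm = [rms_norm(row, attn_gain) for row in x]  # pre-norm before attention
    y = add(x, attention(x_norm))                     # attention sublayer + residual
    y_norm = [rms_norm(row, ffn_gain) for row in y]   # pre-norm before FFN
    z = add(y, ffn(y_norm))                           # FFN sublayer + residual
    return z


if __name__ == "__main__":
    x = [[1.0, 2.0, 3.0, 4.0], [0.5, -0.5, 0.0, 1.0]]
    gain = [1.0] * 4
    zeros = lambda seq: [[0.0] * len(row) for row in seq]
    # With sublayers that contribute nothing, the block is the identity.
    print(decoder_block(x, gain, gain, zeros, zeros) == x)
```

Because each sublayer only adds to the residual stream, a sublayer that outputs zeros leaves the input untouched, which is what the usage example checks.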
The parameter count
Let’s count Llama 2 7B’s parameters from scratch to verify the math:
The FFN is bigger than attention per block (135M vs 67M). For Llama 2 7B, FFN parameters are about 67% of the block and attention about 33%. This is typical: modern transformers spend most of their parameters in the FFN, not attention.
Per block, with d = 4096, F = 11008:
- Attention: W_Q, W_K, W_V, W_O = 4 · d² = 4 · 16.78M = 67M
- FFN: W_gate (d·F) + W_up (d·F) + W_down (F·d) = 3 · d·F = 3 · 45.1M = 135M
Per block: 67M + 135M = 202M. Ratio: FFN / total = 135/202 = 67%, attention = 33%.
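The per-block figures extend to a whole-model count in a few lines of Python. The sketch assumes no bias terms, no learned parameters in RoPE, one gain vector per RMSNorm, a 32000-token vocabulary, and a separate (untied) output matrix as in the released checkpoint; the function name and dictionary keys are illustrative.

```python
def llama_param_count(vocab=32000, d=4096, n_layers=32, d_ffn=11008,
                      tied_output=False):
    """Count parameters for a Llama-style decoder with full multi-head attention."""
    attention = 4 * d * d            # W_Q, W_K, W_V, W_O
    ffn = 3 * d * d_ffn              # W_gate, W_up, W_down
    norms = 2 * d                    # two RMSNorm gain vectors per block
    per_block = attention + ffn + norms

    embedding = vocab * d
    output = 0 if tied_output else vocab * d   # separate unembedding matrix
    final_norm = d

    total = embedding + n_layers * per_block + final_norm + output
    return {
        "attention_per_block": attention,
        "ffn_per_block": ffn,
        "per_block": per_block,
        "embedding": embedding,
        "output": output,
        "total": total,
    }


if __name__ == "__main__":
    for name, value in llama_param_count().items():
        print(f"{name:22s} {value:>14,d}")
```

Under these assumptions the total comes to 6,738,415,616, about 6.74B, which is where the “7B” label comes from. Passing `tied_output=True` removes one 131M-parameter matrix from the count.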
Why FFN dominates:
Attention has 4 parameter matrices, each d × d. The SwiGLU FFN has 3 parameter matrices, each d × F (or F × d), with F ≈ 2.7 · d. So the FFN’s parameter count is 3 · d · 2.7d ≈ 8.1 · d² against attention’s 4 · d², about 2× as many.
The structural reason: F is the widest dimension in the model, the opposite of a bottleneck. Each token’s residual-stream activation is expanded into a wider space (F ≈ 2.7 · d), where the FFN does its nonlinear computation, and is then projected back to d. This “expand → nonlinearity → contract” pattern is where most of the model’s representational capacity lives. Attention is for moving information BETWEEN tokens; FFN is for computation WITHIN tokens. The “within-token computation” budget is bigger.
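This proportion is roughly constant across transformers with full multi-head attention: GPT-3 175B, with a two-matrix GELU FFN at F = 4d, has 8 · d² FFN parameters against 4 · d² for attention, the same 67/33 split. The ratio depends on F/d and on how many key/value heads there are, not on depth. Models that use grouped-query attention shrink W_K and W_V and so tilt further toward the FFN; Mistral 7B (8 KV heads, F = 14336) comes out at roughly 81/19.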
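A short sketch for checking the FFN share of a block under different configurations. The hyperparameters are assumptions recalled from the published model configs and should be verified against the model cards. Only weight matrices are counted, with biases and norm gains ignored, and the function name is illustrative.

```python
def ffn_share(d, d_ffn, n_heads, n_kv_heads, ffn_matrices):
    """Fraction of a block's weight-matrix parameters that sit in the FFN."""
    d_head = d // n_heads
    # W_Q and W_O are d x d; W_K and W_V shrink when KV heads are shared (GQA).
    attention = 2 * d * d + 2 * d * (n_kv_heads * d_head)
    ffn = ffn_matrices * d * d_ffn
    return ffn / (attention + ffn)


# Assumed hyperparameters; check them against the published configs.
CONFIGS = {
    "Llama 2 7B": dict(d=4096, d_ffn=11008, n_heads=32, n_kv_heads=32, ffn_matrices=3),
    "Mistral 7B": dict(d=4096, d_ffn=14336, n_heads=32, n_kv_heads=8, ffn_matrices=3),
    "GPT-3 175B": dict(d=12288, d_ffn=4 * 12288, n_heads=96, n_kv_heads=96, ffn_matrices=2),
}

if __name__ == "__main__":
    for name, cfg in CONFIGS.items():
        print(f"{name}: FFN share = {ffn_share(**cfg):.1%}")
```

With these inputs the two full-attention models land at about two thirds, while the grouped-query configuration lands near 81%, because its key and value projections are a quarter of the size.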
What you actually compute at inference (per generated token)
After the model is trained, generating a token from a prompt:
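- Prefill the prompt: run the forward pass through all N prompt tokens at once. The KV cache fills with (K, V) at every position. Cost per layer: O(N² · d) for attention; O(N · (d² + d · F)) for the matmuls.
- Decode one token: run the forward pass for ONLY the newest position. Q, K, V are computed for one new token; K and V are appended to the cache; attention is the new Q against ALL cached K. Cost per token, per layer: O(N · d) attention + O(d² + d · F) matmuls.
This is why prefill dominates total cost when the generation is short relative to the prompt: it does roughly N times the arithmetic of a single decode step. The two phases also stress the hardware differently. Prefill processes N tokens in parallel and is compute-bound, while each decode step rereads the weights and the KV cache to produce one token and is usually memory-bandwidth-bound. Most inference engines (vLLM, TGI, TensorRT-LLM) optimise these two phases differently.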
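To put rough numbers on the two phases, here is a back-of-the-envelope estimate in Python. It assumes 2 FLOPs per multiply-add, ignores norms, RoPE and softmax, uses Llama 2 7B dimensions, and assumes fp16 storage for the cache. The function names are illustrative and the results are order-of-magnitude estimates, not measurements.

```python
def per_layer_flops(n_ctx, d=4096, d_ffn=11008):
    """Approximate FLOPs for one layer; 2 FLOPs per multiply-add."""
    matmul_per_token = 2 * (4 * d * d + 3 * d * d_ffn)   # Q, K, V, O + SwiGLU FFN
    # Prefill: token i attends to itself and the i - 1 tokens before it (causal),
    # paying 2*d*i FLOPs for Q.K^T and 2*d*i for the weighted sum over V.
    prefill_attention = 4 * d * n_ctx * (n_ctx + 1) // 2
    # Decode: one new query against n_ctx cached positions (its own included).
    decode_attention = 4 * d * n_ctx
    return {
        "prefill": n_ctx * matmul_per_token + prefill_attention,
        "decode_step": matmul_per_token + decode_attention,
    }


def kv_cache_bytes(n_ctx, d=4096, n_layers=32, bytes_per_value=2):
    """Size of the K and V caches for one sequence (fp16 assumed by default)."""
    return 2 * n_layers * n_ctx * d * bytes_per_value


if __name__ == "__main__":
    flops = per_layer_flops(n_ctx=1024)
    print(f"prefill / decode step: {flops['prefill'] / flops['decode_step']:.0f}x")
    print(f"KV cache per token:    {kv_cache_bytes(1) // 1024} KiB")
    print(f"KV cache at 4096 ctx:  {kv_cache_bytes(4096) / 2**30:.1f} GiB")
```

At these dimensions the cache costs 512 KiB per token, so a full 4096-token context holds 2 GiB of keys and values, all of which every decode step has to read.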
Structural difference:
GELU FFN: h = GELU(x · W₁); out = h · W₂. Two matrices, ungated.
SwiGLU FFN: gate = Swish(x · W_gate); up = x · W_up; h = gate ⊙ up; out = h · W_down. Three matrices, gated.
The “gated” part is the difference. SwiGLU computes TWO separate projections of x (Swish-activated and linear), multiplies them elementwise, then projects back. This is a “multiplicative” interaction that GELU lacks.
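The two FFN variants side by side, as a dependency-free Python sketch operating on a single token vector. Matrices are nested lists in (in_dim, out_dim) layout, the exact erf-based form of GELU is used, and biases are omitted; the helper names are illustrative.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]  # row-major, shape (in_dim, out_dim)


def matvec(x: Vector, w: Matrix) -> Vector:
    """Compute x @ w for a row vector x."""
    return [sum(xi * w[i][j] for i, xi in enumerate(x)) for j in range(len(w[0]))]


def swish(v: float) -> float:
    return v / (1.0 + math.exp(-v))      # v * sigmoid(v), also called SiLU


def gelu(v: float) -> float:
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def gelu_ffn(x: Vector, w1: Matrix, w2: Matrix) -> Vector:
    h = [gelu(v) for v in matvec(x, w1)]          # one projection, ungated
    return matvec(h, w2)


def swiglu_ffn(x: Vector, w_gate: Matrix, w_up: Matrix, w_down: Matrix) -> Vector:
    gate = [swish(v) for v in matvec(x, w_gate)]  # Swish-activated branch
    up = matvec(x, w_up)                          # linear branch
    h = [g * u for g, u in zip(gate, up)]         # elementwise gating
    return matvec(h, w_down)


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    x = [1.0, 2.0]
    print(gelu_ffn(x, identity, identity))
    print(swiglu_ffn(x, identity, identity, identity))
```

With identity weights the SwiGLU output is swish(x_i) · x_i per dimension, which makes the multiplicative term easy to see: the input appears twice in every hidden unit.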
Why multiplicative interactions matter:
In a GELU FFN, each hidden unit is a GELU applied to one linear function of x, so the activation acts as a “soft switch” per hidden dimension. In SwiGLU, each hidden unit is the product of two different linear functions of x, one of them passed through Swish, so the layer contains explicit second-order interactions between input features. This lets one layer represent conjunctive “feature A AND feature B” relationships directly, where an ungated FFN has to approximate them, typically with more units or more depth.
Empirically (Shazeer 2020): at the same parameter count, SwiGLU reaches lower perplexity than a GELU FFN by a small but consistent margin. “Same parameter count” is the catch: SwiGLU has 3 matrices where GELU has 2, so F (the FFN hidden dim) is scaled down by about 2/3 to compensate. The win comes from the gated structure, not from extra parameters.
For Llama 2 7B: F = 11008, which is (2/3) · 4d ≈ 10923 rounded up to a multiple of 256. With a GELU FFN at the same parameter count, they’d have set F ≈ 16384 = 4d, since 2 matrices at width 4d cost about the same as 3 matrices at width (2/3) · 4d. The parameter count is roughly the same either way; the SwiGLU choice trades some FFN width for gated structure, and it paid off.
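The sizing rule can be written out in a few lines of Python. This sketch follows the “two thirds of 4d, rounded up” recipe; the rounding granularity of 256 is assumed to match the 7B configuration, and the function name is illustrative.

```python
def llama_ffn_dim(d_model: int, multiple_of: int = 256) -> int:
    """Hidden width of a SwiGLU FFN sized to roughly match a 4*d GELU FFN."""
    hidden = int(2 * (4 * d_model) / 3)          # 2/3 of the usual 4*d
    # Round up to the next multiple for hardware-friendly shapes.
    return multiple_of * ((hidden + multiple_of - 1) // multiple_of)


if __name__ == "__main__":
    d = 4096
    f_swiglu = llama_ffn_dim(d)
    print(f_swiglu)               # 11008
    print(3 * d * f_swiglu)       # SwiGLU FFN parameters: 135266304
    print(2 * d * 4 * d)          # GELU FFN with F = 4d:  134217728
```

The two parameter counts differ by under 1%, and the gap comes entirely from rounding F up to a multiple of 256.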
Why a small win is worth the complexity:
At LLM scale, even a small perplexity gain at fixed parameter count compounds into noticeable downstream task improvement. Compare the other way to buy quality: the same model at 1.05B instead of 1B parameters costs ~5% more to train and run. If SwiGLU’s gain at parameter parity is worth what 1-2% more parameters would buy, you get that improvement without paying for the extra parameters. Worth it.
This is a recurring pattern in modern architecture: incremental gated/multiplicative variants of the FFN (SwiGLU, GeGLU, ReGLU) each give small but compounding wins; the field has converged on SwiGLU as the sweet spot of complexity vs gain.
Next: §15.2 — Encoder vs decoder vs encoder-decoder. Why “predict next token” turned out to be a universal task, and why every modern frontier model is decoder-only.