Encoder vs decoder vs encoder-decoder
2018–2022 was a three-way race for the dominant transformer architecture. BERT (encoder-only, bidirectional attention, trained with masked language modelling) was the giant of 2018–2020 for classification, QA, and feature extraction. T5 (encoder-decoder, with a separate encoder and an autoregressive decoder, trained with span corruption) was Google’s bet on a unified text-to-text framework. GPT (decoder-only with causal attention, trained on next-token prediction) was OpenAI’s bet that everyone else was wrong. By 2023, the race was over: essentially every frontier model is decoder-only. This section explains why.
The three architectures
The causal mask is the load-bearing detail. Without it, “predict the next token” is trivial: the model sees the next token in the input and outputs it. With the lower-triangular mask, each position is forced to predict based only on what came before. This single mask turns one forward pass on a length-N sequence into N independent next-token-prediction tasks, all learned in parallel.
The mask: a constant lower-triangular matrix M added to the attention score matrix S BEFORE the softmax. For position i attending to position j, M[i,j] = 0 if j ≤ i (past or current position) and M[i,j] = −∞ if j > i (future position). Building the mask costs almost nothing. It removes roughly half of the N × N attention pattern: the strict upper triangle is masked out, so computing those scores is wasted work that optimised kernels can skip.
Effect: softmax applied to a row with some −∞ entries produces zeros at those positions. So position i’s attention weight on position j is 0 whenever j > i — position i cannot “see” any future token.
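The mask-then-softmax step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not an optimised kernel; the function names (causal_mask, softmax, masked_attention_weights) and the toy all-zero score matrix are assumptions made for the example.

```python
import math

def causal_mask(n):
    """Return an n x n additive mask: 0.0 where j <= i, -inf where j > i."""
    return [[0.0 if j <= i else -math.inf for j in range(n)] for i in range(n)]

def softmax(row):
    """Numerically stable softmax over one row; -inf entries become exactly 0."""
    m = max(row)                      # the diagonal is never masked, so m is finite
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def masked_attention_weights(scores):
    """Add the causal mask to the raw scores, then apply softmax row by row."""
    n = len(scores)
    mask = causal_mask(n)
    return [softmax([s + m for s, m in zip(scores[i], mask[i])]) for i in range(n)]

# Toy 4 x 4 score matrix (hypothetical values): all zeros, so each row ends up
# with uniform attention over the positions it is allowed to see.
scores = [[0.0] * 4 for _ in range(4)]
weights = masked_attention_weights(scores)
for row in weights:
    print([round(w, 3) for w in row])
```

Every entry above the diagonal comes out as exactly zero, and each row still sums to 1 over the visible positions.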
Why needed for next-token prediction:
The training objective is “given tokens 1..i, predict token i+1.” If position i could see token i+1 in its attention, the task is trivial (the answer is in the input). The causal mask FORCES position i to predict from positions 1..i only, making the task non-trivial and forcing the model to learn meaningful representations.
N training signals in one forward pass:
At position 1, the model predicts token 2 from token 1’s embedding.
At position 2, the model predicts token 3 from tokens 1..2.
At position i, the model predicts token i+1 from tokens 1..i.
One forward pass through the decoder produces N hidden states (one per position); each is followed by an unembedding to logits and a cross-entropy loss against the next token. Loss = mean of N per-position losses. The causal mask ensures these N predictions are independent (each only depends on the past), so they can all be computed in parallel within one forward pass.
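As a rough sketch of the per-position loss, the snippet below takes one row of logits per position and scores it against the token that follows. It assumes the sequence is held as N token ids, so the last position has no target and the mean runs over N − 1 terms; next_token_loss and the toy inputs are hypothetical names and values chosen for illustration.

```python
import math

def next_token_loss(logits, tokens):
    """Mean cross-entropy of predicting tokens[i + 1] from the logits at position i.

    logits: list of N rows, each a list of V unnormalised scores.
    tokens: list of N integer token ids.
    Returns (mean loss, number of supervised positions).
    """
    losses = []
    for i in range(len(tokens) - 1):              # the last position has no target
        row = logits[i]
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))  # log-sum-exp
        losses.append(log_z - row[tokens[i + 1]])  # -log softmax(row)[target]
    return sum(losses) / len(losses), len(losses)

# Toy example: 4 tokens, vocabulary of 4, uniform logits at every position.
tokens = [2, 0, 3, 1]
vocab = 4
uniform_logits = [[0.0] * vocab for _ in tokens]
loss, n_terms = next_token_loss(uniform_logits, tokens)
print(n_terms, round(loss, 4))
```

With uniform logits every position contributes log(V), which is a convenient sanity check for an untrained model.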
This is why decoder-only transformers are so training-efficient: a length-N sequence gives N supervised examples for free. BERT only gets ~0.15·N (the 15% masking rate). T5 gets a similar fraction. Decoder-only models therefore have roughly 6.7× (1 / 0.15) higher training-signal density per token of compute.
The training objectives, compared
This is the most underappreciated reason decoder-only won: training-signal density. For the same training data and compute, a decoder-only model gets ~6.7× more supervised signal per token than BERT or T5. The empirical consequence: at any compute budget, decoder-only models reach lower perplexity faster.
BERT’s training signal:
BERT randomly masks 15% of input tokens. For each masked position, the model produces a prediction and computes a cross-entropy loss. Tokens that aren’t masked produce no loss (their representations are computed, but no supervision).
So per N input tokens, BERT gets ~0.15 · N supervised predictions.
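To make the 15% figure concrete, here is a simplified sketch of masked-language-model input preparation. It is an illustration under stated assumptions, not BERT's exact recipe: real BERT also replaces some chosen tokens with random tokens or leaves them unchanged, and mask_for_mlm, its parameters, and the toy tokens are hypothetical.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_for_mlm(tokens, mask_rate=0.15, seed=0):
    """Pick ~mask_rate of the positions, mask them, and keep labels only there."""
    rng = random.Random(seed)
    k = max(1, round(mask_rate * len(tokens)))
    chosen = set(rng.sample(range(len(tokens)), k))
    inputs = [MASK_TOKEN if i in chosen else t for i, t in enumerate(tokens)]
    # None marks an unsupervised position: it is encoded but produces no loss.
    labels = [t if i in chosen else None for i, t in enumerate(tokens)]
    return inputs, labels

tokens = [f"tok{i}" for i in range(20)]
inputs, labels = mask_for_mlm(tokens)
n_supervised = sum(label is not None for label in labels)
print(n_supervised, "of", len(tokens), "positions contribute to the loss")
```

For 20 input tokens only 3 positions carry a label; the other 17 are computed but never scored.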
GPT’s training signal:
GPT predicts the next token at EVERY position. The causal mask ensures each position only sees prior tokens, so the prediction at each position is non-trivial. Per N input tokens, GPT gets N − 1 ≈ N supervised predictions (every position except the last has a “next token” target).
The ratio: N / (0.15 · N) ≈ 6.7. GPT gets ~6.7× more loss-contributing predictions per token of input data.
Consequence for compute efficiency:
To learn a useful representation, the model needs a certain TOTAL number of supervised predictions. BERT needs 6.7× more input tokens to get the same total signal that GPT extracts from 1× input tokens. Equivalently: for the same data budget, GPT receives 6.7× more supervised predictions.
At Chinchilla scale (Ch.16 §3), this matters a lot. A 70B-parameter model needs ~1.4T training tokens to be compute-optimal as a decoder-only model. By the same signal-density arithmetic, the same 70B model trained BERT-style would need ~9.3T tokens (1.4T × 6.7) to receive the same number of supervised predictions: much more data and much more compute.
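The arithmetic in this subsection can be checked with a few lines of Python. The constants come from the text (15% masking rate, ~1.4T tokens); the helper supervised_predictions is a hypothetical name, and the calculation assumes that signal density is the only thing that differs between the two objectives.

```python
MASK_RATE = 0.15             # BERT masking rate, from the text
CHINCHILLA_TOKENS = 1.4e12   # ~1.4T tokens for a 70B decoder-only model, from the text

def supervised_predictions(n_tokens, objective):
    """Count loss-contributing positions for a sequence of n_tokens."""
    if objective == "causal_lm":
        return n_tokens - 1          # every position except the last has a target
    if objective == "masked_lm":
        return MASK_RATE * n_tokens  # only the masked positions are scored
    raise ValueError(f"unknown objective: {objective}")

density_ratio = 1 / MASK_RATE
equivalent_tokens = CHINCHILLA_TOKENS * density_ratio

print(f"density ratio ~ {density_ratio:.1f}x")
print(f"equivalent masked-LM tokens ~ {equivalent_tokens / 1e12:.1f}T")
```

This prints a ratio of about 6.7 and an equivalent budget of about 9.3T tokens, which is where the rounded figures above come from.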
This isn’t the only reason decoder-only won (generation, in-context learning, simpler serving all matter too), but it’s the underrated one. Per dollar of training compute, you get a stronger model by giving it dense supervised signal.
Why decoder-only won
Three structural reasons:
- Training-signal density (above): ~6.7× more loss per token, so much more efficient training.
- Unified prefix: a decoder-only model is structurally just “given a prefix, predict the suffix.” Any task can be cast in this form: classification (“Question: X. Answer:”) → next-token; QA (“Context: … Question: … Answer:”) → next-token; translation (“English: X. French:”) → next-token. In-context learning, the ability to learn a task from a few examples in the prompt, is most natural in a decoder-only architecture.
- Simpler serving: one model, one forward pass, one KV cache. T5’s encoder-decoder requires running the encoder once per request and then the decoder once per token; the KV cache is more complex; the deployment overhead is higher. Decoder-only is operationally simpler at every level.
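The “unified prefix” idea from the list above can be sketched as plain string formatting. The templates mirror the ones quoted in the list; the function names (as_prefix, few_shot_prompt) and the example word pairs are hypothetical, and no model is called here.

```python
def as_prefix(task, **fields):
    """Cast a task as a text prefix whose continuation is the answer."""
    templates = {
        "classification": "Question: {x} Answer:",
        "qa": "Context: {context} Question: {question} Answer:",
        "translation": "English: {x} French:",
    }
    return templates[task].format(**fields)

def few_shot_prompt(examples, query):
    """Concatenate solved (source, target) pairs, then the unsolved query."""
    shots = [f"English: {src} French: {tgt}" for src, tgt in examples]
    return "\n".join(shots + [as_prefix("translation", x=query)])

prompt = few_shot_prompt([("cheese", "fromage"), ("bread", "pain")], "water")
print(prompt)
```

A decoder-only model would simply continue this text after the final “French:”, which is all that in-context learning requires of the architecture.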
The strongest reason: emergence of in-context learning and task unification.
GPT-3 (Brown et al., 2020) demonstrated that a sufficiently large decoder-only model could perform tasks given only examples in its prompt, with no fine-tuning and no task-specific architecture. This fits autoregressive language modelling naturally: the model “reads” the prefix as if it were just more training data, and the autoregressive structure conditions on it.
BERT can’t do this: its bidirectional attention assumes a fixed format (with [MASK]s in specific places). T5 can do it weakly but the encoder-decoder split makes the prefix conditioning awkward.
Once in-context learning worked, the case for fine-tuning a separate model per task evaporated. One big decoder-only model + prompt engineering replaced thousands of fine-tuned BERTs.
Supporting reasons:
- Training signal density. Decoder-only gets ~6.7× more loss signal per input token (every position predicts the next). BERT/T5 supervise only ~15% of tokens. Same data budget → ~6.7× more supervised predictions.
- Generation is what users want. Chat and content creation are autoregressive by nature. BERT can’t generate; T5 can but requires more infrastructure.
- Simpler serving. One forward pass, one KV cache, no encoder/decoder split. Operationally clean. Cheaper to deploy at scale.
- Scaling laws favoured decoder-only. Chinchilla and follow-up scaling-law work, which studied decoder-only models, found that loss keeps improving smoothly as you scale; the signal-density argument suggests BERT-style scaling plateaus earlier (less to learn from each token).
What BERT/T5 had that didn’t matter:
- Bidirectional attention (BERT) — better for fixed-format classification, but classification with a decoder-only LLM via prompting matches BERT performance on standard benchmarks at sufficient scale.
- Cleaner classification head (BERT) — at small scale, BERT was better. At large scale, the gap closes.
- Unified text-to-text framing (T5) — turned out to be redundant; decoder-only does this naturally.
- Encoder-side bidirectional context (T5) — useful for source comprehension in translation, but in-context learning provides this through prefix conditioning at scale.
The lesson: architecture differences that mattered at small scale (where data is limited) stopped mattering at large scale (where compute and data are abundant). Decoder-only won because it was simpler, more training-efficient, and reached the same capabilities at scale.
Next: §15.3 — Sampling. The model produces logits; the inference engine has to turn them into tokens. Greedy, temperature, top-k, top-p (nucleus), beam search, and speculative decoding (already covered in Ch.S).