TOKENS & EMBEDDINGS
Section 11.1
01

BPE tokenisation

The book up to here has worked with vectors and matrices — quantities whose entries are real numbers. But the inputs LLMs actually take are strings of text. The bridge is tokenisation: a deterministic procedure that splits text into a sequence of tokens, each of which has an integer ID drawn from a fixed vocabulary. The model never sees the original characters; it sees the sequence of IDs and looks up a vector for each (§11.2). For a returning systems engineer the analogy that fits best is “tokenisation is to text what UTF-8 encoding is to Unicode” — a fixed-rule mapping between an abstract representation (characters / scalar values) and a concrete representation (token IDs / byte sequences) the runtime can actually process. The choice of tokeniser is one of the load-bearing design decisions in any LLM stack, and it’s almost always BPE or a close variant.

Three things tokenisation has to balance

A good tokenisation has to navigate three tensions:

Three obvious extreme choices fail in different ways:

ProsCons
Word-levelOne token per word; small sequencesOOV words can’t be represented; vocab explodes with rare words; can’t handle non-space-separated languages (Chinese, Japanese)
Character-levelTiny vocab; perfect coverageSequences are 4–5× longer; attention cost grows quadratically (Ch.13); model must learn word-level structure from scratch
Subword (BPE)Common words one token; rare words split into pieces; bounded vocabMore complex; vocab size is a tuning knob

BPE — invented by Gage 1994 as a data-compression algorithm and adapted to NLP by Sennrich, Haddow & Birch 2016 (“Neural Machine Translation of Rare Words with Subword Units,” ACL) — solves the three-way trade. Almost every production LLM uses BPE or a structural cousin.

The algorithm

BPE training, in pseudocode:

Input: corpus (large text), target vocabulary size V Output: a merge list (an ordered sequence of pairs) 1. Initialise: every character is its own token. Vocab = {all characters}. 2. While |vocab| < V: a. Count the frequency of every adjacent token pair in the corpus. b. Find the most-frequent pair (left, right). c. Add the merged pair to the vocab as a new token. d. Append (left, right) → merged to the merge list. 3. Return the merge list. Encoding new text at INFERENCE time: 1. Split input into characters. 2. While any pair in the current token list is in the merge list: a. Find the merge with the EARLIEST learned priority. b. Apply it to the leftmost matching pair. 3. Return the resulting token sequence.

The viz below shows the encoding step at work. Type any text; click “next” to apply one merge at a time, or “play all” to run them in sequence:

step 0 / 8
step 0 — initial state (1 char per token)
tokens: 25 final: 17
thequickbrownfoxjumps
BPE starts by treating every character as a separate token. At each step it finds the most-frequently-co-occurring adjacent pair (learned during training) and merges it into a single new token. After enough merges, common subwords like the, ␣quick, ␣brown become single tokens. The number of merges determines the vocabulary size; modern LLMs use 32K–128K tokens.
Click next or play all. Each step applies the highest-priority merge from a tiny hand-crafted merge table (a real BPE trainer learns these from billions of tokens). Watch the 25-character input collapse to a small handful of subword tokens — that's the encode side of every tokenizer.

Each step finds the highest-priority pair in the merge list and combines it. Watch how common subwords (the, qu, brown) emerge as single tokens after several merges — that’s BPE’s compression in action.

— think, then check —

BPE training starts with character-level tokens and repeatedly merges the most-frequent adjacent token pair in the training corpus, adding the merged pair as a new vocab entry, until the vocab reaches a target size.

Structural property exploited: language has strong adjacent-pair statistics. Common letter pairs (th, in, er, on), common subwords (’ the’, ’ and’), common morphemes (-ing, -ed), and common whole words occur far more often than chance. By greedily merging the most-frequent pairs, BPE builds a vocabulary that gives one token to anything common and multiple tokens to anything rare — automatically matching the vocab to the data’s frequency distribution.

The result is that the average sequence length (in tokens) is much shorter than at the character level, but rare/novel inputs (technical jargon, non-English text, code identifiers) can still be represented by falling back to smaller pieces. Best of both worlds, modulo the vocabulary-size knob.

How big should the vocabulary be?

The standard sizes you’ll see in production models:

Model familyVocabulary size
GPT-2 (2019)50,257
GPT-3 (2020)50,257
Llama-1 (2023)32,000
Llama-2 (2023)32,000
Llama-3 (2024)128,256
GPT-4 (2023)~100K
Qwen3 (2025)152,064
DeepSeek-V3 (2024)129,280

The trend is upward — Llama-3’s 4× increase over Llama-2 reflected the design choice to handle multilingual and code data better. The cost: 4× the parameters in the embedding table (and the output projection, which is the same shape transposed). For a 4096-dim model, going from 32K vocab to 128K vocab adds 4096 · (128K − 32K) · 2 = 786M parameters to the model — comparable to a few attention layers.

The benefit: shorter sequences for the same text, so attention (Ch.13) costs less. For a transformer trained on internet text, vocab 128K typically gives ~3.5 characters per token vs vocab 32K’s ~4.5 characters per token — about 25% shorter sequences for the same content. Given attention’s quadratic cost, that’s a 1.5× compute savings on long sequences. Larger vocab buys compute savings at inference but costs more memory at every step; the tradeoff has favoured “bigger” as memory becomes more abundant and context windows grow.

— think, then check —

Larger vocab = shorter sequences for the same content. At 4096-dim, going from 32K to 128K vocab typically reduces the characters-per-token average from ~4.5 to ~3.5, a 25% sequence-length reduction on typical English / code / multilingual text.

Cost saved at inference time: attention is O(N²) in sequence length (without FlashAttention’s tricks), so a 25% length reduction is a ~44% compute reduction on the attention sublayer. For long-context models (Llama-3’s 128K context, Gemini’s 1M context), attention cost dominates everything else; sequence-length reductions at the tokeniser level translate directly to massive inference cost savings.

For a 70B model running at 128K context, ~25% shorter sequences = roughly 35% lower attention cost per request = 35% more throughput for the same hardware. That’s a much better trade than the ~1% extra parameter memory the bigger embedding table costs. As context windows grow, larger vocabularies become MORE valuable — which is why every new frontier model since 2024 has been pushing vocab size up.

Edge cases the tokeniser handles (and sometimes botches)

A non-exhaustive list of where tokenisers get interesting:

— think, then check —

(1) Sequence length explodes. Character-level tokenisation gives ~4-5x longer sequences than BPE. For a 128K-token context window in BPE terms, character-level would need 512K-640K tokens for the same content. Attention is O(N²) in sequence length without FlashAttention tricks; even with FlashAttention’s O(N) memory + O(N²) compute, the compute cost grows by 25× for a 5× longer sequence. Same context coverage, 25× more flops per forward pass.

(2) The model wastes capacity learning word-level structure. BPE gives the model ‘the’ as one token; character-level requires the model to learn that t-h-e is a word. Empirically, character-level models need much more training data to reach the same downstream performance — recent research (ByT5, Xue et al. 2022) shows ~2-3× more training compute for comparable scores on standard benchmarks. The model is rediscovering tokenisation as part of its training.

(3) Throughput on long-context retrieval, RAG, code is decimated. Code and structured documents are MORE sensitive to sequence-length amplification than prose — code has lots of long identifiers and structural tokens that BPE handles cleanly but character-level inflates. A 32K-line code review fits in BPE’s 128K context but exceeds character-level’s effective capacity, even with the wider context window.

The “character-level avoids tokeniser complexity” argument is appealing in research papers (no tokeniser-induced biases, perfect coverage) but operationally untenable for production LLMs serving users. BPE’s complexity is small in implementation (~500 lines of code in tiktoken or tokenizers) and the wins are large. The trade is settled for production; character-level survives in specific niches (low-resource languages, code understanding) where its uniformity outweighs the inefficiency.

END OF CH.11 §1 — BPE tokenisation.
Built: BPETokenize viz (interactive: type text, step through merges, see characters collapse into subword tokens). Three recall items: easy (algorithm + structural property), medium (vocab-size tradeoff with quantitative cost analysis), hard (character-level alternative critique).
Coming next: §11.2 — The embedding table. Token IDs become dense vectors via row lookup.