Tokens & embeddings

§1 BPE tokenisation
Text → integer IDs. Byte-Pair Encoding (BPE) starts from a base vocabulary of characters (or raw bytes, in byte-level BPE) and greedily merges the most frequent adjacent pair, one merge at a time, until the vocabulary reaches the target size. Most production LLMs use BPE or a related subword scheme (WordPiece, Unigram), often through a library such as SentencePiece, with a vocabulary of roughly 32K–128K. The choice of tokeniser affects everything downstream: sequence length, embedding-table size, and model behaviour on rare or non-English text.
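The merge loop can be sketched in a few lines of Python. This is a toy illustration, not how any production tokeniser is implemented: the function names (train_bpe, merge_pair, encode) are made up for this example, the corpus is a plain list of words, and ties between equally frequent pairs are broken by first occurrence, which is an arbitrary choice.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every non-overlapping occurrence of `pair` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    # Each distinct word becomes a tuple of characters, weighted by its count.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the corpus.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)  # greedy: most frequent pair
        merges.append(best)
        # Rewrite the corpus with the chosen pair merged.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges


def encode(word, merges):
    """Tokenise one word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)
```

For example, train_bpe(["abab", "abab", "abc"], num_merges=2) learns ("a", "b") first and ("ab", "ab") second, after which encode("abc", merges) yields ["ab", "c"]. Mapping the resulting symbols to integer IDs is then a dictionary lookup over the final vocabulary.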
§2 The embedding table
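A learned matrix W_embed (vocab × d) maps each token ID to a dense vector by row lookup. It is implemented as a gather operation rather than a matmul against one-hot vectors, which would give the same result at far higher cost. It is often tied with the output projection to save parameters. For a 4096-dim model with vocab 128K, the embedding table is 128K × 4096 ≈ 524M params, roughly eight times the ~67M (4d²) of a standard multi-head attention sublayer at that width.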
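A minimal sketch of the lookup, using plain Python lists so it runs without any dependencies. The helper names and the 0.02 initialisation scale are assumptions for illustration only; a real model would use a framework embedding layer. The one-hot version is included only to show that the gather and the matmul agree.

```python
import random


def make_embedding_table(vocab_size, d_model, seed=0):
    """Random stand-in for a learned (vocab_size x d_model) matrix."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(d_model)]
            for _ in range(vocab_size)]


def embed(table, token_ids):
    """Gather: pick row t of the table for each token ID t."""
    return [table[t] for t in token_ids]


def embed_via_one_hot(table, token_ids):
    """Same result as embed(), computed as one-hot vector times matrix."""
    vocab_size, d_model = len(table), len(table[0])
    out = []
    for t in token_ids:
        one_hot = [1.0 if i == t else 0.0 for i in range(vocab_size)]
        out.append([sum(one_hot[i] * table[i][j] for i in range(vocab_size))
                    for j in range(d_model)])
    return out


def embedding_params(vocab_size, d_model):
    """Parameter count of the embedding table."""
    return vocab_size * d_model
```

Taking 128K as 128,000, embedding_params(128_000, 4096) returns 524,288,000. The gather touches one row per token, while the one-hot product touches every row, which is why the lookup form is preferred.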
§3 Positional encoding → RoPE
Self-attention without positional information is permutation-equivariant: it treats input tokens as a set, not a sequence, so position information has to be injected. The lineage: sinusoidal (Vaswani et al. 2017) → learned absolute embeddings (early GPT) → RoPE (Su et al. 2021), the modern default in Llama, Mistral, Qwen. RoPE rotates each pair of query and key dimensions by a position-dependent angle. Because rotations compose, the dot product between a query at position m and a key at position n depends only on the offset m - n, so attention scores carry relative position directly.
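The rotation can be sketched as follows. This is an illustrative version under stated assumptions: it pairs adjacent dimensions (2i, 2i+1) and uses a frequency base of 10000; some implementations instead pair dimension i with i + d/2, and the base is a tunable hyperparameter. The function names rope and dot are hypothetical.

```python
import math


def rope(vec, position, base=10000.0):
    """Rotate each adjacent pair (2i, 2i+1) of `vec` by position * base**(-2i/d)."""
    d = len(vec)
    assert d % 2 == 0, "RoPE needs an even number of dimensions"
    out = [0.0] * d
    for i in range(d // 2):
        theta = position * base ** (-2.0 * i / d)  # angle for this pair
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[2 * i], vec[2 * i + 1]
        # Standard 2-D rotation of (x, y) by theta.
        out[2 * i] = x * c - y * s
        out[2 * i + 1] = x * s + y * c
    return out


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))
```

With this definition, dot(rope(q, 5), rope(k, 3)) and dot(rope(q, 12), rope(k, 10)) agree up to floating-point error, since both pairs are two positions apart. The rotation is applied to queries and keys inside attention, not added to the token embeddings.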