Positional encoding → RoPE

Section 11.3

Transformer attention has a structural quirk that is easy to miss: it doesn’t know what order the tokens came in. The attention operation (Ch.13) reads a set of tokens, computes pairwise scores, and weights them, but permuting the inputs produces the same set of pairwise interactions. To distinguish “the cat sat on the mat” from “the mat sat on the cat” the model needs position information injected somewhere. The original transformer (Vaswani 2017) added a sinusoidal position vector to the token embeddings. Later models tried learned position embeddings. Rotary Position Embedding (RoPE, Su 2021) is the modern default: it encodes position by rotating pairs of embedding dimensions by a position-dependent angle. RoPE preserves the dot-product structure that attention scores rely on (Ch.2 §3’s orthogonal-invariance result, reused), makes scores depend on relative position, and extends to context lengths longer than training through well-understood frequency-scaling methods. Most open-weight LLMs since Llama-1 (2023) use RoPE or a structural variant.

Why the model needs position information

Attention’s core operation looks like this (full treatment in Ch.13):

scores_{ij} = ⟨q_i, k_j⟩ / √d
weights_{ij} = softmax_j(scores_{ij})
output_i = Σ_j weights_{ij} · v_j

(q, k, v are linear projections of the input embeddings)

The score between token i and token j depends only on their respective embeddings, not on their positions. If you permute the inputs, you permute the rows/columns of the scores matrix — the values of every ⟨q_i, k_j⟩ are unchanged, only their assignment to (i, j) positions shifts. So the attention layer is permutation-equivariant on tokens — without explicit position info, “the cat sat” and “sat cat the” produce the same outputs (just reordered).

For most language tasks, word order carries critical meaning. The fix: inject position information into the input vectors before they reach attention.

— think, then check —

MLPs: The input is a flat vector — there’s no ‘order’ to forget. If you reshape a 2D image into a 1D vector and shuffle the pixels, the MLP can still learn to classify, but it has lost spatial structure. Position is implicit in the input ordering.

CNNs: Convolutions slide a filter over the input with a known stride and pattern. The filter’s output at position i directly encodes ‘what does the input look like at position i?’ Position is implicit in the convolution’s iteration order.

Transformers (attention): Attention computes pairwise interactions ⟨q_i, k_j⟩ between every pair of tokens. The operation is permutation-equivariant — shuffling the input tokens just shuffles the output tokens correspondingly, with no change in their values. Without explicit position info, the model literally cannot tell which token came first.

So positional encoding is unique to transformers among these three architectures. RNNs / LSTMs also have implicit position (sequential processing); they don’t need explicit position encoding for the same reason CNNs don’t.

↳ §11.3 motivation

Sinusoidal (Vaswani 2017) — the original

The original “Attention Is All You Need” used fixed sinusoidal position vectors, added directly to the input embeddings:

PE(pos, 2k) = sin(pos / 10000^(2k / d))
PE(pos, 2k+1) = cos(pos / 10000^(2k / d))
x'_t = x_t + PE(t)    ← add position vector to the token embedding

(t is position, k indexes embedding dimensions, d is embedding dim)

The key intuition: different embedding dimensions have different “frequencies” (10000^(2k/d) varies by orders of magnitude across k), so the joint pattern across all dimensions distinguishes positions over any practical sequence length. Sinusoidal PEs also have a nice relative-position property: the dot product between PE(pos1) and PE(pos2) depends only on (pos2 − pos1), because sin(a)·sin(b) + cos(a)·cos(b) = cos(a − b) within each pair of dimensions. The model gets a relative-position signal automatically.

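This worked. It was the default in the original Transformer. BERT and the GPT models switched to learned position embeddings, and T5 used learned relative-position biases instead.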

Learned positional embeddings — early GPT

A simpler alternative: train a separate embedding matrix Wₚₒₛ of shape (max_length × d), one row per position. Add the row corresponding to the current position to the token embedding. Used in GPT-1, GPT-2, BERT, RoBERTa.

The catch: learned position embeddings have a fixed maximum length. Train at 512 tokens, and you can only run inference at up to 512 tokens; past that, there’s no row to look up. Sinusoidal PE can be evaluated at any position (though quality beyond the training length is not guaranteed); learned PE cannot.
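The fixed-length limit is visible directly in the lookup. A minimal sketch, assuming a row-major table layout; the function name add_learned_pe and its return-code convention are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Add row `pos` of a learned position table to a token embedding, in place.
 * w_pos: max_len * d floats, row-major (one row per position).
 * Returns 0 on success, -1 if the table has no row for `pos`. */
int add_learned_pe(float *x, const float *w_pos,
                   size_t max_len, size_t d, size_t pos) {
    assert(x != NULL && w_pos != NULL);
    if (pos >= max_len) {
        return -1;  /* past the trained length: nothing to look up */
    }
    for (size_t c = 0; c < d; c++) {
        x[c] += w_pos[pos * d + c];
    }
    return 0;
}
```

There is no formula to fall back on when pos reaches max_len. The only options are to refuse, truncate the input, or train a larger table.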

Rotary Position Embedding — Su 2021

Su, Lu, Pan, Wen & Liu 2021 (“RoFormer: Enhanced Transformer with Rotary Position Embedding,” arXiv:2104.09864) proposed a different mechanism that has displaced both earlier approaches.

RoPE (Rotary Position Embedding, Su et al. 2021), the modern default: encodes token position by rotating each pair of embedding dimensions (2k, 2k+1) by an angle m · θ_k, where m is the position and θ_k = base^(-2k/d) is a per-pair frequency. Applied to the q and k vectors before attention. Preserves the dot-product structure that attention scores rely on, naturally encodes RELATIVE position (the rotation between positions m and n depends only on n − m), and has clean extension properties for long contexts (NTK scaling, YaRN). Default in Llama, Mistral, Qwen, DeepSeek, Gemma, Phi, and most other open-weight LLMs released since 2023.

Concretely, RoPE rotates pairs of embedding dimensions by a position-dependent angle:

For each pair of dimensions (2k, 2k+1):
  Treat (x[2k], x[2k+1]) as a 2D vector.
  Rotate it by angle m · θ_k, where m is the position and θ_k = base^(-2k/d).

In matrix form per pair:

R(m · θ_k) = ⎡ cos(m θ_k)  −sin(m θ_k) ⎤
             ⎣ sin(m θ_k)   cos(m θ_k) ⎦

Apply to embedding x at position m:

x'_m = R_m(θ) · x_m    (block-diagonal rotation, one 2D rotation per pair)

In practice we only apply RoPE to the q and k vectors that go into attention, not to the full embeddings and not to v. The interactive figure shows the operation: each pair of dimensions is its own 2D plane, and the position rotates each plane at its own frequency.

[Interactive figure: RoPE rotation of four dimension pairs (d = 8). Controls: position m (shown at 2.0) and log₁₀(base) (shown at 2.5, base ≈ 316). Frequencies at this setting: pair 0, θ = 1.0000; pair 1, θ = 0.2371; pair 2, θ = 0.0562; pair 3, θ = 0.0133.]

Each pair of embedding dims (2k, 2k+1) gets its own rotation frequency θ_k = base^(−2k/d). Position m rotates pair k by angle m · θ_k. As m increases, pair 0 (highest frequency) rotates fast while pair 3 (lowest frequency) barely moves. The full “RoPE-applied” vector at position m is the joint state across all 4 pairs; that joint state uniquely encodes the position, and the model can read position from it. The geometric progression of frequencies, controlled by base, is what lets a single set of rotations encode positions over very long ranges (32K, 128K, 1M+ in modern long-context models).

Three reasons RoPE wins

(1) Dot product preservation. This is the connection back to Ch.2 §3. RoPE’s rotation is an orthogonal transformation: R · Rᵀ = I. So ‖Rx‖ = ‖x‖, and ⟨Rx, Ry⟩ = ⟨x, y⟩ whenever x and y are rotated by the same R. Attention scores aren’t disturbed by the rotation itself; they change only through the difference between rotation angles at different positions.

(2) Relative position falls out automatically. The attention score between rotated queries and keys becomes:

⟨R_m · q, R_n · k⟩ = qᵀ R_mᵀ R_n · k = qᵀ R_{n−m} · k    (because R_mᵀ R_n = R_{n−m} for rotations)

So the attention score depends only on the RELATIVE position (n − m), NOT on the absolute positions m and n separately.

This is a structural property the model gets for free — the dot product between a query at position m and a key at position n depends only on their difference. Pre-RoPE, models had to learn this relative-position bias; with RoPE, it’s encoded in the geometry.

(3) Long-context extension. Train at 4K context. At inference, encounter a 16K-token document. With learned PE: impossible (no row for position 5000). With RoPE: the rotation formula is parametric in position, so it can be evaluated at any position, though quality degrades at positions outside the training range. Modern extensions (NTK scaling, YaRN) rescale the frequencies to reduce that degradation and have been used to push context windows to 128K tokens and, in some models, 1M+. Llama 3, for example, was pretrained at 8K context, and Llama 3.1 extended it to 128K using RoPE frequency scaling together with additional long-context training.

— think, then check —

A single frequency would only encode position modulo the period. A token at position m and a token at position m + 2π/θ would be encoded identically — the rotation would wrap around. The model could not distinguish them.

The geometric progression θ_k = base^(-2k/d) gives each pair a different period. Pair 0 has the highest frequency (shortest period, 2π ≈ 6.3 tokens); pair d/2-1 has the lowest frequency (longest period: for d = 64, base = 10000, θ ≈ 1/7500, so one full turn takes 2π/θ ≈ 47,000 tokens). Different pairs disambiguate different scales.

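At position m, the joint state across all pairs is a unique vector in the ‘rotation product space.’ To find a colliding position m’, every pair would need to be at the same angle, which means m’ − m must be a common multiple of every pair’s period. Pair 0 alone rules out exact collisions: its period is 2π, which is irrational, so no nonzero integer offset is an exact multiple of it. What remains are near-collisions, and with geometrically spaced periods an offset that brings every pair close to its starting angle at once is enormous (in the standard config, far longer than any training sequence).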

This is the same trick used in clocks: hours, minutes, seconds at different rates uniquely encode time within a 12-hour period. RoPE does the same thing across d/2 ‘clocks’ at different rates, with positions where all hands agree separated by impractically long sequences.
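The periods of the fastest and slowest "clocks" can be worked out directly from θ_k = base^(-2k/d). The numbers below assume d = 64 and base = 10000 and are rounded.

```latex
\[
T_k \;=\; \frac{2\pi}{\theta_k} \;=\; 2\pi \cdot \mathrm{base}^{2k/d}
\]
\[
d = 64,\ \mathrm{base} = 10000:\qquad
T_0 = 2\pi \approx 6.3,
\qquad
\frac{1}{\theta_{31}} = 10000^{62/64} \approx 7.5 \times 10^{3},
\qquad
T_{31} = \frac{2\pi}{\theta_{31}} \approx 4.7 \times 10^{4}
\]
```

So the slowest pair advances one radian roughly every 7,500 positions and completes a full turn roughly every 47,000 positions, while the fastest pair completes a turn in about 6 positions.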

↳ §11.3 RoPE mechanics

The RoPE recipe in code

In practice, RoPE is implemented as a per-token preprocessing step before attention:

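#include <math.h>  // powf, cosf, sinf

// Apply RoPE to a single token's query (or key) vector at position m.
// q: input vector of size d (d must be even). The output overwrites q in place.
// base: frequency base, e.g. 10000.
void rope_apply(float* q, int d, int m, float base) {
    for (int k = 0; k < d / 2; k++) {
        // Per-pair frequency: theta_k = base^(-2k/d).
        float theta_k = powf(base, -2.0f * (float)k / (float)d);
        float angle = (float)m * theta_k;
        float c = cosf(angle), s = sinf(angle);
        // Rotate the pair (q[2k], q[2k+1]) by `angle` in its own 2D plane.
        float x = q[2*k], y = q[2*k+1];
        q[2*k]   = c * x - s * y;
        q[2*k+1] = s * x + c * y;
    }
}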

That’s it: about ten lines of code. Production implementations typically precompute the sin/cos tables once at model load time, so the per-call cost is about 2d multiplies and d additions per vector, negligible compared to the actual attention matmul.

The connection back to Ch.2 §3 is operational, not just analogical. Ch.2 §3 said orthogonal Q preserves dot products: ⟨Qx, Qy⟩ = ⟨x, y⟩. RoPE applies a block-diagonal orthogonal Q (each block a 2D rotation) — but with the rotation angle depending on the token’s position. Same invariance, different per-token Q. The fact that attention scores can use the RoPE-rotated q and k vectors directly without breaking dot-product geometry is exactly because of the Ch.2 §3 result. Without orthogonality, the rotated scores would be a different — and probably worse — function of position. The math fell out beautifully.

— think, then check —

R_m and R_n are block-diagonal rotation matrices (one 2D rotation per pair, by angles m·θ_k and n·θ_k respectively). Each block is orthogonal: R_m^T R_m = I.

Compute:

⟨R_m·q, R_n·k⟩ = (R_m·q)^T · (R_n·k) = q^T · R_m^T R_n · k.

For rotation matrices: R_m^T R_n = R_{n−m} (rotation by angle (n−m)·θ_k per pair). So:

⟨R_m·q, R_n·k⟩ = q^T · R_{n−m} · k.

This depends on m and n ONLY through their difference (n−m). Absolute positions don’t appear.
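The rotation identity used in this derivation can be verified for a single 2×2 block by multiplying out and applying the angle-difference formulas. Here α = m·θ_k and β = n·θ_k are shorthand introduced only for this check.

```latex
\[
R(\alpha)^{\top} R(\beta)
=
\begin{pmatrix} \cos\alpha & \sin\alpha \\ -\sin\alpha & \cos\alpha \end{pmatrix}
\begin{pmatrix} \cos\beta & -\sin\beta \\ \sin\beta & \cos\beta \end{pmatrix}
\]
\[
=
\begin{pmatrix}
\cos\alpha\cos\beta + \sin\alpha\sin\beta & -\cos\alpha\sin\beta + \sin\alpha\cos\beta \\
-\sin\alpha\cos\beta + \cos\alpha\sin\beta & \sin\alpha\sin\beta + \cos\alpha\cos\beta
\end{pmatrix}
=
\begin{pmatrix}
\cos(\beta-\alpha) & -\sin(\beta-\alpha) \\
\sin(\beta-\alpha) & \cos(\beta-\alpha)
\end{pmatrix}
= R(\beta - \alpha)
\]
```

Since β − α = (n − m)·θ_k, each block of R_m^T R_n is a rotation by (n − m)·θ_k, and the block-diagonal product is R_{n−m}.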

Why it matters: the attention score between two tokens depends only on their RELATIVE position (how far apart they are in the sequence), not on their absolute positions. So:

The model learns relative-position patterns (‘attention to the previous noun phrase’) rather than absolute-position patterns (‘attention to position 47’), which generalises much better.
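Context length extension (training at 4K, inferring at 128K) is made tractable because the relative-position math is uniform: the model has no representation of ‘position 47000 specifically’ to be uncertain about, only ‘this token is 200 positions before that token.’ Relative distances longer than any seen in training are still out of distribution, which is the gap the frequency-scaling methods address.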
Sliding-window attention, RoPE prefix-caching, and long-context optimisations all compose cleanly with relative-position encoding.

This is one of the cleanest applications of Ch.2 §3’s orthogonal-invariance identity to a production architecture decision. The ‘dot product preserved by orthogonal Q’ result that drove TurboQuant (Ch.25) is the same identity behind RoPE’s relative-position property. Two different production systems leaning on the same one-line Ch.2 §3 fact.

↳ §11.3 RoPE invariance + Ch.2 §3 orthogonal Q

END OF CH.11 — Tokens & embeddings.
§1 (BPE tokenisation) · §2 (embedding table) · §3 (RoPE positional encoding).

We now have text → token IDs → vectors → vectors-with-position-info. The transformer’s input pipeline is assembled. Coming next: Ch.12 — Softmax & the exponential family. Numerical stability, cross-entropy, and the online-softmax recurrence that unlocks FlashAttention in Ch.13.