The dot product
Here is the atom. Every score in this book — the similarity between two embeddings, the QKᵀ entry inside attention, the distance computation inside HNSW — bottoms out in this single operation.
You know the mechanics. What’s worth re-seizing is the geometry, because it’s what makes the dot product the universal similarity measure:
Read that right-to-left and it tells a story: the dot product is large and positive when two vectors point the same way (cos θ → 1), zero when orthogonal (θ = 90°), and negative when they oppose. “Similarity” in an embedding space is just this angle. When you strip out magnitude — divide by both lengths — you get cosine similarity, the metric most text-embedding systems use by default.
The projection picture is the one to keep: ⟨a,b⟩ is the length of b’s shadow on a, scaled by ‖a‖. Attention scores are exactly this — “how much of the query points along this key.”
Now make it run — scalar first, then SIMD
A toy Python dot product teaches the math but hides the thing you actually care about: how the machine does it fast. We’ll always show both. Here is the honest scalar version, then the vectorized one.
static float dot_scalar(const float* a, const float* b, int n) {
float acc = 0.0f;
for (int i = 0; i < n; i++) acc += a[i] * b[i];
return acc;
}One multiply-add per element, fully serial. The CPU can do eight of these at once. The AVX version packs eight floats per register and uses one fused multiply-add intrinsic:
__m256 acc = _mm256_setzero_ps(); /* 8 float lanes, all 0.0 — running sums */
int i = 0;
for (; i + 8 <= n; i += 8) {
__m256 va = _mm256_loadu_ps(a + i); /* 8 floats of a → register */
__m256 vb = _mm256_loadu_ps(b + i); /* 8 floats of b → register */
acc = _mm256_fmadd_ps(va, vb, acc); /* 8× (aᵢ·bᵢ + accᵢ), one instr, one rounding */
}
/* ---- horizontal reduction: 8 lanes → 1 scalar ---- */
__m128 lo = _mm256_castps256_ps128(acc); /* lanes 0–3 (free reinterpret) */
__m128 hi = _mm256_extractf128_ps(acc, 1); /* lanes 4–7 */
lo = _mm_add_ps(lo, hi); /* {0+4, 1+5, 2+6, 3+7} → 4 sums */
lo = _mm_hadd_ps(lo, lo); /* pairwise → 2 sums */
lo = _mm_hadd_ps(lo, lo); /* pairwise → 1 sum in lane 0 */
float acc8 = _mm_cvtss_f32(lo); /* lane 0 → plain float */
for (; i < n; i++) acc8 += a[i] * b[i]; /* tail when n not divisible by 8 */
return acc8;
}
Two things to notice, because they recur in every kernel in this book. (1) The accumulator is a vector, not a scalar — you keep eight independent running sums and only collapse them at the very end. That “carry partial results, reduce once” pattern is exactly the online-reduction trick behind FlashAttention (Ch. 13). (2) The horizontal sum is awkward on purpose — SIMD lanes are great at independent work, bad at talking to each other. That tension drives a lot of kernel design.
Depth pass — every named intrinsic, taken apart
The prototype above names a stack of intrinsics like they’re self-evident. They’re not. Before any instruction, you need to know what it operates on.
A modern x86 CPU has a set of extra-wide registers — physically, slots of silicon on the chip — called the AVX registers, each 256 bits wide. That’s the whole story behind the type name __m256:
__m256256 bits ÷ 32 bits per float = 8 floats. So a __m256 is a single value that is eight floats glued side by side. The compiler keeps it in one AVX register; you manipulate all eight together. The eight slots are called lanes.
This is the mental upgrade from your scalar past: a float acc is one number in one register. A __m256 acc is eight numbers in one wide register, and an instruction like “add” hits all eight in the same clock. The variants you’ll meet are just other widths and types:
Every Intel intrinsic (a C function that compiles to one CPU instruction) follows a rigid naming grammar. Once you can parse the name, you can read code you’ve never seen. Take the one that looked like noise above:
_mm256_fmadd_psReads as: “on a 256-bit register, do a fused multiply-add, treating the lanes as packed single-precision floats.” The suffix grammar is fixed: ps = packed single (float), pd = packed double, ss = scalar single (just lane 0), epi32 = packed 32-bit integers, epi8 = packed bytes.
So _mm256_setzero_ps = “256-bit, set to zero, as floats” → eight zero lanes. _mm256_loadu_ps = “256-bit, load unaligned, as floats” → copy 8 floats from memory into a register (the u means the address needn’t be 32-byte-aligned; the aligned version _mm256_load_ps is marginally faster but crashes on a misaligned pointer). You are now reading intrinsics as sentences, not memorizing them.
What fmadd actually computes, lane by lane
The fused multiply-add takes three register arguments and computes, for every lane i simultaneously:
acc = _mm256_fmadd_ps(va, vb, acc) — eight independent aᵢ·bᵢ + accᵢ in one instruction. “Fused” = the multiply and add
share a single rounding step, so it’s both faster and more accurate than doing them separately.This is why the accumulator is itself a __m256: each of the eight lanes carries its own running partial sum. Lane 0 accumulates a₀b₀ + a₈b₈ + a₁₆b₁₆ + …, lane 1 accumulates a₁b₉…, and so on. Eight independent dot products of every-8th element, running in parallel. They only get combined at the very end.
That “very end” — the horizontal sum, dismantled
After the loop you hold eight partial sums in one register and need their total — a single scalar. SIMD hardware is built for vertical ops (lane i of one register with lane i of another) and is deliberately clumsy at horizontal ops (combining lanes within one register). That clumsiness is the whole reason this tail looks weird. Walk it one instruction at a time:
_mm256_cvt… (cvt = convert) which does change representation, e.g. float→int.1 picks the high half; 0 would pick the low half, same as the cast above). Gives you lanes 4–7 as a __m128._mm256_ prefix because we are now on 128-bit registers.)hadd = horizontal add. Feeding it the same register twice and calling it twice collapses 4 → 2 → 1, leaving the grand total in lane 0.hadd turns {p,q,r,s} into {p+q, r+s, …} — pairs summed. A second turns {p+q, r+s} into {p+q+r+s, …}. Two halvings to combine four numbers.float you can return. cvt = convert, ss = scalar single (lane 0 only). The bridge from SIMD-land back to a normal scalar.The pattern worth keeping — not the specific seven instructions, but the shape: accumulate vertically in wide lanes for the whole loop, then pay a small one-time horizontal reduction at the end. Every reduction kernel in this book — dot products now, stable softmax in Ch. 12, FlashAttention’s running statistics in Ch. 13 — is a variation on “keep partial results in lanes, reduce once.” The horizontal sum is annoying precisely because the hardware wants you to do it rarely.
Footnote for the curious: production code often skips hadd (it’s slower than it looks) and uses a shuffle-and-add ladder instead. We will meet _mm_shuffle_ps in §1.4. The version here is the most readable correct reduction, which is the right default until you’re profiling.
The fully annotated, compiled source — every line accounted for — is in code/ch01/dot_avx.c. It is the same kernel the Astro build verifies on every build.
Because ⟨a,b⟩ = ‖a‖‖b‖cos θ is defined purely by lengths and the angle between the two vectors — quantities that don’t change if you rotate the whole coordinate frame. Relabeling axes is a rotation, and rotations preserve dot products. This is precisely the property TurboQuant exploits (Ch. 25): it rotates every vector before quantizing, knowing the dot products it ultimately needs survive the rotation untouched.