The dot product

Section 1.2

The dot product

Here is the atom. Every score in this book — the similarity between two embeddings, the QKᵀ entry inside attention, the distance computation inside HNSW — bottoms out in this single operation.

# the definition — sum of elementwise products ⟨a, b⟩ = a·b = Σᵢ aᵢ bᵢ = a₀b₀ + a₁b₁ + ... + aₙ₋₁bₙ₋₁

You know the mechanics. What’s worth re-seizing is the geometry, because it’s what makes the dot product the universal similarity measure:

⟨a, b⟩ = ‖a‖ ‖b‖ cos θ # magnitude × magnitude × alignment

Read that right-to-left and it tells a story: the dot product is large and positive when two vectors point the same way (cos θ → 1), zero when orthogonal (θ = 90°), and negative when they oppose. “Similarity” in an embedding space is just this angle. When you strip out magnitude — divide by both lengths — you get cosine similarity, the metric most text-embedding systems use by default.

angle between a, b θ=48° ⟨a,b⟩=0.57

Drag to change the angle between a and b. Watch ⟨a,b⟩ and the projection (the shadow of one onto the other) move together. At 90° the projection vanishes — that’s orthogonality.

The projection picture is the one to keep: ⟨a,b⟩ is the length of b’s shadow on a, scaled by ‖a‖. Attention scores are exactly this — “how much of the query points along this key.”

Now make it run — scalar first, then SIMD

A toy Python dot product teaches the math but hides the thing you actually care about: how the machine does it fast. We’ll always show both. Here is the honest scalar version, then the vectorized one.

dot_scalar.c baseline · scalar

static float dot_scalar(const float* a, const float* b, int n) {
    float acc = 0.0f;
    for (int i = 0; i < n; i++) acc += a[i] * b[i];
    return acc;
}

One multiply-add per element, fully serial. The CPU can do eight of these at once. The AVX version packs eight floats per register and uses one fused multiply-add intrinsic:

dot_avx.c AVX2 + FMA · x86-64

    __m256 acc = _mm256_setzero_ps();   /* 8 float lanes, all 0.0 — running sums */
    int i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i); /* 8 floats of a → register */
        __m256 vb = _mm256_loadu_ps(b + i); /* 8 floats of b → register */
        acc = _mm256_fmadd_ps(va, vb, acc); /* 8× (aᵢ·bᵢ + accᵢ), one instr, one rounding */
    }
    /* ---- horizontal reduction: 8 lanes → 1 scalar ---- */
    __m128 lo = _mm256_castps256_ps128(acc);   /* lanes 0–3 (free reinterpret) */
    __m128 hi = _mm256_extractf128_ps(acc, 1); /* lanes 4–7 */
    lo = _mm_add_ps(lo, hi);                   /* {0+4, 1+5, 2+6, 3+7} → 4 sums */
    lo = _mm_hadd_ps(lo, lo);                  /* pairwise → 2 sums */
    lo = _mm_hadd_ps(lo, lo);                  /* pairwise → 1 sum in lane 0 */
    float acc8 = _mm_cvtss_f32(lo);            /* lane 0 → plain float */

    for (; i < n; i++) acc8 += a[i] * b[i];    /* tail when n not divisible by 8 */
    return acc8;
}

Two things to notice, because they recur in every kernel in this book. (1) The accumulator is a vector, not a scalar — you keep eight independent running sums and only collapse them at the very end. That “carry partial results, reduce once” pattern is exactly the online-reduction trick callback / forward ref Carry running statistics through a tiled computation, reduce once at the end. Here: 8 partial sums. In FlashAttention (Ch. 13): a running max + sum for softmax. Same idea you'll see in FlashAttention. The horizontal sum here is a reduction; the only question in harder cases is whether the reduction is associative (this one is; softmax isn't, which is why it needs more care). behind FlashAttention (Ch. 13). (2) The horizontal sum is awkward on purpose — SIMD lanes are great at independent work, bad at talking to each other. That tension drives a lot of kernel design.

Depth pass — every named intrinsic, taken apart

The prototype above names a stack of intrinsics like they’re self-evident. They’re not. Before any instruction, you need to know what it operates on.

A modern x86 CPU has a set of extra-wide registers — physically, slots of silicon on the chip — called the AVX registers, each 256 bits wide. That’s the whole story behind the type name __m256:

decoding · __m256

__m

Microsoft/Intel convention for "machine vector type"

256

width in bits — one AVX register

256 bits ÷ 32 bits per float = 8 floats. So a __m256 is a single value that is eight floats glued side by side. The compiler keeps it in one AVX register; you manipulate all eight together. The eight slots are called lanes.

This is the mental upgrade from your scalar past: a float acc is one number in one register. A __m256 acc is eight numbers in one wide register, and an instruction like “add” hits all eight in the same clock. The variants you’ll meet are just other widths and types:

the type family

__m128

128-bit SSE register — 4 floats (or 2 doubles, or 16 bytes). The older, narrower vector.

__m256

256-bit AVX register — 8 floats. What we use in the dot product.

__m512

512-bit AVX-512 register — 16 floats. The widest on current x86.

__m256i

256-bit holding integers — 8×int32, 16×int16, or 32×int8. The i suffix = integer lanes. (Quantization kernels use this — Ch. 3.)

Every Intel intrinsic (a C function that compiles to one CPU instruction) follows a rigid naming grammar. Once you can parse the name, you can read code you’ve never seen. Take the one that looked like noise above:

decoding · _mm256_fmadd_ps

_mm256

operates on 256-bit registers

fmadd

the operation: fused multiply-add

"packed single" — 32-bit float lanes

Reads as: “on a 256-bit register, do a fused multiply-add, treating the lanes as packed single-precision floats.” The suffix grammar is fixed: ps = packed single (float), pd = packed double, ss = scalar single (just lane 0), epi32 = packed 32-bit integers, epi8 = packed bytes.

So _mm256_setzero_ps = “256-bit, set to zero, as floats” → eight zero lanes. _mm256_loadu_ps = “256-bit, load unaligned, as floats” → copy 8 floats from memory into a register (the u means the address needn’t be 32-byte-aligned; the aligned version _mm256_load_ps is marginally faster but crashes on a misaligned pointer). You are now reading intrinsics as sentences, not memorizing them.

What `fmadd` actually computes, lane by lane

The fused multiply-add takes three register arguments and computes, for every lane i simultaneously:

acc = _mm256_fmadd_ps(va, vb, acc) — eight independent aᵢ·bᵢ + accᵢ in one instruction. “Fused” = the multiply and add share a single rounding step, so it’s both faster and more accurate than doing them separately.

This is why the accumulator is itself a __m256: each of the eight lanes carries its own running partial sum. Lane 0 accumulates a₀b₀ + a₈b₈ + a₁₆b₁₆ + …, lane 1 accumulates a₁b₉…, and so on. Eight independent dot products of every-8th element, running in parallel. They only get combined at the very end.

That “very end” — the horizontal sum, dismantled

After the loop you hold eight partial sums in one register and need their total — a single scalar. SIMD hardware is built for vertical ops (lane i of one register with lane i of another) and is deliberately clumsy at horizontal ops (combining lanes within one register). That clumsiness is the whole reason this tail looks weird. Walk it one instruction at a time:

Start: eight partial sums sitting in one 256-bit register.

_mm256_castps256_ps128(acc)

__m256 → __m128 · "cast 256-bit-ps to 128-bit-ps"

does

Nothing at runtime. It is a reinterpret, not a move — it tells the compiler "treat the bottom 128 bits of this 256-bit register as a 128-bit register." Zero instructions, zero cost. Gives you lanes 0–3.

why "cast"

Same sense as a C pointer cast: reinterpret the bits, don't convert them. Contrast with _mm256_cvt… (cvt = convert) which does change representation, e.g. float→int.

_mm256_extractf128_ps(acc, 1)

__m256 → __m128 · "extract a 128-bit float half"

does

Pulls out the upper 128 bits (the 1 picks the high half; 0 would pick the low half, same as the cast above). Gives you lanes 4–7 as a __m128.

now you have

Two 128-bit registers: lo = {s0,s1,s2,s3} and hi = {s4,s5,s6,s7}.

_mm_add_ps(lo, hi)

(__m128, __m128) → __m128 · "128-bit, add, packed single"

does

Ordinary vertical add of two 128-bit registers — 4 lanes added to 4 lanes. Result: {s0+s4, s1+s5, s2+s6, s3+s7}. Eight partial sums folded down to four. (No _mm256_ prefix because we are now on 128-bit registers.)

_mm_hadd_ps(v, v)

(__m128, __m128) → __m128 · "128-bit, HORIZONTAL add, packed single"

does

The one genuinely horizontal instruction: it adds adjacent lanes within a register. hadd = horizontal add. Feeding it the same register twice and calling it twice collapses 4 → 2 → 1, leaving the grand total in lane 0.

why twice

One hadd turns {p,q,r,s} into {p+q, r+s, …} — pairs summed. A second turns {p+q, r+s} into {p+q+r+s, …}. Two halvings to combine four numbers.

_mm_cvtss_f32(v)

__m128 → float · "convert scalar single to a C float"

does

Extracts lane 0 as an ordinary float you can return. cvt = convert, ss = scalar single (lane 0 only). The bridge from SIMD-land back to a normal scalar.

The pattern worth keeping — not the specific seven instructions, but the shape: accumulate vertically in wide lanes for the whole loop, then pay a small one-time horizontal reduction at the end. Every reduction kernel in this book — dot products now, stable softmax in Ch. 12, FlashAttention’s running statistics in Ch. 13 — is a variation on “keep partial results in lanes, reduce once.” The horizontal sum is annoying precisely because the hardware wants you to do it rarely.

Footnote for the curious: production code often skips hadd (it’s slower than it looks) and uses a shuffle-and-add ladder instead. We will meet _mm_shuffle_ps in §1.4. The version here is the most readable correct reduction, which is the right default until you’re profiling.

The fully annotated, compiled source — every line accounted for — is in code/ch01/dot_avx.c. It is the same kernel the Astro build verifies on every build.

— think, then check —

Because ⟨a,b⟩ = ‖a‖‖b‖cos θ is defined purely by lengths and the angle between the two vectors — quantities that don’t change if you rotate the whole coordinate frame. Relabeling axes is a rotation, and rotations preserve dot products. This is precisely the property TurboQuant exploits (Ch. 25): it rotates every vector before quantizing, knowing the dot products it ultimately needs survive the rotation untouched.

↳ first stated in §1.1, "rotation-friendly"

The dot product

Now make it run — scalar first, then SIMD

Depth pass — every named intrinsic, taken apart

What fmadd actually computes, lane by lane

That “very end” — the horizontal sum, dismantled

What `fmadd` actually computes, lane by lane