The SIMD width ladder
You have the AVX2 dot product. The kernel you’ll write throughout this book, in ISA after ISA, is structurally the same loop. Only three things change: how many lanes fit in a register, what the intrinsics are called, and how the horizontal-sum tail collapses. Once you can see the invariant, porting is mechanical.
The widths, side by side
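Every SIMD extension is a contract about register width and what types pack into it. For 32-bit floats, a 128-bit register (SSE, NEON) holds 4 lanes, a 256-bit register (AVX2) holds 8, and a 512-bit register (AVX-512) holds 16.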
Lane count drives the per-iteration step (8 in AVX2, 4 in NEON, 16 in AVX-512). Width drives memory throughput. Otherwise the loop is the same.
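The arithmetic behind that step size is small enough to write down. The sketch below is only an illustration: the helper names lanes, full_chunks and tail_len are hypothetical and not part of the book's code.

```c
#include <stddef.h>

/* Lanes per register = register width in bits / lane width in bits. */
size_t lanes(size_t register_bits, size_t lane_bits) {
    return register_bits / lane_bits;
}

/* How many full-width iterations the main loop runs for n elements. */
size_t full_chunks(size_t n, size_t step) {
    return n / step;
}

/* How many elements are left over for the scalar tail loop. */
size_t tail_len(size_t n, size_t step) {
    return n % step;
}
```

For n = 1003 floats, an AVX2 loop (step 8) runs 125 full iterations and leaves 3 elements for the tail, an AVX-512 loop (step 16) runs 62 and leaves 11, and a NEON loop (step 4) runs 250 and leaves 3.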
Naming, four ways
The intrinsic for “load N floats from memory” already exists for each width. The grammar is rigid, so once you see the pattern you can predict the rest.
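Cross-ISA load intrinsics for “load some floats”:
- _mm_loadu_ps(p): 4 floats, 128-bit
- _mm256_loadu_ps(p): 8 floats, 256-bit
- _mm512_loadu_ps(p): 16 floats, 512-bit
- vld1q_f32(p): 4 floats, 128-bit (NEON)
Intel intrinsics share the _mmN_ prefix and the ps (“packed single”) suffix; the only thing that varies is the width number, which is omitted for 128-bit. NEON drops the prefix convention entirely: v = vector, ld1 = load one register, q = “Q register” (128-bit), f32 = lane type. Different vocabulary, same idea.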
The FMA equivalences:
- _mm_fmadd_ps(a, b, c) → 4× aᵢbᵢ + cᵢ
- _mm256_fmadd_ps(a, b, c) → 8× aᵢbᵢ + cᵢ (the one in §1.2)
- _mm512_fmadd_ps(a, b, c) → 16× aᵢbᵢ + cᵢ
- vfmaq_f32(c, a, b) → 4× cᵢ + aᵢbᵢ (note the argument order: accumulator first)
ARM’s argument order genuinely trips people up: NEON’s vfmaq_f32(acc, a, b) puts the accumulator first, the operands second. Intel’s _mm256_fmadd_ps(a, b, c) puts the accumulator last. Same instruction conceptually; reading the docs once is the only fix.
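One way to stop thinking about the argument order is to hide it behind a wrapper with a fixed convention. The following is a minimal sketch, assuming a 4-lane helper called madd4 (a hypothetical name) that always takes the accumulator first. It uses the real intrinsics when the compiler advertises them and a plain scalar loop otherwise.

```c
#if defined(__aarch64__)
#include <arm_neon.h>
#elif defined(__FMA__)
#include <immintrin.h>
#endif

/* acc[i] += a[i] * b[i] for 4 lanes. Accumulator is always the first argument. */
void madd4(float acc[4], const float a[4], const float b[4]) {
#if defined(__aarch64__)
    /* NEON: the accumulator is the FIRST argument of vfmaq_f32. */
    vst1q_f32(acc, vfmaq_f32(vld1q_f32(acc), vld1q_f32(a), vld1q_f32(b)));
#elif defined(__FMA__)
    /* x86 FMA: the accumulator is the LAST argument of _mm_fmadd_ps. */
    _mm_storeu_ps(acc, _mm_fmadd_ps(_mm_loadu_ps(a), _mm_loadu_ps(b),
                                    _mm_loadu_ps(acc)));
#else
    /* Scalar fallback: same result shape, but not guaranteed to be fused. */
    for (int i = 0; i < 4; i++)
        acc[i] += a[i] * b[i];
#endif
}
```

With acc = {1, 1, 1, 1}, a = {1, 2, 3, 4} and b = {2, 2, 2, 2}, every path leaves acc = {3, 5, 7, 9}. The swap between the two intrinsic calls is the whole point of the wrapper.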
The horizontal sum, four ways
This is where the ISAs diverge most visibly — and it’s the most useful contrast in the chapter, because it tells you what each ISA thinks is hard.
- AVX2: _mm256_extractf128_ps + _mm_add_ps + two _mm_hadd_ps + _mm_cvtss_f32. Seven instructions, three of them just to "talk between lanes." We dismantled this in §1.2.
- AVX-512: _mm512_reduce_add_ps(v). One intrinsic. It is a convenience that the compiler expands into a shuffle-and-add sequence rather than a single hardware instruction, but the ladder no longer lives in your source.
- NEON (AArch64): vaddvq_f32(v). One intrinsic, available since AArch64 launched.
- SVE: svaddv_f32(pg, v). Predicated reduction over the active lanes of a vector-length-agnostic register.
The lesson worth keeping: a feature being clumsy in one ISA is often a hint about when it was designed. AVX2 (2013) left horizontal reduction to hand-written shuffles; the AVX-512 intrinsics (2017) and AArch64 NEON (~2011) both expose it as a single call. Reading the instruction set is reading the historical priorities of the chip designers.
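To see what the AVX2 ladder actually does to the lanes, it helps to replay it in plain C. The sketch below is a scalar model only, not SIMD code: hadd_self and hsum8_model are hypothetical names, and each step mirrors one intrinsic from the AVX2 sequence.

```c
/* Scalar model of _mm_hadd_ps(x, x): result is {x0+x1, x2+x3, x0+x1, x2+x3}. */
void hadd_self(float x[4]) {
    float s01 = x[0] + x[1];
    float s23 = x[2] + x[3];
    x[0] = s01;
    x[1] = s23;
    x[2] = s01;
    x[3] = s23;
}

/* Scalar model of the AVX2 reduction ladder: 8 lanes -> 1 float. */
float hsum8_model(const float acc[8]) {
    float lo[4];
    /* extractf128 + add_ps: fold the high half onto the low half,
       giving {0+4, 1+5, 2+6, 3+7}. */
    for (int k = 0; k < 4; k++)
        lo[k] = acc[k] + acc[k + 4];
    hadd_self(lo); /* lanes 0 and 1 now hold the two partial sums */
    hadd_self(lo); /* lane 0 now holds the total */
    return lo[0];  /* cvtss_f32: read lane 0 as a plain float */
}
```

For lanes {1, 2, 3, 4, 5, 6, 7, 8} the intermediate states are {6, 8, 10, 12}, then {14, 22, 14, 22}, then 36 in lane 0. Note that the additions happen in a different order than a left-to-right scalar sum, which is why SIMD and scalar results can differ in the last bits for general inputs.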
Look at the NEON twin you already ran
You already compiled and ran this in the test from §1.3. Here it is again, laid out to mirror the AVX2 kernel from §1.2 so the invariants are visible. Read the two in alternation.
/* AVX2 + FMA (x86-64) */
#include <immintrin.h>

float dot(const float *a, const float *b, int n) {
    __m256 acc = _mm256_setzero_ps();              /* 8 float lanes, all 0.0 — running sums */
    int i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);        /* 8 floats of a → register */
        __m256 vb = _mm256_loadu_ps(b + i);        /* 8 floats of b → register */
        acc = _mm256_fmadd_ps(va, vb, acc);        /* 8× (aᵢ·bᵢ + accᵢ), one instr, one rounding */
    }
    /* ---- horizontal reduction: 8 lanes → 1 scalar ---- */
    __m128 lo = _mm256_castps256_ps128(acc);       /* lanes 0–3 (free reinterpret) */
    __m128 hi = _mm256_extractf128_ps(acc, 1);     /* lanes 4–7 */
    lo = _mm_add_ps(lo, hi);                       /* {0+4, 1+5, 2+6, 3+7} → 4 sums */
    lo = _mm_hadd_ps(lo, lo);                      /* pairwise → 2 sums */
    lo = _mm_hadd_ps(lo, lo);                      /* pairwise → 1 sum in lane 0 */
    float acc8 = _mm_cvtss_f32(lo);                /* lane 0 → plain float */
    for (; i < n; i++) acc8 += a[i] * b[i];        /* tail when n not divisible by 8 */
    return acc8;
}
/* NEON (AArch64) */
#include <arm_neon.h>

float dot(const float *a, const float *b, int n) {
    float32x4_t acc = vdupq_n_f32(0.0f);           /* 4 float lanes, all 0.0 */
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);         /* 4 floats of a → register */
        float32x4_t vb = vld1q_f32(b + i);         /* 4 floats of b → register */
        acc = vfmaq_f32(acc, va, vb);              /* acc += va * vb, fused */
    }
    float s = vaddvq_f32(acc);                     /* 4 lanes → 1 scalar, one intrinsic */
    for (; i < n; i++) s += a[i] * b[i];           /* tail when n not divisible by 4 */
    return s;
}

Same shape: zero the accumulator, loop in width-sized chunks, fused multiply-add, reduce, handle the tail. The kernel is portable; only the dialect changes. That is the practical pay-off of having read §§1.1–1.3 carefully: the invariant operation (a dot product) is now a thing you recognize underneath any ISA dialect it’s dressed in.
The portability claim, sharpened. A “portable dot product” in practice means one C source per ISA, each exposing the same signature, float dot(const float*, const float*, int). The build system picks the right object at link time (the Makefile in this book uses uname -m). You don’t write one source that compiles to all ISAs; you write one shape you can re-emit per ISA, and a tiny header that resolves the right symbol. That’s how production vector-search and ML libraries (Faiss, hnswlib, Qdrant) ship.
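The selection step can be sketched in a single file if you let the preprocessor stand in for the Makefile. This is an illustration under that assumption only: in the layout described above, each branch would live in its own source file and the build would choose one, whereas here the compiler's predefined macros (__AVX2__, __FMA__, __aarch64__) make the choice. The scalar branch is an added fallback so the sketch compiles anywhere.

```c
/* What a shared header would declare: one signature for every ISA. */
float dot(const float *a, const float *b, int n);

#if defined(__AVX2__) && defined(__FMA__)
/* Stand-in for the AVX2 source file. */
#include <immintrin.h>
float dot(const float *a, const float *b, int n) {
    __m256 acc = _mm256_setzero_ps();
    int i = 0;
    for (; i + 8 <= n; i += 8)
        acc = _mm256_fmadd_ps(_mm256_loadu_ps(a + i), _mm256_loadu_ps(b + i), acc);
    __m128 lo = _mm_add_ps(_mm256_castps256_ps128(acc),
                           _mm256_extractf128_ps(acc, 1));
    lo = _mm_hadd_ps(lo, lo);
    lo = _mm_hadd_ps(lo, lo);
    float s = _mm_cvtss_f32(lo);
    for (; i < n; i++) s += a[i] * b[i];   /* scalar tail */
    return s;
}
#elif defined(__aarch64__)
/* Stand-in for the NEON source file. */
#include <arm_neon.h>
float dot(const float *a, const float *b, int n) {
    float32x4_t acc = vdupq_n_f32(0.0f);
    int i = 0;
    for (; i + 4 <= n; i += 4)
        acc = vfmaq_f32(acc, vld1q_f32(a + i), vld1q_f32(b + i));
    float s = vaddvq_f32(acc);
    for (; i < n; i++) s += a[i] * b[i];   /* scalar tail */
    return s;
}
#else
/* Scalar fallback, used when neither ISA is enabled at compile time. */
float dot(const float *a, const float *b, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; i++) s += a[i] * b[i];
    return s;
}
#endif
```

Callers see only the declaration at the top. Calling dot on a = {1, 2, ..., 10} and b = all ones with n = 10 returns 55 on every path, and it exercises the tail loop on both SIMD branches because 10 is not a multiple of 4 or 8.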
What we have at the end of Chapter 1
- A vector seen three ways (algebraic, geometric, SIMD lanes) so no later use of the word can confuse you.
- A dot product whose meaning (similarity, projection) and whose kernel (FMA over wide lanes + a horizontal reduction) are both yours.
- A distance-from-dot-product identity that reduces L2 distance to one dot product plus a stored scalar — the math underneath every approximate nearest-neighbor system.
- Four ISAs read as the same loop, with intrinsic names you can now read as sentences.
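The distance identity in the third bullet is ‖a − b‖² = ‖a‖² + ‖b‖² − 2(a·b). A minimal sketch of how it is typically used follows, assuming the squared norms are computed once and stored alongside each vector. The names dot_ref, norm_sq and l2sq_from_dot are hypothetical, and dot_ref is a scalar stand-in for whichever SIMD kernel you have.

```c
/* Scalar stand-in for the SIMD dot product kernel. */
float dot_ref(const float *a, const float *b, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; i++) s += a[i] * b[i];
    return s;
}

/* Squared norm: computed once per vector and stored next to it. */
float norm_sq(const float *v, int n) {
    return dot_ref(v, v, n);
}

/* Squared L2 distance from stored squared norms plus one dot product:
   ||a - b||^2 = ||a||^2 + ||b||^2 - 2 (a . b). */
float l2sq_from_dot(float a_norm_sq, float b_norm_sq, float a_dot_b) {
    return a_norm_sq + b_norm_sq - 2.0f * a_dot_b;
}
```

For a = {1, 2, 3} and b = {4, 6, 8}, the stored norms are 14 and 116 and the dot product is 40, so the identity gives 14 + 116 − 80 = 50, which matches the direct sum 3² + 4² + 5². At query time only the dot product is new work per candidate.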
Everything in Part I builds on this. Ch. 2 lifts the dot product into matrices (composition of transformations, the three independent axes you intuited around FlashAttention). Ch. 3 takes the kernel to integers and lossy storage. By Ch. 13, the same _mm256_fmadd_ps shape will be the inner loop of attention: there will be no new operation, just new things to multiply.
(1) Lane count per loop iteration: 8 on AVX2, 16 on AVX-512, 4 on NEON — set by register width ÷ lane width. The chunk size in the for loop tracks this.
(2) Intrinsic vocabulary: _mm256_* vs _mm512_* vs the v*q_*32 NEON family. Same operations, different names.
(3) The horizontal-sum tail: a 7-instruction shuffle ladder on AVX2, one _mm512_reduce_add_ps on AVX-512, one vaddvq_f32 on NEON. The hardware’s opinion of how common reductions are shows up here.
Invariant: accumulate in wide lanes, reduce once, handle the tail. That shape is portable; the dialect isn’t.
Next: Chapter 2 — Matrices as transformations.