The SIMD width ladder

Section 1.4

You have the AVX2 dot product. The kernel you’ll write across this book in ISA after ISA is structurally the same loop. The only things that change are how wide the lanes are, what the intrinsics are called, and how the horizontal-sum tail collapses. Once you can see the invariant, porting is mechanical.

The widths, side by side

Every SIMD extension is a contract about register width and what types pack into it. The picture for floats:

SSE

__m128 · 128 bits · 4 float lanes · since 1999. The original.

AVX

__m256 · 256 bits · 8 float lanes · since 2011 (Sandy Bridge). What we just used.

AVX-512

__m512 · 512 bits · 16 float lanes · server x86 (Skylake-X, Sapphire Rapids). Adds mask registers for predication.

NEON

float32x4_t · 128 bits · 4 float lanes · all 64-bit ARM. Apple Silicon, mobile, AWS Graviton.

SVE

svfloat32_t · length-agnostic · code that runs on whatever width the chip has (128–2048 bits). The ARM answer to AVX-512.

Lane count drives the per-iteration step (8 in AVX2, 4 in NEON, 16 in AVX-512). Width drives memory throughput. Otherwise the loop is the same.

Naming, four ways

The intrinsic for “load N floats from memory” already exists for each width. The grammar is rigid, so once you see the pattern you can predict the rest.

decoding · cross-ISA load intrinsics for 'load some floats'

_mm_loadu_ps

SSE · 4 floats

_mm256_loadu_ps

AVX · 8 floats

_mm512_loadu_ps

AVX-512 · 16 floats

vld1q_f32

NEON · 4 floats

svld1_f32

SVE · whatever width

Intel intrinsics share the mmN prefix and the ps (“packed single”) suffix; the only thing that varies is the width number. NEON drops the prefix convention entirely — v = vector, ld1 = load one register, q = “Q register” (128-bit), f32 = lane type. Different vocabulary, same idea.

The FMA equivalences:

fused multiply-add across ISAs

SSE+FMA

_mm_fmadd_ps(a, b, c) → 4× aᵢbᵢ + cᵢ

AVX2+FMA

_mm256_fmadd_ps(a, b, c) → 8× aᵢbᵢ + cᵢ (the one in §1.2)

AVX-512

_mm512_fmadd_ps(a, b, c) → 16× aᵢbᵢ + cᵢ

NEON

vfmaq_f32(c, a, b) → 4× cᵢ + aᵢbᵢ (note the argument order — accumulator first)

ARM’s argument order genuinely trips people up: NEON’s vfmaq_f32(acc, a, b) puts the accumulator first, the operands second. Intel’s _mm256_fmadd_ps(a, b, c) puts the accumulator last. Same instruction conceptually; reading the docs once is the only fix.

The horizontal sum, four ways

This is where the ISAs diverge most visibly — and it’s the most useful contrast in the chapter, because it tells you what each ISA thinks is hard.

reducing N lanes → 1 scalar

SSE / AVX2

Manual ladder: _mm256_extractf128_ps + _mm_add_ps + two _mm_hadd_ps + _mm_cvtss_f32. Seven instructions, three of them just to "talk between lanes." We dismantled this in §1.2.

AVX-512

_mm512_reduce_add_ps(v). One intrinsic. The hardware finally took the hint — wide reductions are common enough to deserve a real instruction.

NEON

vaddvq_f32(v). One instruction. ARM nailed this from the start.

SVE

svaddv_f32(pg, v). Predicated reduction over the active lanes of a vector-length-agnostic register.

The lesson worth keeping: a feature being clumsy in one ISA is often a hint about when it was designed. AVX2 (2013) treated horizontal reduction as a niche curiosity; AVX-512 (2017) and NEON (~2011) both treated it as first-class. Reading the instruction set is reading the historical priorities of the chip designers.

Look at the NEON twin you already ran

You have already compiled and run this in the test from §1.3 — here it is again, side-by-side-shaped with the AVX kernel from §1.2 so the invariants are visible. Read them in alternation.

dot_avx — 8 lanes, manual hsum AVX2 + FMA

    __m256 acc = _mm256_setzero_ps();   /* 8 float lanes, all 0.0 — running sums */
    int i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i); /* 8 floats of a → register */
        __m256 vb = _mm256_loadu_ps(b + i); /* 8 floats of b → register */
        acc = _mm256_fmadd_ps(va, vb, acc); /* 8× (aᵢ·bᵢ + accᵢ), one instr, one rounding */
    }
    /* ---- horizontal reduction: 8 lanes → 1 scalar ---- */
    __m128 lo = _mm256_castps256_ps128(acc);   /* lanes 0–3 (free reinterpret) */
    __m128 hi = _mm256_extractf128_ps(acc, 1); /* lanes 4–7 */
    lo = _mm_add_ps(lo, hi);                   /* {0+4, 1+5, 2+6, 3+7} → 4 sums */
    lo = _mm_hadd_ps(lo, lo);                  /* pairwise → 2 sums */
    lo = _mm_hadd_ps(lo, lo);                  /* pairwise → 1 sum in lane 0 */
    float acc8 = _mm_cvtss_f32(lo);            /* lane 0 → plain float */

    for (; i < n; i++) acc8 += a[i] * b[i];    /* tail when n not divisible by 8 */
    return acc8;
}

dot_neon — 4 lanes, vaddvq_f32 hsum NEON · arm64

    float32x4_t acc = vdupq_n_f32(0.0f);   /* 4 float lanes, all 0.0 */
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);
        float32x4_t vb = vld1q_f32(b + i);
        acc = vfmaq_f32(acc, va, vb);      /* acc += va * vb, fused */
    }
    float s = vaddvq_f32(acc);             /* 4 lanes → 1 scalar, one instr */
    for (; i < n; i++) s += a[i] * b[i];
    return s;
}

Same shape: zero the accumulator, loop in width-sized chunks, fused multiply-add, reduce, handle the tail. The kernel is portable; only the dialect changes. That is the practical pay-off of having read §§1.1–1.3 carefully — the invariant operation (a dot product) is now a thing you recognize underneath any ISA dialect it’s dressed in.

The portability claim, sharpened. A “portable dot product” in practice means: one C source per ISA, all four calling the same higher-level signature float dot(const float*, const float*, int). The build system picks the right object at link time (the Makefile in this book uses uname -m). You don’t write one source that compiles to all ISAs — you write one shape you can re-emit per ISA, and a tiny header that resolves the right symbol. That’s how production vector-search and ML libraries (Faiss, hnswlib, Qdrant) ship.

What we have at the end of Chapter 1

A vector seen three ways (algebraic, geometric, SIMD lanes) so no later use of the word can confuse you.
A dot product whose meaning (similarity, projection) and whose kernel (FMA over wide lanes + a horizontal reduction) are both yours.
A distance-from-dot-product identity that reduces L2 distance to one dot product plus a stored scalar — the math underneath every approximate nearest-neighbor system.
Four ISAs read as the same loop, taught how to read the names as sentences.

Everything Part I builds on this. Ch. 2 lifts the dot product into matrices (composition of transformations, the three independent axes you intuited around FlashAttention). Ch. 3 takes the kernel to integers and lossy storage. By Ch. 13, the same _mm256_fmadd_ps shape will be the inner loop of attention — there will be no new operation, just new things to multiply.

— think, then check —

(1) Lane count per loop iteration: 8 on AVX2, 16 on AVX-512, 4 on NEON — set by register width ÷ lane width. The chunk size in the for loop tracks this. (2) Intrinsic vocabulary: _mm256_* vs _mm512_* vs the v*q_*32 NEON family. Same operations, different names. (3) The horizontal-sum tail: a 7-instruction shuffle ladder on AVX2, one _mm512_reduce_add_ps on AVX-512, one vaddvq_f32 on NEON. The hardware’s opinion of how common reductions are shows up here.

Invariant: accumulate in wide lanes, reduce once, handle the tail. That shape is portable; the dialect isn’t.

↳ §1.4 · cross-ISA invariants

END OF CHAPTER 1, §§1–4 — first chapter complete in the Astro build. Live in this build: hover-card term annotations (→ glossary rail, top-right) · then→now terminology callouts · four interactive visualizations · real AVX2/NEON kernels compiled by make -C code as a prebuild step · decay-timed recall injections (§1 priming → §2 decay check → §3 synthesis → §4 cross-ISA generalization).

Next: Chapter 2 — Matrices as transformations.