Norms & distance
A norm answers “how long is this vector?” — and the surprise for returning engineers is that there’s more than one defensible answer. The dot product already handed us one: a vector’s length is the square root of its dot product with itself.
That’s the L2 norm — ordinary straight-line length, the one your geometry intuition already holds. But it’s a member of a family, the Lp norms:
Distance is the norm of a difference. The Euclidean distance between two points is ‖a − b‖₂, and here’s the identity that ties this whole chapter together — distance, expanded, is built entirely from dot products and norms:
Stare at that for a second, because it’s not just algebra — it’s an engineering result. If you’ve precomputed and stored ‖a‖² and ‖b‖², then Euclidean distance costs you exactly one dot product plus two lookups. This is precisely the trick that lets TurboQuant store one scalar norm per vector and reconstruct full L2 distance from compressed codes.
Why TurboQuant supports L2 and dot but not L1: rotation preserves ‖·‖₂ exactly (it’s built from the rotation-invariant dot product) but mangles ‖·‖₁. The norm you pick decides whether you’re allowed to rotate — a §1.3 fact with Ch. 25 consequences.
The norm, vectorized
An L2 norm is just a dot product of a vector with itself, then a square root — so we already have the hard part. The squared norm reuses the FMA loop verbatim. Here is the AVX version straight from the compiled source tree:
__m256 acc = _mm256_setzero_ps();
int i = 0;
for (; i + 8 <= n; i += 8) {
__m256 va = _mm256_loadu_ps(a + i);
acc = _mm256_fmadd_ps(va, va, acc); /* square each lane */
}
__m128 lo = _mm256_castps256_ps128(acc);
__m128 hi = _mm256_extractf128_ps(acc, 1);
lo = _mm_add_ps(lo, hi);
lo = _mm_hadd_ps(lo, lo);
lo = _mm_hadd_ps(lo, lo);
float s = _mm_cvtss_f32(lo);
for (; i < n; i++) s += a[i] * a[i];
return s; /* caller does sqrtf() iff needed */
}
And the NEON twin running natively on this Apple Silicon machine — half the lane width, but vaddvq_f32 collapses 4 lanes to 1 scalar in a single instruction instead of the x86 dance:
float sqnorm_neon(const float* a, int n) {
float32x4_t acc = vdupq_n_f32(0.0f);
int i = 0;
for (; i + 4 <= n; i += 4) {
float32x4_t va = vld1q_f32(a + i);
acc = vfmaq_f32(acc, va, va);
}
float s = vaddvq_f32(acc);
for (; i < n; i++) s += a[i] * a[i];
return s;
}The “skip the sqrt” note is a real production technique, not a micro-optimization curiosity. Nearest-neighbor search compares distances; it never needs their absolute values. Since √ is monotonic, ‖x‖² < ‖y‖² iff ‖x‖ < ‖y‖ — so HNSW and friends rank candidates on squared distance and pay for the square root zero times. Recognizing when an expensive monotonic step can be dropped is exactly the systems instinct this book wants to connect to the math.
Expand: ‖q−v‖² = ‖q‖² + ‖v‖² − 2⟨q,v⟩. The term ‖q‖² is the same for every v, so it doesn’t affect ranking — drop it. ‖v‖² is precomputed and stored. So per database vector you compute exactly one dot product ⟨q,v⟩ (your dot_avx kernel), then rank on ‖v‖² − 2⟨q,v⟩. No square roots, no recomputed norms. That single fused-multiply-add loop, run across millions of stored vectors, is the hot path of vector search — and when those v are compressed, it becomes the TurboQuant scoring kernel from our conversation. Everything in Ch. 25 is this recall answer, optimized.