Fast JL via Hadamard (and the loop to TurboQuant)
Even the Achlioptas ±1 projection costs O(dk) per vector — one add/subtract per entry. For modern embedding dimensions (d = 4096) and reasonable projected dimensions (k = 256), that’s a million ops per vector. Across a million vectors at index time, that’s 10¹² operations. Ailon & Chazelle (2009) proved you can do better. The construction — Fast Johnson-Lindenstrauss Transform (FJLT) — replaces the dense matmul with three structured pieces: random sign flips, a Walsh-Hadamard transform, and a sparse subsampling. Total cost: O(d \log d) instead of O(dk). The Hadamard transform itself — n \log n butterflies, no multiplies — is the same structured rotation that TurboQuant uses to equalize per-coordinate variance (Ch.5 §3) before quantizing. JL and rotation-based quantization are, at the kernel level, the same algorithm; this section closes that loop.
The cost problem with dense JL
Recall §7.2’s k × d projection. Per input vector x, computing P x costs:
For an index pipeline that processes N input vectors, the total is O(Nkd). At N = 10⁶, k = 256, d = 4096, that’s 10¹² operations — about an hour on a single CPU core, a few minutes on a GPU. Fine for a one-time index build; painful if you reindex frequently or if d is larger.
The Fast JLT replaces the dense matmul with structured operations totalling O(d \log d) per vector, independent of k. For d = 4096, that’s d \log_2 d = 49,000 ops per vector instead of k · d = 1,048,000 — a 20× speedup.
The Walsh-Hadamard transform — review
The Walsh-Hadamard transform was introduced in Ch.2 §3 as one of the practical orthogonal constructions. The recurrence:
Applying H_n to a vector takes n \log_2 n operations via a butterfly identical in shape to the FFT. And every operation is an addition or subtraction — no multiplies inside the butterfly, just a single 1/\sqrt n scale per output at the end.
The kernel demonstrates this empirically. For a power-of-2 vector of length n:
void wht(double* x, int n) {
for (int h = 1; h < n; h *= 2)
for (int i = 0; i < n; i += h * 2)
for (int j = i; j < i + h; j++) {
double a = x[j], b = x[j + h];
x[j] = a + b; // butterfly: sum
x[j + h] = a - b; // butterfly: difference
}
double inv = 1.0 / sqrt((double)n);
for (int i = 0; i < n; i++) x[i] *= inv; // amortized 1/√n scale
}
Confirmation that H_n is its own inverse (up to scaling) and preserves the L2 norm exactly:
Walsh-Hadamard transform — confirms norm preservation
n ‖x‖ ‖H x‖ ‖H H x‖ max |H H x − x|
8 2.989011 2.989011 2.989011 6.66e-16
64 8.543423 8.543423 8.543423 4.44e-16
1024 32.934639 32.934639 32.934639 8.88e-16
16384 127.659081 127.659081 127.659081 1.33e-15
For n = 16384: 16384 · 14 = 229k adds, 1 multiply per output.
A dense d×d matmul would be 16384² = 268 million ops. 1000× faster.
return sqrt(-2.0 * log(u1)) * cos(2.0 * M_PI * u2);
}
/* In-place Walsh-Hadamard transform on x of length n (n must be a
* power of 2). Cost: n log₂ n additions/subtractions, NO multiplies
* (the normalisation is a single multiply per output at the end). */
static void wht(double* x, int n) {
for (int h = 1; h < n; h *= 2) {
for (int i = 0; i < n; i += h * 2) {
for (int j = i; j < i + h; j++) {
double a = x[j];
double b = x[j + h];
x[j] = a + b;
x[j + h] = a - b;
}
}H_1 = [1]; H_{2m} = [[H_m, H_m], [H_m, −H_m]]. After scaling by 1/√n, H_n is orthogonal — preserves L2 norm and dot products.
Cost of applying H_n to a vector: n log₂(n) operations via a butterfly with the same shape as the FFT.
No multiplications needed inside the butterfly because every entry of H_n is ±1 — so the recursion at each level is “add this half to the other half” (top output) and “subtract this half from the other half” (bottom output). Pure additions and subtractions. The single 1/√n scaling at the end is the only multiplication, amortised across all n outputs.
Ailon-Chazelle’s Fast JLT
Ailon & Chazelle, “The Fast Johnson-Lindenstrauss Transform and Approximate Nearest Neighbors” (STOC 2006, SICOMP 2009). The construction: instead of a dense random matrix, use a structured one composed of three pieces:
Each piece does specific work:
- Diagonal sign matrix D randomly flips each coordinate of x. Operationally: (Dx)_i = ±x_i with sign chosen independently per coordinate. Cost: O(n) sign flips.
- Hadamard transform H orthogonally mixes the (signed) coordinates. Cost: O(n \log n) via the butterfly.
- Sparse subsampling P picks k output coordinates from the n-dim Hadamard-rotated vector, with small additional sparse linear combinations to spread the information. Cost: O(k).
Total: O(n \log n + k) per vector. For typical sizes, the n \log n dominates — and at n = 4096, that’s ~50K ops instead of the kn ≈ 10⁶ for dense JL.
The Ailon-Chazelle theorem proves this composition satisfies the same JL distortion bound as the dense Gaussian — with the same k = O(\log N / ε²). The random sign flips do enough “randomization” of the input that the deterministic Hadamard then spreads the energy out evenly across coordinates, after which sparse subsampling preserves distances with high probability.
The constant in the FJLT’s k bound is slightly worse than the dense Gaussian’s — you need maybe 10–20% more projected dimensions to hit the same ε. But the per-vector cost saves an order of magnitude, which is the trade everyone takes at scale.
Dense Achlioptas ±1: the projection is a k × d matrix of ±1/√k entries. Per vector: (Px)_i = (1/√k) Σⱼ ±x_j requires d = 4096 add/subtracts per output, k = 256 outputs. Total = k · d = ~10⁶ operations per vector.
Fast JL (Ailon-Chazelle): three steps:
- D — n random sign flips: 4096 ops.
- H — Hadamard butterfly: n log₂ n = 4096 · 12 = 49,152 ops.
- P — sparse subsampling to k = 256: ~256 ops.
Total: ~54K ops — about 20× fewer than dense Achlioptas. As d grows, the speedup grows: at d = 16,384, the ratio is k·d / (d log d) = 256 / 14 ≈ 18× — still 18× but at much higher absolute scale.
Production vector-search systems (FAISS-IVF, Yahoo NRT) use Fast JL exactly because it scales to multi-million-vector indexes that dense JL can’t handle in reasonable time.
The connection back: JL is rotation-based quantization, viewed from the JL side
Stand the three constructions next to each other:
| Construction | What it does | Used in |
|---|---|---|
| Orthogonal Q (Ch.2 §3) | Preserves dot products exactly | Pure rotations, basis changes |
| Random Q (Ch.5 §3) | Approximately equalises per-coord variance + preserves dots | TurboQuant rotation step |
| Hadamard with random signs (Ch.7 §3) | O(d \log d) orthogonal map; preserves dots; approximately equalises variance | Fast JL + TurboQuant |
The Hadamard-with-random-signs structure does both jobs simultaneously. It’s orthogonal (preserves dot products by Ch.2 §3’s invariant). It approximately equalises per-coordinate variance (by Ch.5 §3 / Ch.6 §2’s concentration of measure). And it’s cheap (by §7.3’s O(n \log n) argument). So when TurboQuant says “rotate the data before quantizing,” what it literally does is apply a Hadamard transform with random sign flips — the same construction Ailon-Chazelle introduced for Fast JL.
The full picture, closed:
Ch.2 §3 — orthogonal Q preserves dot products exactly.
Ch.5 §3 — random orthogonal Q equalises per-coordinate variance.
Ch.6 §2 — high-D Gaussians live on a √d shell; smooth functions concentrate.
Ch.6 §3 — random unit vectors are nearly orthogonal in high-D.
Ch.7 §1 — random projection from d to k preserves distances within (1±ε) for k = O(log N / ε²).
Ch.7 §2 — discrete (±1, sparse) projections work as well as Gaussian.
Ch.7 §3 — Hadamard + random signs gives an O(n log n) JL-class projection. ← same construction TurboQuant uses for “rotate before quantize.”
Seven facts. One construction (the random Hadamard rotation). Two production systems (Fast JL, TurboQuant). One conversation thread that started this book.
What the operation actually does, mechanically: apply random ±1 sign flips to each coordinate, then apply the Walsh-Hadamard transform. The combined map is HD where D is a random ±1 diagonal and H is the orthonormal Hadamard matrix. By Ch.2 §3, HD is orthogonal (product of two orthogonal maps).
As Fast JL (§7.3): Ailon-Chazelle’s construction is exactly P · H · D, where P is a sparse subsampling matrix. The HD part is what does the heavy lifting — it spreads the input’s energy uniformly across all output coordinates so the subsequent subsampling P preserves pairwise distances. Cost: O(n log n) for HD, dominated.
As TurboQuant’s variance-equalisation (Ch.5 §3): TurboQuant wants the per-coordinate variances to be approximately equal so a single int8 quantizer with one global scale works for every coordinate. By Ch.5 §3, applying an orthogonal Q transforms the data’s covariance from Σ to QΣQᵀ. A random Q (HD with random D) approximately diagonalises the covariance and equalises the per-coordinate variances. Cost: same O(n log n).
Why it’s the same construction: both Fast JL and TurboQuant’s rotation step need the same three properties of HD:
- Orthogonality (preserves dot products — Ch.2 §3),
- “Spreading” — concentrates the rotated coordinates so each has roughly equal variance (Ch.5 §3 + Ch.6 §2’s concentration of measure),
- Cheap to compute (O(n log n) via Hadamard butterfly, no multiplies in the inner loop).
HD has all three. So one fast Hadamard rotation is doing two jobs: it’s a Fast JL projection (preparing the data for low-distortion dimension reduction by P) AND TurboQuant’s variance-equalisation step (preparing the data for low-error per-coordinate quantization by an int8 codebook). The convergence of two seemingly different production needs onto the same construction is why this section is the natural closing of Part II — the Hadamard-with-random-signs is the kernel under most of what modern memory-efficient vector search does.
END OF CH.7 — Random projections and the Johnson-Lindenstrauss lemma.
§1 (JL statement + the dimension-free claim) · §2 (Achlioptas ±1 and sparse projections) · §3 (Fast JL via Hadamard + the equivalence with TurboQuant’s rotation).
Three sections that close the loop from Ch.6’s concentration of measure to a production-grade construction (the Hadamard-random-sign rotation) used in BOTH Fast JL and TurboQuant. The Ch.7 capstone recall reconstructs the entire seven-fact chain from Ch.2 §3 through here. Coming next: Ch.8 — What “learning” actually is. Loss functions, gradient descent, and the modern optimizer family (SGD, momentum, Adam, AdamW).