Fast JL via Hadamard (and the loop to TurboQuant)

Section 7.3

Fast JL via Hadamard (and the loop to TurboQuant)

Even the Achlioptas ±1 projection costs O(dk) per vector — one add/subtract per entry. For modern embedding dimensions (d = 4096) and reasonable projected dimensions (k = 256), that’s a million ops per vector. Across a million vectors at index time, that’s 10¹² operations. Ailon & Chazelle (2009) proved you can do better. The construction — Fast Johnson-Lindenstrauss Transform (FJLT) — replaces the dense matmul with three structured pieces: random sign flips, a Walsh-Hadamard transform, and a sparse subsampling. Total cost: O(d \log d) instead of O(dk). The Hadamard transform itself — n \log n butterflies, no multiplies — is the same structured rotation that TurboQuant uses to equalize per-coordinate variance (Ch.5 §3) before quantizing. JL and rotation-based quantization are, at the kernel level, the same algorithm; this section closes that loop.

The cost problem with dense JL

Recall §7.2’s k × d projection. Per input vector x, computing P x costs:

(Px)_i = Σ_{j=1..d} P_{ij} · x_j ← d operations Px = vector of k such sums ← k · d operations total

For an index pipeline that processes N input vectors, the total is O(Nkd). At N = 10⁶, k = 256, d = 4096, that’s 10¹² operations — about an hour on a single CPU core, a few minutes on a GPU. Fine for a one-time index build; painful if you reindex frequently or if d is larger.

The Fast JLT replaces the dense matmul with structured operations totalling O(d \log d) per vector, independent of k. For d = 4096, that’s d \log_2 d = 49,000 ops per vector instead of k · d = 1,048,000 — a 20× speedup.

The Walsh-Hadamard transform — review

The Walsh-Hadamard transform core algorithm A linear transformation H_n : ℝⁿ → ℝⁿ (n a power of 2) defined recursively by H_1 = [1], H_{2m} = [[H_m, H_m], [H_m, -H_m]]. After scaling by 1/√n it's orthogonal (preserves L2 norm and pairwise dot products — Ch.2 §3). Can be computed in O(n log n) by an FFT-like butterfly with NO multiplications in the inner loops (only adds/subtracts). The structured rotation underneath Fast JL (§7.3), TurboQuant (Ch.25), and Hadamard-based hash sketches. was introduced in Ch.2 §3 as one of the practical orthogonal constructions. The recurrence:

H_1 = [ 1 ] H_2 = ┌ 1 1 ┐ └ 1 −1 ┘ H_{2m} = ┌ H_m H_m ┐ └ H_m −H_m ┘ After scaling by 1/√n, H_n is orthogonal: (H_n / √n)ᵀ (H_n / √n) = I.

Applying H_n to a vector takes n \log_2 n operations via a butterfly identical in shape to the FFT. And every operation is an addition or subtraction — no multiplies inside the butterfly, just a single 1/\sqrt n scale per output at the end.

The kernel demonstrates this empirically. For a power-of-2 vector of length n:

void wht(double* x, int n) {
    for (int h = 1; h < n; h *= 2)
        for (int i = 0; i < n; i += h * 2)
            for (int j = i; j < i + h; j++) {
                double a = x[j], b = x[j + h];
                x[j]     = a + b;       // butterfly: sum
                x[j + h] = a - b;       // butterfly: difference
            }
    double inv = 1.0 / sqrt((double)n);
    for (int i = 0; i < n; i++) x[i] *= inv;  // amortized 1/√n scale
}

Confirmation that H_n is its own inverse (up to scaling) and preserves the L2 norm exactly:

Walsh-Hadamard transform — confirms norm preservation
n        ‖x‖           ‖H x‖         ‖H H x‖       max |H H x − x|
8        2.989011      2.989011      2.989011      6.66e-16
64       8.543423      8.543423      8.543423      4.44e-16
1024     32.934639     32.934639     32.934639     8.88e-16
16384    127.659081    127.659081    127.659081    1.33e-15

For n = 16384: 16384 · 14 = 229k adds, 1 multiply per output.
A dense d×d matmul would be 16384² = 268 million ops. 1000× faster.

fast_jl.c (wht) C · Walsh-Hadamard transform in-place

    return sqrt(-2.0 * log(u1)) * cos(2.0 * M_PI * u2);
}

/* In-place Walsh-Hadamard transform on x of length n (n must be a
 * power of 2). Cost: n log₂ n additions/subtractions, NO multiplies
 * (the normalisation is a single multiply per output at the end). */
static void wht(double* x, int n) {
    for (int h = 1; h < n; h *= 2) {
        for (int i = 0; i < n; i += h * 2) {
            for (int j = i; j < i + h; j++) {
                double a = x[j];
                double b = x[j + h];
                x[j]     = a + b;
                x[j + h] = a - b;
            }
        }

— think, then check —

H_1 = [1]; H_{2m} = [[H_m, H_m], [H_m, −H_m]]. After scaling by 1/√n, H_n is orthogonal — preserves L2 norm and dot products.

Cost of applying H_n to a vector: n log₂(n) operations via a butterfly with the same shape as the FFT.

No multiplications needed inside the butterfly because every entry of H_n is ±1 — so the recursion at each level is “add this half to the other half” (top output) and “subtract this half from the other half” (bottom output). Pure additions and subtractions. The single 1/√n scaling at the end is the only multiplication, amortised across all n outputs.

↳ §7.3 WHT definition

Ailon-Chazelle’s Fast JLT

Ailon & Chazelle, “The Fast Johnson-Lindenstrauss Transform and Approximate Nearest Neighbors” (STOC 2006, SICOMP 2009). The construction: instead of a dense random matrix, use a structured one composed of three pieces:

Fast JLT: Φ = P · H · D where: D = diagonal matrix of random ±1 signs (random rotation of basis) H = Walsh-Hadamard transform, n × n (orthogonal mixing) P = sparse subsampling, k × n with few nonzero entries (downsample to k dims)

Each piece does specific work:

Diagonal sign matrix D randomly flips each coordinate of x. Operationally: (Dx)_i = ±x_i with sign chosen independently per coordinate. Cost: O(n) sign flips.
Hadamard transform H orthogonally mixes the (signed) coordinates. Cost: O(n \log n) via the butterfly.
Sparse subsampling P picks k output coordinates from the n-dim Hadamard-rotated vector, with small additional sparse linear combinations to spread the information. Cost: O(k).

Total: O(n \log n + k) per vector. For typical sizes, the n \log n dominates — and at n = 4096, that’s ~50K ops instead of the kn ≈ 10⁶ for dense JL.

The Ailon-Chazelle theorem proves this composition satisfies the same JL distortion bound as the dense Gaussian — with the same k = O(\log N / ε²). The random sign flips do enough “randomization” of the input that the deterministic Hadamard then spreads the energy out evenly across coordinates, after which sparse subsampling preserves distances with high probability.

The constant in the FJLT’s k bound is slightly worse than the dense Gaussian’s — you need maybe 10–20% more projected dimensions to hit the same ε. But the per-vector cost saves an order of magnitude, which is the trade everyone takes at scale.

— think, then check —

Dense Achlioptas ±1: the projection is a k × d matrix of ±1/√k entries. Per vector: (Px)_i = (1/√k) Σⱼ ±x_j requires d = 4096 add/subtracts per output, k = 256 outputs. Total = k · d = ~10⁶ operations per vector.

Fast JL (Ailon-Chazelle): three steps:

D — n random sign flips: 4096 ops.
H — Hadamard butterfly: n log₂ n = 4096 · 12 = 49,152 ops.
P — sparse subsampling to k = 256: ~256 ops.

Total: ~54K ops — about 20× fewer than dense Achlioptas. As d grows, the speedup grows: at d = 16,384, the ratio is k·d / (d log d) = 256 / 14 ≈ 18× — still 18× but at much higher absolute scale.

Production vector-search systems (FAISS-IVF, Yahoo NRT) use Fast JL exactly because it scales to multi-million-vector indexes that dense JL can’t handle in reasonable time.

↳ §7.3 cost analysis

The connection back: JL is rotation-based quantization, viewed from the JL side

Stand the three constructions next to each other:

Construction	What it does	Used in
Orthogonal Q (Ch.2 §3)	Preserves dot products exactly	Pure rotations, basis changes
Random Q (Ch.5 §3)	Approximately equalises per-coord variance + preserves dots	TurboQuant rotation step
Hadamard with random signs (Ch.7 §3)	O(d \log d) orthogonal map; preserves dots; approximately equalises variance	Fast JL + TurboQuant

The Hadamard-with-random-signs structure does both jobs simultaneously. It’s orthogonal (preserves dot products by Ch.2 §3’s invariant). It approximately equalises per-coordinate variance (by Ch.5 §3 / Ch.6 §2’s concentration of measure). And it’s cheap (by §7.3’s O(n \log n) argument). So when TurboQuant says “rotate the data before quantizing,” what it literally does is apply a Hadamard transform with random sign flips — the same construction Ailon-Chazelle introduced for Fast JL.

The full picture, closed:
Ch.2 §3 — orthogonal Q preserves dot products exactly.
Ch.5 §3 — random orthogonal Q equalises per-coordinate variance.
Ch.6 §2 — high-D Gaussians live on a √d shell; smooth functions concentrate.
Ch.6 §3 — random unit vectors are nearly orthogonal in high-D.
Ch.7 §1 — random projection from d to k preserves distances within (1±ε) for k = O(log N / ε²).
Ch.7 §2 — discrete (±1, sparse) projections work as well as Gaussian.
Ch.7 §3 — Hadamard + random signs gives an O(n log n) JL-class projection. ← same construction TurboQuant uses for “rotate before quantize.”

Seven facts. One construction (the random Hadamard rotation). Two production systems (Fast JL, TurboQuant). One conversation thread that started this book.

— think, then check —

What the operation actually does, mechanically: apply random ±1 sign flips to each coordinate, then apply the Walsh-Hadamard transform. The combined map is HD where D is a random ±1 diagonal and H is the orthonormal Hadamard matrix. By Ch.2 §3, HD is orthogonal (product of two orthogonal maps).

As Fast JL (§7.3): Ailon-Chazelle’s construction is exactly P · H · D, where P is a sparse subsampling matrix. The HD part is what does the heavy lifting — it spreads the input’s energy uniformly across all output coordinates so the subsequent subsampling P preserves pairwise distances. Cost: O(n log n) for HD, dominated.

As TurboQuant’s variance-equalisation (Ch.5 §3): TurboQuant wants the per-coordinate variances to be approximately equal so a single int8 quantizer with one global scale works for every coordinate. By Ch.5 §3, applying an orthogonal Q transforms the data’s covariance from Σ to QΣQᵀ. A random Q (HD with random D) approximately diagonalises the covariance and equalises the per-coordinate variances. Cost: same O(n log n).

Why it’s the same construction: both Fast JL and TurboQuant’s rotation step need the same three properties of HD:

Orthogonality (preserves dot products — Ch.2 §3),
“Spreading” — concentrates the rotated coordinates so each has roughly equal variance (Ch.5 §3 + Ch.6 §2’s concentration of measure),
Cheap to compute (O(n log n) via Hadamard butterfly, no multiplies in the inner loop).

HD has all three. So one fast Hadamard rotation is doing two jobs: it’s a Fast JL projection (preparing the data for low-distortion dimension reduction by P) AND TurboQuant’s variance-equalisation step (preparing the data for low-error per-coordinate quantization by an int8 codebook). The convergence of two seemingly different production needs onto the same construction is why this section is the natural closing of Part II — the Hadamard-with-random-signs is the kernel under most of what modern memory-efficient vector search does.

↳ §7.3 + Ch.2 §3 + Ch.5 §3 + Ch.6 §§2-3

END OF CH.7 — Random projections and the Johnson-Lindenstrauss lemma.

§1 (JL statement + the dimension-free claim) · §2 (Achlioptas ±1 and sparse projections) · §3 (Fast JL via Hadamard + the equivalence with TurboQuant’s rotation).

Three sections that close the loop from Ch.6’s concentration of measure to a production-grade construction (the Hadamard-random-sign rotation) used in BOTH Fast JL and TurboQuant. The Ch.7 capstone recall reconstructs the entire seven-fact chain from Ch.2 §3 through here. Coming next: Ch.8 — What “learning” actually is. Loss functions, gradient descent, and the modern optimizer family (SGD, momentum, Adam, AdamW).