The Johnson–Lindenstrauss lemma

Section 7.1

The Johnson–Lindenstrauss lemma

Ch.6 §1 told the bad story: in high dimensions, distance is broken. Ch.6 §2 told the good story: high-dimensional Gaussians live on a thin shell, and smooth functions of many independent variables concentrate. Now we extract the most useful constructive consequence of the good story: you can take any high-D dataset, project it down through a random linear map to a much lower dimension, and pairwise distances are approximately preserved. The number of projected dimensions you need scales with log N (number of points) and 1/ε² (precision) — but is independent of the original dimension d. That dimension-free claim, due to Johnson & Lindenstrauss 1984, is the entire reason random-projection methods exist. Vector search (Ch.25), QJL (the JL-corrected TurboQuant variant), and many other compression schemes in modern ML are direct applications.

The statement

The JL lemma, in its modern (Achlioptas) constructive form:

Theorem (Johnson–Lindenstrauss, 1984; Achlioptas, 2003 constructive) For any N points in ℝᵈ and any tolerance 0 < ε < 1, there is a linear map P : ℝᵈ → ℝᵏ with k = O( log N / ε² ) such that for every pair of points x_i, x_j: (1 − ε) ‖x_i − x_j‖² ≤ ‖P x_i − P x_j‖² ≤ (1 + ε) ‖x_i − x_j‖² In particular, P can be a random matrix with entries P_ij ~ N(0, 1/k) or P_ij ∈ {±1/√k} (Achlioptas), and the bound holds with probability ≥ 1 − N⁻² (provably high — just pick a random P).

Read the conclusion line out loud: the squared distance between any pair of original points, scaled into the projected space, is within a factor of (1 ± ε) of the original squared distance. No distortion exceeds ε. The projection is near-isometric on the dataset.

Three things to register about that k = O(log N / ε²) formula:

It depends on N, not d. A million points in 10,000 dimensions can be projected to k ≈ 14 / ε² dimensions (since \log 10⁶ ≈ 14). Drop ε to 0.1 and you need ~1,400 projected dims. Even one billion points needs only ~2,100 dims at ε = 0.1. The original d is completely irrelevant to the result.
It’s logarithmic in N. Doubling the number of points only adds \log 2 / ε² dimensions. A dataset that’s 1000× larger only needs 10/ε² more dimensions.
The 1/ε² dependence is sharp. Halving the distortion budget costs 4× the dimensions. This is the standard ε² rate seen everywhere concentration-of-measure arguments produce bounds.

original dim d d = 500 projected dim k k = 50

mean ratio 0.9844

std of ratio 0.0994

predicted ≈ 1/√k 0.1414

max |ratio − 1| 0.3452

pairs measured 1770

JL says: for any fixed precision target ε and number of points N, you only need k = O(log N / ε²) projected dimensions — independent of d. Drag d wildly while k stays put; the histogram doesn't change much. The shape is controlled by k, not d. That dimension-free result is what makes random projection a viable preprocessing step for arbitrarily high-dimensional data.

Histogram of (projected distance) / (original distance) across all pairs of 60 random unit Gaussian vectors. The JL lemma says this is tightly concentrated at 1, with a spread that depends only on the projected dimension k. The empirical spread matches the theoretical 1/√k to within sampling noise.

Drag d wildly while leaving k alone. The histogram of pairwise-distance ratios stays put — exactly the dimension-free claim. Drag k up and the histogram tightens. The empirical std of the ratio distribution matches 1/√k closely.

— think, then check —

For N points in ℝᵈ and any ε > 0, there’s a linear map P : ℝᵈ → ℝᵏ with k = O(log N / ε²) such that all pairwise squared distances are preserved within a (1 ± ε) factor. A random Gaussian or ±1 projection works with high probability.

k depends on N (logarithmically) and ε (as 1/ε²). It is independent of d. That’s the surprising and useful part: you can project 100,000-dim data to 200-dim and preserve pairwise distances to ~5%, regardless of how large the original dimension was.

↳ §7.1 lemma statement

Why this works — the high-D shell, applied

The proof is one calculation away from Ch.6 §2’s concentration of measure. Here it is in three lines, for a single fixed vector x ∈ ℝᵈ.

Let P ∈ ℝᵏˣᵈ have i.i.d. entries P_ij ∼ N(0, 1/k). Then Px ∈ ℝᵏ. The i-th coordinate of Px is a linear combination of standard normals — itself a normal:

(Px)_i = Σⱼ P_ij · x_j ~ N(0, ‖x‖² / k)

So each coordinate of the projection is a Gaussian with variance proportional to ‖x‖² / k. The squared norm of the projection is a sum of squares of these:

‖Px‖² = Σᵢ (Px)_i² where each (Px)_i ~ N(0, ‖x‖²/k)

This is a chi-squared random variable scaled by ‖x‖²/k. Its mean is exactly ‖x‖² (the projection preserves squared norm on average) and its variance is 2‖x‖⁴/k. By a standard sub-Gaussian bound:

P( | ‖Px‖² − ‖x‖² | > ε ‖x‖² ) ≤ 2 exp( − c · ε² · k ) for some constant c > 0. So as k grows, ‖Px‖² is sharply concentrated at ‖x‖².

The probability of more than ε distortion shrinks exponentially in k. To get all N(N−1)/2 ≈ N²/2 pairs preserved simultaneously, union-bound:

P( some pair has distortion > ε ) ≤ N² · 2 exp(−c ε² k) Setting k = O(log N / ε²) makes this < N⁻²: every pair survives w.h.p.

That last line is the JL lemma. The structure is exactly Ch.6 §2’s “smooth functions of many independent variables concentrate” — the smooth function here is the squared norm of Px, the many independent variables are P’s entries, and concentration kicks in fast enough that all pairs survive the union bound.

The bound from Ch.6 §3 also explains why “random pairs of unit vectors are nearly orthogonal” — same calculation in disguise. Both are concentration-of-measure results about high-D Gaussians, with the JL bound being a uniform-over-pairs version. (Vempala 2004, “The Random Projection Method,” AMS, ties them together formally.)

Now make it run

The kernel projects 100 random Gaussian points from d ∈ {32, 128, 512, 2048} down to k = 64 and measures the empirical distortion. The dimension-free claim shows up cleanly:

Johnson-Lindenstrauss distortion (N=100 points, k=64 projected dims)

d        mean ratio   std ratio   min ratio   max ratio   max |1−r|
32       0.9815       0.0916      0.6939      1.2850      0.3061
128      1.0009       0.0872      0.7194      1.3072      0.3072
512      1.0016       0.0893      0.6825      1.3619      0.3619
2048     1.0150       0.0971      0.7041      1.3735      0.3735

Predicted std ≈ 1/√k = 0.1250 for k = 64.

The std hovers at ~0.09 across a 64× change in d. The dimension-free claim is empirically airtight. The worst-case distortion is around 35% — acceptable for many applications, and would shrink to ~10% if k were boosted to 256.

jl_distortion.c (loop) C · sweep d, project to k=64, measure ratio

    double s = 0;
    for (int i = 0; i < n; i++) { double v = a[i] - b[i]; s += v * v; }
    return s;
}

int main(void) {
    const int N = 100;
    int ds[] = { 32, 128, 512, 2048 };
    int k = 64;

    printf("Johnson-Lindenstrauss distortion (N=%d points, k=%d projected dims)\n", N, k);
    printf("Sweep d to check the dimension-free claim:\n\n");
    printf("%-8s %-14s %-14s %-14s %-14s %-14s\n",
           "d", "mean ratio", "std ratio", "min ratio", "max ratio", "max |1−r|");

    for (size_t di = 0; di < sizeof(ds)/sizeof(ds[0]); di++) {
        int d = ds[di];
        double* X = malloc(sizeof(double) * (size_t)N * d);
        double* Y = malloc(sizeof(double) * (size_t)N * k);
        for (int i = 0; i < N * d; i++) X[i] = normal();

        random_project(X, N, d, Y, k);

        double sum = 0, sum2 = 0, lo = INFINITY, hi = 0, maxdev = 0;
        long pairs = 0;
        for (int i = 0; i < N; i++)
            for (int j = i + 1; j < N; j++) {
                double dX = sqrt(dist2(X + i * d, X + j * d, d));
                double dY = sqrt(dist2(Y + i * k, Y + j * k, k));
                double r = dY / dX;
                sum += r;
                sum2 += r * r;

— think, then check —

Engineer B is right. The JL lemma’s projected dimension k depends on the number of points N (logarithmically) and the precision tolerance ε (as 1/ε²) — but is completely independent of the original dimension d.

For a typical vector-search workload with N = 10 million points and ε = 0.1: k ≈ log(10⁷) / 0.01 ≈ 1,600 dimensions. So 4096-dim embeddings can be projected to ~1,600 dims and pairwise distances are preserved within 10%. Same projected dim if the originals were 100,000-dim — d doesn’t appear in the bound.

Operationally: random projection is the cheapest, simplest, most universally-applicable dimensionality reduction. The structure of the data doesn’t matter, the original dimension doesn’t matter — only N and ε. This is what makes it the preprocessing step inside QJL, TurboQuant (Ch.25), and many other compression schemes.

↳ §7.1 dimension-free

Why this is “the rotation result of Ch.6, made constructive”

Ch.6 §3 established that random unit vectors in high dimensions are nearly orthogonal — that their pairwise dot products concentrate sharply at zero with width 1/√d. JL is the constructive version of that fact: a random linear map preserves pairwise dot products and distances with high probability. The same concentration that made the geometry of random vectors predictable in §6.3 is now harnessed as an algorithm.

Stand them next to each other:

Ch.2 §3 — Orthogonal Q preserves dot products exactly. But orthogonal Q is a square matrix (d × d); no dimension reduction.
Ch.6 §3 — Random pairs of unit vectors have near-zero dot product. Descriptive, not algorithmic.
Ch.7 §1 — Random k × d projection (k ≪ d) preserves dot products approximately with high probability. Approximate + dimension-reducing + algorithmic. This is the JL form.

Each is a fact about how randomness and orthogonality interact in high dimensions. The JL form is the one with the most engineering punch — it says “compress by a random matrix, lose almost nothing.”

Why not PCA?

A reasonable question. PCA projects onto the top-k principal components — the directions of largest variance in the data. By construction, PCA is optimal in the sense of minimising reconstruction error (Frobenius norm of the difference). Why not always use PCA?

Three reasons JL is often the better choice:

	JL random projection	PCA
Cost to construct	O(dk) (just sample P)	O(d² N) (covariance + eigendecomposition)
Data dependence	None (use the same P for any data)	Need the data first
New data	Same P works	Need re-decomposition (or projection onto stale axes)
Distortion bound	(1 ± ε) on every pair, guaranteed	Optimal in average squared error, but no per-pair guarantee
Streaming-friendly	Yes (P is fixed)	No (covariance changes as data comes in)
Adversarial robustness	Random projection is unpredictable to adversary	Top-k directions are known/learnable

For applications where you need a guaranteed worst-case bound across all pairs (ANN search, fairness), JL wins. For applications where average reconstruction error matters (compression of a known dataset), PCA wins. In modern ML, JL-style random projections dominate the inference path (TurboQuant, LSH-based ANN, sketching for kernel methods); PCA dominates the data-preprocessing path (whitening features before training, dimension reduction for visualisation).

— think, then check —

(1) Construction cost. PCA needs the covariance of the data, O(d²N) flops, and an eigendecomposition. For N = 10⁹ vectors of d = 4096, that’s ~10¹⁶ flops just to set up the projection. JL: sample a random k×d matrix once — O(dk) ≈ 10⁶ flops for k = 256. Eight orders of magnitude difference.

(2) Freshness. PCA’s principal components depend on the data; if the data distribution drifts (new categories, seasonal patterns), PCA must be redecomposed or it becomes stale. JL is data-independent — the same random P works for any data, forever. No re-fitting.

(3) Worst-case guarantee. PCA minimises average reconstruction error in the Frobenius norm but gives no per-pair distance guarantee — some pairs could be highly distorted even when average reconstruction looks great. JL gives a uniform (1 ± ε) bound on every pair with probability ≥ 1 − N⁻². For ANN search where worst-case rankings matter, this guarantee is what you actually need.

For this workload — large streaming N, distance-preservation guarantees needed, fast setup required — JL is decisively better. PCA fits a different niche (fixed-dataset compression, visualisation, whitening). The two methods aren’t competing; they’re answering different questions about high-D data.

↳ §7.1 JL vs PCA

END OF CH.7 §1 — The Johnson-Lindenstrauss lemma.
Built: JLDistortion viz (slide d, watch the distortion histogram stay put; slide k, watch it tighten); jl_distortion.c demonstrates the dimension-free claim across d = 32 to d = 2048 at fixed k = 64. Three recall items: easy (statement), medium (the dimension-free claim and a concrete engineer-vs-engineer scenario), hard (JL vs PCA trade-offs for streaming workloads).
Coming next: §7.2 — The proof in detail and the Achlioptas ±1 construction. §7.3 will cover Fast JL via Hadamard, closing the loop to TurboQuant.