Concentration of measure

Section 6.2

Concentration of measure

§6.1 made high dimensions look hopeless. This section gives back the structural fact that makes them workable — and in a few places, makes them better than low dimensions for what ML wants to do. Concentration of measure core result The phenomenon that in high-dimensional probability spaces, smooth functions of many independent random variables are 'essentially constant' — their values cluster sharply around the expectation. Manifests as: high-D Gaussian samples concentrate on a thin shell at radius √d; the projection of a random unit vector onto a fixed axis is concentrated near 0; random walks deviate from expectation by sub-Gaussian tails. Foundational for random projections (Ch.7 JL), high-D estimation, ANN navigation, and most of statistical learning theory. is the phenomenon that, despite the curse, certain quantities in high dimensions are extraordinarily predictable. The norm of a random Gaussian vector — a number you’d think wanders wildly across many dimensions — is in fact concentrated on a vanishingly thin shell. The cleanest version of this fact, which we’ll run in code, is also the most useful one for ML: it’s what makes random projections (Ch.7) work, what makes HNSW navigation predictable, and what makes the geometry of attention scores tractable.

The Gaussian shell

For X ∼ N(0, I_d) — a standard multivariate Gaussian in d dimensions — what does ‖X‖ look like? Each coordinate X_i is an independent N(0, 1); ‖X‖² = Σ X_i² is therefore a chi-squared distribution with d degrees of freedom, which has mean d and variance 2d.

So E[‖X‖²] = d exactly. By Jensen’s inequality and a Taylor expansion, E[‖X‖] ≈ √d to leading order. The interesting part is the variance:

Var(‖X‖) ≈ 1/2 (independent of d, for large d)

The absolute width of the distribution of ‖X‖ is a constant. The relative width σ/E[‖X‖] = O(1/√d) shrinks as d grows. So almost every sample lands within a vanishingly small relative neighbourhood of √d:

Concentration of measure: ‖X‖ for X ∼ N(0, I_d)

d      √d         mean ‖X‖   std ‖X‖    min ‖X‖    max ‖X‖
2      1.41       1.26       0.66       0.03       4.37
10     3.16       3.09       0.70       0.83       6.35
50     7.07       7.05       0.70       4.41       9.94
100    10.00      9.98       0.70       7.23       12.54
500    22.36      22.36      0.71       19.68      24.97
1000   31.62      31.61      0.71       28.99      34.22

The empirical std is constant at ~0.7 regardless of dimension. At d = 1000 the worst sample out of 10,000 had norm 28.99 — within 9% of √1000 = 31.62. The Gaussian “ball” isn’t really a ball at all — it’s a Gaussian shell:

dimension d d = 50

√d (predicted mean)7.07

empirical mean7.04

empirical std0.709

relative width σ/√d0.100

Push d up. The histogram drifts to higher radii and narrows: the mean follows √d while the absolute standard deviation stays near 1/√2 ≈ 0.71. So the relative width σ/√d shrinks like 1/√d. By d = 100 almost every sample is within 5% of the predicted shell.

"Concentration of measure" made tangible: in high dimensions, almost ALL standard-Gaussian samples land at distance ≈ √d from the origin. The multivariate density's peak is at zero, but the multivariate density's mass sits on a thin shell — a counterintuitive but rigorous consequence of integrating in many variables at once.

Slide d. Watch the histogram both move (mean drifts to √d) and narrow (absolute std stays ≈ 0.7, relative width shrinks). The qualitative pattern is the most counter-intuitive fact in high-dimensional probability: almost zero mass is at the origin, even though the density’s peak is there.

The intuitive paradox: the density of N(0, I_d) is maximised at the origin. How can almost no probability be there? The resolution is that the density measures probability per unit volume, but volume in a thin shell at radius r is proportional to r^(d−1), which is enormous in high d. The density falls off exponentially with ‖x‖², but the volume rises polynomially with ‖x‖^(d−1). They meet at the shell. Same fact, different vocabulary: the density × volume — i.e. the radial measure — concentrates sharply near √d.

— think, then check —

E[‖X‖] ≈ √d. The variance of ‖X‖ is roughly 1/2 (a constant independent of d), so the standard deviation is ~ 0.7. Therefore the relative width σ / E[‖X‖] ≈ 0.7 / √d, which shrinks as 1/√d.

Concretely: at d = 100, almost every sample of N(0, I_100) has norm within a few percent of 10. The density peaks at the origin, but almost no probability mass sits near there — it concentrates on a thin shell at radius √d. The d=2/d=3 intuition for what a Gaussian “looks like” is misleading.

↳ §6.2 shell

The general principle — Lipschitz functions are nearly constant

Concentration of measure isn’t just about the norm. The general phenomenon: any sufficiently smooth function of many independent random variables is nearly constant. The precise statement requires defining “smooth” — typically as L-Lipschitz (output changes by at most L times the input change). For X ∼ N(0, I_d) and f L-Lipschitz, there’s a clean tail bound:

P( |f(X) − E[f(X)]| > t ) ≤ 2 exp( − t² / (2 L²) ) (no dependence on d!)

This is the Gaussian Lipschitz inequality. The norm f(X) = ‖X‖ is 1-Lipschitz (by the triangle inequality), so the bound applies — and explains why the absolute width of ‖X‖ stays at ~0.7 regardless of dimension.

The same theorem applies to any other 1-Lipschitz function:

The projection f(X) = ⟨v, X⟩ onto any fixed unit vector v is 1-Lipschitz and Gaussian (specifically N(0, 1)). It’s concentrated near 0 to within ±1 standard deviation regardless of dimension.
A neural network with bounded weights is L-Lipschitz with L computable from the weight norms. Its output, on random Gaussian input, is concentrated around its mean.
The largest eigenvalue of a random matrix is Lipschitz in the matrix entries; it concentrates around the asymptotic value (the Marchenko-Pastur edge for sample covariance, etc.).

This is the structural reason why so much high-D statistics has predictable behaviour: smooth aggregate quantities don’t fluctuate randomly across dimensions, they nail down to their means.

— think, then check —

An absolute width of 0.7 means very different things at d = 4 (where E[‖X‖] = 2 and 0.7/2 = 35% relative width — wide) versus d = 1000 (where E[‖X‖] ≈ 31.6 and 0.7/31.6 = 2.2% relative width — sharp).

For nearest-neighbour search, ANN navigation, and quantization error budgets, what matters is whether a quantity can be predicted to within a fraction of its size — not within a fixed absolute number. The relative width shrinking as 1/√d means: as you go to higher dimensions, you can predict the norm of a random Gaussian sample more precisely (in percent) — even though you can predict it no better in absolute units.

The same relative-vs-absolute distinction applies everywhere concentration arguments come up in ML: dot products of unit vectors (Ch.6 §3) concentrate near 0 with absolute std ~1/√d, gradient norms in high-D NN training concentrate, etc. The 1/√d relative law is what makes algorithms predictable; the constant absolute spread is just a feature of the underlying Gaussian.

↳ §6.2 absolute vs relative concentration

Concentration is a feature, not a bug

Stand §6.1 and §6.2 next to each other. The curse said “high-D nearest neighbour is hopeless because every pair is roughly the same distance.” Concentration says “in high-D, smooth functions of many variables are nearly constant.” Both are correct, both have the same underlying mechanism, and they’re really two sides of one phenomenon:

Curse view: the distance from a query to a database vector is one of these “smooth aggregate” quantities. It concentrates sharply around its mean. The distance from your query to its TRUE nearest neighbour and to its 100th-nearest neighbour are nearly identical — that’s the curse.
Concentration view: the SAME concentration means that a random projection of a vector approximately preserves its norm. So you can map a 1000-D space down to 100-D by random projection and pairwise distances are mostly preserved — that’s the Johnson-Lindenstrauss lemma (Ch.7).

The math is identical. The framing is different. Whether you call it “curse” or “blessing” depends on what you wanted to do with the metric.

gaussian_shell.c (loop) C · sample ‖X‖, watch concentration

    printf("%-6s %-12s %-12s %-12s %-12s %-14s\n",
           "d", "√d", "mean ‖X‖", "std ‖X‖", "min ‖X‖", "max ‖X‖");

    double* x = malloc(sizeof(double) * 1024);
    for (size_t i = 0; i < sizeof(dims)/sizeof(dims[0]); i++) {
        int d = dims[i];
        double sum = 0.0, sum2 = 0.0;
        double lo = INFINITY, hi = 0.0;
        for (int n = 0; n < NSAMP; n++) {
            for (int j = 0; j < d; j++) x[j] = normal();
            double r = norm(x, d);
            sum += r;
            sum2 += r * r;
            if (r < lo) lo = r;
            if (r > hi) hi = r;
        }
        double mean = sum / NSAMP;
        double var = (sum2 / NSAMP) - mean * mean;
        double sd = sqrt(var);
        printf("%-6d %-12.4f %-12.4f %-12.4f %-12.4f %-14.4f\n",
               d, sqrt((double)d), mean, sd, lo, hi);
    }

The kernel confirms the theory to several decimals across d = 2 to d = 1000. Empirical mean of ‖X‖ matches √d to within 0.05 at every dimension; empirical std stays at 0.70 throughout. By d = 1000, the worst out of 10,000 samples had norm within 3 of √1000. The shell is real and the constants are tight.

— think, then check —

They’re not contradictory; they’re complementary statements about the same concentration phenomenon, viewed at different scales.

Curse view (pessimistic framing): in a high-D dataset, pairwise distances ALL look similar. So the spread of distances between pairs is small relative to the typical distance — the “max − min” range over all pairs concentrates. That makes “find THE nearest” meaningless: the differences are smaller than the noise.

JL view (optimistic framing): the SAME concentration means a random low-D projection (from d-dim to k-dim, k ≪ d) preserves the exact value of each pairwise distance approximately, with relative error ε at the cost of k ∼ O(log N / ε²) dimensions. The projected distance for a single pair is a sum of squared independent Gaussians (chi-squared) — it concentrates around its mean exactly because of the §6.2 shell argument. So the absolute distance value is reliably preserved.

The curse says “distances within the dataset don’t help you discriminate among themselves.” JL says “distance values are individually predictable across random projections.” Both are right; they’re answering different operational questions. The vector-search engineer uses BOTH facts: the curse says exact NN is pointless (use ANN), JL says random projections + ANN preserve enough geometric structure for ANN to work in practice.

↳ §6.2 + §6.1

END OF CH.6 §2 — Concentration of measure.
Built: GaussianShell viz (drag d, see the norm distribution narrow and drift to √d); gaussian_shell.c sweeps d=2 to d=1000, confirms E[‖X‖] = √d and σ ≈ 0.7 across the board. Three recall items: easy (shell claim), medium (relative vs absolute width), hard (curse and JL as two faces of one phenomenon).
Coming next: §6.3 — Random unit vectors are nearly orthogonal. The capacity-of-high-D-embeddings argument and the rotation-equalisation pay-off.