Isotropic vs anisotropic

Section 5.3

Real ML data (embeddings, attention scores, hidden-layer activations) almost never has equal spread in every direction. Some axes carry orders of magnitude more variance than others. That’s anisotropy, and it’s the structural enemy of per-coordinate quantization: a single scale either wastes bits on directions with little spread, or clips directions with too much. The fix, the one TurboQuant (Ch.25) leans on, is conceptually trivial: rotate the data first. Rotation doesn’t change the shape of the distribution, doesn’t change any pairwise dot product, doesn’t change any score (Ch.2 §3). But it does redistribute the per-coordinate variances, often dramatically. In the experiment at the end of this section, a single random rotation collapses a 10⁴× variance imbalance to ~10². This section gives the algebra (covariance matrices, the QΣQᵀ transformation rule) and runs the experiment in 16 dimensions to confirm.

Multivariate distributions and the covariance matrix

A random vector X ∈ ℝᵈ has a multivariate distribution. The definition of its covariance matrix has been the same since the 19th century; what changed is that it's now a first-class object in ML: PyTorch has torch.cov, JAX has jnp.cov, and any analysis of an embedding space starts with 'what does the covariance look like?' The covariance matrix Σ bundles all the second-moment information about the spread and shape of X:

Σ = E[ (X − μ) (X − μ)ᵀ ] ∈ ℝᵈˣᵈ
Σᵢⱼ = E[ (Xᵢ − μᵢ) (Xⱼ − μⱼ) ] = Cov(Xᵢ, Xⱼ)
Diagonal: Σᵢᵢ = Var(Xᵢ)
Off-diag: how strongly coordinate i and coordinate j move together

A few structural facts about Σ that matter:

Symmetric. Σᵢⱼ = Σⱼᵢ by definition. Its eigenvalues are all real.
Positive semi-definite. All eigenvalues are ≥ 0. (A negative eigenvalue would mean some linear combination of coordinates has negative variance, which is impossible.)
Eigenvalues = variances along principal axes. Diagonalising Σ = U Λ Uᵀ (the spectral theorem), the eigenvectors (columns of U) point along the directions of maximum/minimum variance, and the eigenvalues in Λ are those variances. The first principal component of PCA is the eigenvector corresponding to the largest eigenvalue.
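
The facts above can be checked on a tiny data set. The following is a minimal sketch, not the chapter's kernel: the name sample_cov, the dimension D = 2, and the three sample points in the note below are all assumptions chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define D 2  /* illustrative dimension, not from the text */

/* Sample covariance of n points in R^D:
   cov = sum_k (x_k - mean)(x_k - mean)^T / (n - 1). */
void sample_cov(double x[][D], size_t n, double cov[D][D]) {
    assert(n >= 2);  /* the n - 1 divisor needs at least two samples */

    double mean[D] = {0};
    for (size_t k = 0; k < n; k++)
        for (int i = 0; i < D; i++) mean[i] += x[k][i];
    for (int i = 0; i < D; i++) mean[i] /= (double)n;

    for (int i = 0; i < D; i++)
        for (int j = 0; j < D; j++) {
            double s = 0.0;
            for (size_t k = 0; k < n; k++)
                s += (x[k][i] - mean[i]) * (x[k][j] - mean[j]);
            cov[i][j] = s / (double)(n - 1);
        }
}
```

For the points (1, 2), (2, 4), (3, 6) this gives Σ = [[1, 2], [2, 4]]: symmetric, with the two coordinate variances 1 and 4 on the diagonal. Its eigenvalues are 5 and 0. The zero eigenvalue belongs to the direction (2, −1), along which the three collinear points have no spread at all, which is the boundary case of positive semi-definiteness.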

Isotropic vs anisotropic

A distribution is isotropic if its covariance is proportional to the identity:

Σ_isotropic = σ² · I ← every direction has the same variance

Geometrically, the cloud of samples looks the same from any angle — a uniform fuzzball. Standard N(0, I) is isotropic by construction.

A distribution is anisotropic otherwise: its covariance Σ is not a scalar multiple of the identity, so the eigenvalues of Σ are unequal and the cloud is elongated along some directions, squashed along others. Real ML data is wildly anisotropic. The word is old, but modern ML uses it constantly because it names a real structural problem: embedding spaces are anisotropic, attention scores are anisotropic, weight matrices are anisotropic. Anisotropy makes per-coordinate quantization, PCA, and naive clustering harder than they'd be in a fair world. The viz makes this tactile:

[Interactive figure: a 2-D sample cloud with three sliders (anisotropy as a ratio of variances, original orientation, applied rotation) and two panels, "original" and "after rotation", each reporting Var(x), Var(y), and imbalance.]

Crank the anisotropy slider and watch the “imbalance” reading at the bottom of each panel: the ratio of the larger per-coordinate variance to the smaller. When the original cloud is aligned with the x-axis, the imbalance roughly equals the slider value. When the cloud is off-axis, the imbalance falls somewhere between 1 and the slider value, depending on orientation; at 45° the two coordinates share the variance equally and the imbalance is 1. Now rotate the right panel. Its per-coordinate variances change, and the imbalance can go either way depending on whether the rotation aligns the cloud with the storage axes or pulls it away from them.

A rotation doesn't change the shape of the cloud, and it preserves every pairwise dot product (Ch.2 §3). But it does change which directions are aligned with the storage axes. Quantizers care about that, because they store data per-coordinate. Picking a Q that equalises per-coordinate variance is the whole game of rotation-based quantization.

The key thing to feel: the cloud itself never changed shape. We rotated the picture, not the underlying randomness. Yet the per-coordinate variances changed.
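
The imbalance reading can be reproduced in closed form. The sketch below is an illustration only (the function names and the choice to pass the cloud's orientation as a unit vector are assumptions, not the viz's actual code): for principal variances a and b and a major axis along (ux, uy), the per-coordinate variances are the diagonal of R diag(a, b) Rᵀ.

```c
#include <assert.h>

/* Per-coordinate variances of a 2-D cloud with principal variances a (major)
   and b (minor), whose major axis points along the unit vector (ux, uy).
   These are the diagonal entries of R diag(a, b) R^T; for an orientation
   angle t, (ux, uy) = (cos t, sin t). The caller must pass a unit vector. */
void coord_variances(double a, double b, double ux, double uy,
                     double *var_x, double *var_y) {
    *var_x = a * ux * ux + b * uy * uy;
    *var_y = a * uy * uy + b * ux * ux;
}

/* Imbalance: the larger per-coordinate variance divided by the smaller. */
double imbalance(double var_x, double var_y) {
    assert(var_x > 0.0 && var_y > 0.0);
    return var_x > var_y ? var_x / var_y : var_y / var_x;
}
```

With a = 8 and b = 1, an axis-aligned cloud (ux, uy) = (1, 0) has imbalance 8. At 45°, (ux, uy) = (0.7071…, 0.7071…), both coordinates get variance 4.5 and the imbalance is 1. For the direction (0.6, 0.8), roughly 53°, the variances are 3.52 and 5.48, an imbalance of about 1.56. In every case Var(x) + Var(y) = 9.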

— think, then check —

Σ = E[(X − μ)(X − μ)ᵀ] where μ = E[X]. Σ is a d × d matrix for X ∈ ℝᵈ.

Diagonal entries Σᵢᵢ = Var(Xᵢ) — the variance of coordinate i, i.e. the spread along axis i.

Off-diagonal entries Σᵢⱼ = Cov(Xᵢ, Xⱼ) — how coordinate i and coordinate j vary together. Positive = they tend to move together; negative = they tend to move oppositely; zero = no linear relationship.

Geometrically, the eigenvalues of Σ are the variances along the principal axes (the eigenvectors of Σ) — and an isotropic distribution has Σ = σ²I, equal variance everywhere, while anisotropic Σ has eigenvalues that differ.

↳ §5.3 covariance

The transformation rule: Σ → QΣQᵀ

If Y = QX for some matrix Q, then Y’s covariance is:

Cov(Y) = E[ Y Yᵀ ] − E[Y] E[Y]ᵀ
       = Q · E[X Xᵀ] · Qᵀ − Q E[X] E[X]ᵀ Qᵀ
       = Q · Cov(X) · Qᵀ
       = Q Σ Qᵀ

This is the multivariate analogue of Var(aX) = a² Var(X) — applying a matrix conjugates the covariance.

When Q is orthogonal (QᵀQ = I, from Ch.2 §3), three things happen:

Pairwise dot products are preserved. ⟨Qx, Qy⟩ = ⟨x, y⟩ for every pair — the §2.3 invariant.
The distribution itself is unchanged in shape, just rotated. If X ∼ N(0, Σ), then QX ∼ N(0, QΣQᵀ) — still a Gaussian, just with a different covariance matrix.
The total variance, the trace of Σ, is invariant. tr(QΣQᵀ) = tr(QᵀQΣ) = tr(Σ). The trace is rotation-invariant because QΣQᵀ has the same eigenvalues as Σ and the trace is their sum. So rotation doesn’t change the total spread; it only redistributes it across coordinates.

That redistribution is the lever. We can’t change the total spread, but we can choose a rotation that spreads it evenly across coordinates — equalising the diagonal of QΣQᵀ. That’s what makes per-coordinate quantization with one global scale work well.
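
A small numeric instance of the rule may help. This is a sketch under stated assumptions: the helper names rotate_cov and trace, the dimension D = 3, and the example matrices in the note below are illustrative and do not come from the chapter's kernel.

```c
#include <assert.h>

#define D 3  /* illustrative dimension, not from the text */

/* out = Q * S * Q^T. `out` must not alias Q or S. */
void rotate_cov(double Q[D][D], double S[D][D], double out[D][D]) {
    assert(out != Q && out != S);
    double T[D][D];  /* T = Q * S */
    for (int i = 0; i < D; i++)
        for (int j = 0; j < D; j++) {
            T[i][j] = 0.0;
            for (int k = 0; k < D; k++) T[i][j] += Q[i][k] * S[k][j];
        }
    for (int i = 0; i < D; i++)
        for (int j = 0; j < D; j++) {
            out[i][j] = 0.0;
            /* (Q^T)[k][j] = Q[j][k] */
            for (int k = 0; k < D; k++) out[i][j] += T[i][k] * Q[j][k];
        }
}

/* Sum of the diagonal: the total variance when M is a covariance matrix. */
double trace(double M[D][D]) {
    double t = 0.0;
    for (int i = 0; i < D; i++) t += M[i][i];
    return t;
}
```

Take Σ = diag(9, 1, 0.25) and let Q rotate the first two axes using the 3-4-5 triangle, Q = [[0.6, −0.8, 0], [0.8, 0.6, 0], [0, 0, 1]]. The diagonal of QΣQᵀ comes out as (3.88, 6.12, 0.25): the imbalance between the first two coordinates falls from 9× to about 1.6×, while the trace stays at 10.25. The off-diagonal entry 3.84 shows where the structure went: the rotated coordinates are now correlated.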

— think, then check —

Trace is cyclic: tr(ABC) = tr(BCA) = tr(CAB) for any matrices A, B, C of compatible shapes.

For an orthogonal Q (QᵀQ = I) and any covariance Σ:

tr(QΣQᵀ) = tr(QᵀQΣ) = tr(IΣ) = tr(Σ).

So the total per-coordinate variance is conserved exactly under rotation. The rotation just redistributes the variance among coordinates — it never creates or destroys variance. This is the mathematical reason behind “rotation can’t help if you’re already isotropic” (the total was already evenly split) and “rotation can dramatically help if you’re anisotropic” (because the total is the same, equalising the diagonal means each coord gets trace(Σ) / d).

↳ §5.3 trace invariance

What random rotation buys for quantization

Here’s the operational chain that makes TurboQuant work:

1. You want to store each coordinate of a vector as int8 (Ch.3 §3) using a single global scale. The scale must accommodate the largest per-coordinate variance; anything smaller would clip the extreme values of that coordinate.
2. With anisotropic data, the largest per-coordinate variance is orders of magnitude bigger than the smallest. Storing every coord at the same scale means the small-variance coords use only a tiny fraction of int8’s range, so most of their bits are unused.
3. If you rotate the data first by an orthogonal Q (random, or structured such as Hadamard), the per-coordinate variances approximately equalise. Each coord now uses most of int8’s range, and the same scale is no longer wasteful.
4. The score computation ⟨q, v⟩ survives the rotation exactly: ⟨Qq, Qv⟩ = ⟨q, v⟩. So you score on the rotated representation and the ranking is identical to scoring on the original.

Steps 1 and 2 are the problem; step 3 (rotation) is the fix; step 4 (orthogonality) is the license to apply the fix. This section established 1–2; Ch.2 §3 established 4; the rest of the loop closes in Ch.25.
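
The waste described in the first two steps can be made concrete. What follows is a hedged sketch of symmetric int8 quantization with one global scale, not TurboQuant's or Qdrant's actual quantizer: the function names, the clamp to [−127, 127], and the convention of sizing the scale to ±3 standard deviations of the widest coordinate are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Symmetric int8 quantization with one global scale:
   q = round(x / scale), clamped to [-127, 127]. */
int8_t quantize(double x, double scale) {
    assert(scale > 0.0);
    double v = x / scale;
    if (v > 127.0) v = 127.0;    /* clip values beyond the representable range */
    if (v < -127.0) v = -127.0;
    return (int8_t)(v >= 0.0 ? v + 0.5 : v - 0.5);  /* round half away from zero */
}

/* Map an int8 code back to an approximate real value. */
double dequantize(int8_t q, double scale) {
    return q * scale;
}

/* Number of distinct int8 codes reachable by values in [-range, +range]. */
int levels_used(double range, double scale) {
    return (int)quantize(range, scale) - (int)quantize(-range, scale) + 1;
}
```

Suppose the widest coordinate has variance 100 (standard deviation 10), so under the ±3σ assumption the scale is 30 / 127. That coordinate reaches all 255 codes. A coordinate with variance 0.01 (standard deviation 0.1) stays within ±0.3, which lands on only three codes: −1, 0 and 1. Anything beyond ±30 is clipped to ±127.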

Now make it run

The kernel generates 20,000 samples from a 16-D Gaussian where the per-coordinate variances span 10⁴× (from 100 down to 0.01). It then applies a random orthogonal Q (three composed Householder reflections) and reports the post-rotation variance imbalance.

anisotropy.c (key) C · build anisotropic data, rotate, measure

    for (int n = 0; n < NSAMP; n++)
        for (int i = 0; i < DIM; i++)
            X[n][i] = per_sd[i] * normal();

    /* Compute empirical per-coord variance. */
    double var_orig[DIM] = {0};
    {
        double mean[DIM] = {0};
        for (int n = 0; n < NSAMP; n++) for (int i = 0; i < DIM; i++) mean[i] += X[n][i];
        for (int i = 0; i < DIM; i++) mean[i] /= NSAMP;
        for (int n = 0; n < NSAMP; n++)
            for (int i = 0; i < DIM; i++) {
                double d = X[n][i] - mean[i];
                var_orig[i] += d * d;
            }
        for (int i = 0; i < DIM; i++) var_orig[i] /= (NSAMP - 1);
    }

    /* Build a random orthogonal Q = H1 · H2 · H3 (three composed Householders). */
    double H1[DIM][DIM], H2[DIM][DIM], H3[DIM][DIM], T[DIM][DIM], Q[DIM][DIM];
    householder(H1); householder(H2); householder(H3);
    compose(T, H1, H2);
    compose(Q, T,  H3);

    /* Apply Q to every sample: y = Q · x. */
    static double Y[NSAMP][DIM];
    for (int n = 0; n < NSAMP; n++)
        for (int i = 0; i < DIM; i++) {
            double s = 0;
            for (int j = 0; j < DIM; j++) s += Q[i][j] * X[n][j];
            Y[n][i] = s;
        }

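    /* Empirical per-coord variance of the rotated samples (same recipe as above). */
    double var_rot[DIM] = {0};
    {
        double mean[DIM] = {0};
        for (int n = 0; n < NSAMP; n++) for (int i = 0; i < DIM; i++) mean[i] += Y[n][i];
        for (int i = 0; i < DIM; i++) mean[i] /= NSAMP;
        for (int n = 0; n < NSAMP; n++)
            for (int i = 0; i < DIM; i++) {
                double d = Y[n][i] - mean[i];
                var_rot[i] += d * d;
            }
        for (int i = 0; i < DIM; i++) var_rot[i] /= (NSAMP - 1);
    }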
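
The excerpt calls householder() and compose() without showing them. One plausible shape for those helpers is sketched below; this is an assumption about what they might look like, not the chapter's actual source. In particular, drawing the reflection vector from rand() with uniform entries is a stand-in (the real kernel may well use its normal() generator), and orthogonality_error is a hypothetical checker added for illustration.

```c
#include <assert.h>
#include <stdlib.h>

#define DIM 16

/* Fill H with a Householder reflection H = I - 2 v v^T / (v^T v).
   Assumption: v has uniform entries in [-0.5, 0.5] drawn with rand(). */
void householder(double H[DIM][DIM]) {
    double v[DIM], vv = 0.0;
    for (int i = 0; i < DIM; i++) {
        v[i] = (double)rand() / RAND_MAX - 0.5;
        vv += v[i] * v[i];
    }
    assert(vv > 0.0);  /* v must be nonzero */
    for (int i = 0; i < DIM; i++)
        for (int j = 0; j < DIM; j++)
            H[i][j] = (i == j ? 1.0 : 0.0) - 2.0 * v[i] * v[j] / vv;
}

/* out = A * B. `out` must not alias A or B. */
void compose(double out[DIM][DIM], double A[DIM][DIM], double B[DIM][DIM]) {
    for (int i = 0; i < DIM; i++)
        for (int j = 0; j < DIM; j++) {
            double s = 0.0;
            for (int k = 0; k < DIM; k++) s += A[i][k] * B[k][j];
            out[i][j] = s;
        }
}

/* Largest absolute entry of Q^T Q - I; near zero when Q is orthogonal. */
double orthogonality_error(double Q[DIM][DIM]) {
    double worst = 0.0;
    for (int i = 0; i < DIM; i++)
        for (int j = 0; j < DIM; j++) {
            double s = 0.0;
            for (int k = 0; k < DIM; k++) s += Q[k][i] * Q[k][j];
            double e = s - (i == j ? 1.0 : 0.0);
            if (e < 0.0) e = -e;
            if (e > worst) worst = e;
        }
    return worst;
}
```

Each reflection is orthogonal and symmetric, and a product of orthogonal matrices is orthogonal, so Q = H1 · H2 · H3 satisfies QᵀQ = I up to rounding. Checking orthogonality_error(Q) against a small tolerance is a cheap sanity test before trusting any variance numbers computed from QX.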

The output is the entire pitch for rotation-based quantization in numbers:

anisotropic 16-D Gaussian, 20000 samples
per-coord variances range:
  original  min =       0.0098   max =      99.2468   max/min =      10100.2
  rotated   min =       0.4085   max =      48.4737   max/min =        118.7

total variance (trace of cov) is preserved (rotation invariant):
  trace(Σ_orig)    = 216.3831
  trace(Σ_rotated) = 216.3831   (should match — Q preserves trace)

the imbalance drops by 85x after one random rotation —
which is the operational case for rotation-based quantization.

Three things to read off:

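The original imbalance is 10,100×, within about 1% of the 10,000× you’d predict from the planted per-coordinate variances (100 / 0.01). The small gap is sampling noise from estimating variances on 20,000 samples.
After one random rotation the imbalance drops to 119×, an 85× improvement from a single rotation. The residual imbalance is expected: a product of three Householder reflections differs from the identity only on a three-dimensional subspace, so it cannot mix all 16 coordinates evenly. With more reflections composed (or a designed structured rotation like Hadamard), it can be pushed lower. Production TurboQuant uses Hadamard.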
The trace is preserved to the digit — 216.38 → 216.38. The total variance is conserved exactly; only its distribution across coordinates changed.
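
To see why a structured rotation like Hadamard is attractive, here is a minimal sketch of a normalized Walsh-Hadamard transform in 16 dimensions. It is an illustration, not TurboQuant's implementation; the names hadamard16 and rotated_variances are made up for this example, and it handles only the axis-aligned (diagonal Σ) case that the kernel plants.

```c
#include <assert.h>

#define DIM 16  /* must be a power of two; 1/sqrt(16) = 0.25 exactly */

/* In-place normalized Walsh-Hadamard transform: x <- H x, where H is
   orthogonal and every entry of H is +0.25 or -0.25. */
void hadamard16(double x[DIM]) {
    for (int h = 1; h < DIM; h *= 2)
        for (int i = 0; i < DIM; i += 2 * h)
            for (int j = i; j < i + h; j++) {
                double a = x[j], b = x[j + h];
                x[j] = a + b;      /* butterfly: sum */
                x[j + h] = a - b;  /* butterfly: difference */
            }
    for (int i = 0; i < DIM; i++) x[i] *= 0.25;
}

/* Diagonal of H diag(var) H^T: out[i] = sum_j H[i][j]^2 * var[j]. */
void rotated_variances(const double var[DIM], double out[DIM]) {
    assert(var != out);
    for (int i = 0; i < DIM; i++) out[i] = 0.0;
    for (int j = 0; j < DIM; j++) {
        double col[DIM] = {0};
        col[j] = 1.0;
        hadamard16(col);  /* col is now column j of H */
        for (int i = 0; i < DIM; i++) out[i] += col[i] * col[i] * var[j];
    }
}
```

Because every entry of H has the same magnitude, each rotated coordinate receives exactly trace(Σ) / d when Σ is diagonal: planted variances 1, 2, …, 16 (a 16× imbalance) all come out as 8.5. For covariances that are not axis-aligned the equalisation is no longer exact, and randomized Hadamard schemes commonly add random sign flips before the transform.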

This is the entire scientific case for TurboQuant. Ch.2 §3 said orthogonal Q preserves dot products. §5.3 (this section) says orthogonal Q can equalise per-coordinate variance. Both are mathematical facts about orthogonal matrices. Combine them and you get: store every database vector as Qv instead of v, score against Qq instead of q, and a single int8 quantizer with one global scale works for everything because the rotated coordinates have ≈ equal variance. Score quality is preserved; storage drops to 1/4 of float32; lookup throughput goes up by ~4× (Ch.3 §2’s lane-density argument). This is the picture; Ch.25 codes it up against the actual Qdrant database.

— think, then check —

Step 1: anisotropy is the obstacle. A single int8 quantizer with one scale wastes bits on coordinates with low variance and clips coordinates with high variance. The wasted-bit ratio tracks the imbalance ratio max(Var(Xᵢ)) / min(Var(Xᵢ)). For real embedding spaces this is often 10²–10⁴. Because a coordinate’s range scales with its standard deviation, that is a 10×–100× gap in range, meaning low-variance coords use only about 3–25 of the 256 int8 levels effectively.

Step 2: rotation redistributes variance. Applying Q to X gives a new vector QX with covariance QΣQᵀ. By trace cyclicity, tr(QΣQᵀ) = tr(Σ), so the total variance is conserved, but its distribution across coordinates can change. A well-chosen Q (random orthogonal, Hadamard, etc.) approximately equalises the diagonal of QΣQᵀ, dropping the imbalance ratio by orders of magnitude. Empirically (kernel above): 10,100× → 119× with one random Q.

Step 3: scores survive. By Ch.2 §3, ⟨Qq, Qv⟩ = qᵀ QᵀQ v = qᵀ I v = ⟨q, v⟩ for any orthogonal Q. Both the query and the database vector go through the same rotation; the rotations cancel in the inner product. So the score after rotation is mathematically identical to the score before (equal up to floating-point rounding in practice): no information is lost, just reorganised.

Combining: every database vector v is stored as Qv (per-coord variance equalised), every query q is rotated to Qq at query time, and the score ⟨Qq, Qv⟩ = ⟨q, v⟩. The single int8 quantizer applied to the rotated coordinates is now near-optimal because variances are balanced. Storage shrinks 4× vs float32, score quality is preserved, lookup throughput goes up 4× from the SIMD-lane density argument (Ch.3 §2). That’s TurboQuant in five sentences — and every link in the chain is one of the structural facts of Ch.2, Ch.3, and this section.

↳ §5.3 + Ch.2 §3 + Ch.3 §3
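
The score-preservation step can be checked numerically with a single Householder reflection, which is orthogonal. This is a sketch only: the names dot and reflect, the dimension D = 4, and the example vectors in the note below are assumptions for illustration.

```c
#include <assert.h>

#define D 4  /* illustrative dimension, not from the text */

/* Inner product <a, b>. */
double dot(const double a[D], const double b[D]) {
    double s = 0.0;
    for (int i = 0; i < D; i++) s += a[i] * b[i];
    return s;
}

/* y = H x, where H = I - 2 u u^T / (u^T u) is a Householder reflection.
   H is never formed explicitly: H x = x - (2 <u, x> / <u, u>) u. */
void reflect(const double u[D], const double x[D], double y[D]) {
    double uu = dot(u, u);
    assert(uu > 0.0);  /* u must be nonzero */
    double c = 2.0 * dot(u, x) / uu;
    for (int i = 0; i < D; i++) y[i] = x[i] - c * u[i];
}
```

With u = (1, 2, 3, 4), q = (1, 0, −1, 2) and v = (0.5, 1, 1.5, −2), the original score is ⟨q, v⟩ = −5. Reflecting gives Hq = (0.6, −0.8, −2.2, 0.4) and Hv ≈ (0.567, 1.133, 1.7, −1.733), and ⟨Hq, Hv⟩ is again −5 up to floating-point rounding, even though every individual coordinate changed.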

END OF CH.5 — Distributions, variance, expectation.
§1 (RVs, E[X], Var(X), the √N law) · §2 (the Gaussian and the CLT) · §3 (isotropic vs anisotropic, the rotation-redistributes-variance picture).

All three sections compile and run. The Ch.5 §3 kernel demonstrates an 85× drop in variance imbalance from one random rotation on a 16-D Gaussian. Coming next: Ch.6 — High-dimensional geometry. Why high-D is weird, why “almost orthogonal” is the rule rather than the exception, and what concentration of measure buys for ML.