DISTRIBUTIONS, VARIANCE, EXPECTATION
Section 5.1
01

Random variables, expectation, variance

Every quantity in a deployed ML system is a random variable. The loss is random because the input batch is sampled. The gradient is random because the loss is. A neural network’s activations are random because the weights were trained on random data. The convergence rate of training is random; whether a prompt produces a particular completion is random; whether an attention head fires on a token is random. To reason about any of it usefully, you need to talk about distributions — what values appear with what frequency — and to summarise those distributions with two numbers: expectation (where the mass is) and variance (how spread out it is). That’s all this section establishes. Everything else in Part II builds on it.

A random variable is a number with uncertain value

Formally, a random variable is a number-valued function on a probability space. Operationally, that mathematical scaffolding is irrelevant — a random variable is just a number whose value you don’t know in advance, characterized entirely by its distribution — for each possible value, the probability (or density) of seeing it.

The most important distributions for ML:

density / pmfE[X]Var(X)
Bernoulli(p)P(X=1) = p, P(X=0) = 1−ppp(1−p)
Uniform(a, b)1/(b−a) on [a, b](a+b)/2(b−a)²/12
Normal N(μ, σ²)(1/√(2πσ²)) exp(−(x−μ)²/2σ²)μσ²
Exponential(λ)λ e^(−λx) on x ≥ 01/λ1/λ²

We’ll come back to the normal in §5.2 because it’s special. For now, every one of these is “a number whose value follows this rule.” Write X ∼ N(0, 1) to say “X is distributed as a standard normal.”

Expectation — the centre of mass

The expectation of a random variable is its long-run average:

E[X] = ∫ x · p(x) dx (continuous) = Σ x · P(X = x) (discrete)

Geometrically, E[X] is the centre of mass of the distribution — if you imagine the density as a physical object, this is where it balances. The notation is suggestive: think “expected value of X.”

Two facts about expectation worth keeping close:

— think, then check —

The integral / sum that defines expectation is linear in its argument: ∫ (x + y) · p(x, y) dx dy = ∫ x · p(x, y) dx dy + ∫ y · p(x, y) dx dy = E[X] + E[Y]. No factorisation of p(x, y) was needed — joint probabilities just get summed over, regardless of whether X and Y are correlated.

Application: the expected number of “fixed points” in a random permutation of n items is n · (1/n) = 1, regardless of n. The number of fixed points is a sum of n highly-correlated indicator variables (the i-th fixed-point indicator); their joint distribution is messy, but linearity of expectation just adds their individual means. (Same trick gives the expected number of edges in a random graph, the expected longest increasing subsequence to within constants, etc.)

In ML: linearity is why the gradient of a sum of losses is the sum of the gradients. ∇E[L(θ; x)] = E[∇L(θ; x)] — gradient and expectation commute because expectation is linear. This is the math identity that makes stochastic gradient descent legitimate: the expected SGD direction equals the true gradient of the average loss.

Variance — the spread around the centre

The variance of X is the average squared distance from the mean:

Var(X) = E[ (X − E[X])² ] = E[X²] − (E[X])² ← equivalent algebraic form, useful in proofs

The variance is always non-negative, and it’s zero exactly when X is constant. Its square root, standard deviation σ = √Var(X), is more interpretable because it has the same units as X (variance of a metres-valued RV is in m²; standard deviation in m).

Operational rules:

Estimating E[X] and Var(X) from samples — the √N law

In practice we usually don’t know E[X] and Var(X) analytically. We have N samples x_1, …, x_N drawn independently from the distribution. The standard estimators:

x̄ = (1/N) Σᵢ xᵢ ← sample mean s² = (1/(N − 1)) Σᵢ (xᵢ − x̄)² ← unbiased sample variance

The N − 1 in the denominator (rather than N) is Bessel’s correction — it makes the estimator unbiased, accounting for the fact that is itself an estimate. The intuition: you “spent” one degree of freedom estimating the mean, leaving only N−1 independent deviations to estimate the variance from.

The classical Law of Large Numbers says x̄ → E[X] as N → ∞. The interesting part is how fast. The standard error of the sample mean is:

SE(x̄) = σ / √N ⇒ |x̄ − E[X]| is typically O(1/√N)

This √N law is everywhere in ML:

To get one extra decimal digit of precision, you need to multiply N by 100. That’s the iron rule.

true mean = 0.50-0.31.3
N 0
x̄ (sample mean)
true E[X] 0.5000
mean error
SE = σ/√N
s² (sample var)
true Var(X) 0.0833
The orange dashed line is the true E[X]; the rust line is the running sample mean. The two converge as N grows — and the typical error |x̄ − E[X]| shrinks like σ/√N. Slow as a rock past N ~ 10⁴: to halve the error you must quadruple the samples. That √N law shows up everywhere in ML — gradient noise, MCMC chains, confidence intervals.
Pick a distribution; click +1 or hit play. The histogram of seen samples builds toward the true density; the running sample mean and variance drift to the true values. The interesting thing isn't that they converge — it's how slowly.

Drag the distribution; click +10k or play. Watch how slowly the sample mean (rust line) converges on the true mean (orange dashed line). At N = 1,000,000, the error is still ~10⁻³. To halve that error, you need 4 million samples. The viz is the √N law made tactile.

Now make it run — Welford’s streaming estimator

Computing variance from samples has a classic numerical hazard. The “naïve” formula Var = E[X²] − E[X]² works fine in math but in float arithmetic it subtracts two nearly-equal large numbers when the data is concentrated. Catastrophic cancellation (Ch.3 §1) destroys the variance estimate.

Welford’s algorithm is the standard fix. Each new sample updates a running mean and a running “sum of squared deviations from the current mean”; no cancellation, one pass, O(1) memory:

welford.c (loop body) C · Welford streaming update
    double delta = x - w->mean;
    w->mean += delta / w->n;
    double delta2 = x - w->mean;
    w->M2 += delta * delta2;
}
static double sample_variance(const Welford* w) {
    return w->n > 1 ? w->M2 / (w->n - 1) : 0.0;
}

int main(void) {
    Welford w = { 0, 0.0, 0.0 };

The kernel runs the estimator on 1,000,000 standard-normal samples and prints the mean/variance estimate at N = 10, 100, 1,000, …, 1,000,000 alongside the predicted standard error 1/√n:

Welford streaming estimator on N(0, 1)
N          sample mean  sample var   |mean err|     1/√N
10         0.194984     0.899576     0.194984       0.316228
100        -0.072243    0.867276     0.072243       0.100000
1000       0.017647     1.030986     0.017647       0.031623
10000      -0.005975    1.029293     0.005975       0.010000
100000     -0.002343    1.000488     0.002343       0.003162
1000000    -0.000398    0.998595     0.000398       0.001000

the ratio |mean err| / (1/√N) is O(1) — the standard-error scaling.
doubling N halves the error by only √2; quadruple N for half error.

Each row’s mean error is within a factor of ~1 of the predicted 1/√N. That ratio being O(1) is the operational statement of the standard-error rule.

Welford is also what PyTorch’s BatchNorm and LayerNorm use under the hood for the running statistics they track during inference. It’s the same kernel: one pass, no cancellation, O(1) memory. Worth recognising in production code.

— think, then check —

Gradient noise’s standard error scales as σ/√B for batch size B. To halve the standard error, you need B to grow by a factor of 4 — so new batch size is 128.

Doubling batch size to 64 only reduces the noise by √2 ≈ 1.41×. This is why “just throw more compute at it” has diminishing returns for noise reduction: each halving of noise costs 4× the compute. ML practitioners feel this constantly — going from batch 1024 to batch 4096 cuts gradient noise by 2× but quadruples step cost.

The √N law is also why ML papers report results averaged over many seeds: any single training run’s final loss has standard error ~σ/√(eval set size) baked into it, so improvements smaller than 2σ should be checked across multiple seeds before being believed.

— think, then check —

The biased estimator s̃² = (1/N) Σ(xᵢ − x̄)² underestimates the true variance because x̄ is itself estimated from the same data — it gets pulled toward the sample, making the deviations smaller than they “really” are.

Formally: E[Σ(xᵢ − x̄)²] = (N − 1) σ². Each term involves x̄, which has covariance with xᵢ. When you expand the sum and apply linearity of expectation through it, the N² in the denominator and a Σ correction collide to give the (N − 1) coefficient. Hence s² = Σ(xᵢ − x̄)² / (N − 1) has E[s²] = σ² — unbiased exactly.

Why call this “Bessel’s correction”: F.W. Bessel pointed it out in 1825. The N vs N−1 distinction is invisible at N = 10⁶ but is a real bias at N = 10. ML monitoring code that reports running variance over short windows should care.

END OF CH.5 §1 — Random variables, expectation, variance.
Built: SampleConverge viz (pick a distribution, watch the running mean converge slowly); welford.c demonstrates the √N error law on 1M N(0,1) samples and uses the numerically stable Welford update — the same kernel inside BatchNorm. Three recall items on linearity, the √N law, and Bessel’s correction.
Coming next: §5.2 — The Gaussian and the Central Limit Theorem.