Random variables, expectation, variance
Every quantity in a deployed ML system is a random variable. The loss is random because the input batch is sampled. The gradient is random because the loss is. A neural network’s activations are random because the weights were trained on random data. The convergence rate of training is random; whether a prompt produces a particular completion is random; whether an attention head fires on a token is random. To reason about any of it usefully, you need to talk about distributions — what values appear with what frequency — and to summarise those distributions with two numbers: expectation (where the mass is) and variance (how spread out it is). That’s all this section establishes. Everything else in Part II builds on it.
A random variable is a number with uncertain value
Formally, a random variable is a number-valued function on a probability space. Operationally, that mathematical scaffolding is irrelevant — a random variable is just a number whose value you don’t know in advance, characterized entirely by its distribution — for each possible value, the probability (or density) of seeing it.
The most important distributions for ML:
| density / pmf | E[X] | Var(X) | |
|---|---|---|---|
| Bernoulli(p) | P(X=1) = p, P(X=0) = 1−p | p | p(1−p) |
| Uniform(a, b) | 1/(b−a) on [a, b] | (a+b)/2 | (b−a)²/12 |
| Normal N(μ, σ²) | (1/√(2πσ²)) exp(−(x−μ)²/2σ²) | μ | σ² |
| Exponential(λ) | λ e^(−λx) on x ≥ 0 | 1/λ | 1/λ² |
We’ll come back to the normal in §5.2 because it’s special. For now, every one of these is “a number whose value follows this rule.” Write X ∼ N(0, 1) to say “X is distributed as a standard normal.”
Expectation — the centre of mass
The expectation of a random variable is its long-run average:
Geometrically, E[X] is the centre of mass of the distribution — if you imagine the density as a physical object, this is where it balances. The notation is suggestive: think “expected value of X.”
Two facts about expectation worth keeping close:
- Linearity is unconditional. E[aX + bY] = aE[X] + bE[Y] for any random variables X and Y, with no assumption about independence. This is the single most useful operational property — you can pull expectations through any linear combination, every time.
- Functions of X are not linear in X. E[f(X)] ≠ f(E[X]) in general. For instance, E[X²] ≠ (E[X])²; the difference between them is exactly the variance (more on that in a moment).
The integral / sum that defines expectation is linear in its argument: ∫ (x + y) · p(x, y) dx dy = ∫ x · p(x, y) dx dy + ∫ y · p(x, y) dx dy = E[X] + E[Y]. No factorisation of p(x, y) was needed — joint probabilities just get summed over, regardless of whether X and Y are correlated.
Application: the expected number of “fixed points” in a random permutation of n items is n · (1/n) = 1, regardless of n. The number of fixed points is a sum of n highly-correlated indicator variables (the i-th fixed-point indicator); their joint distribution is messy, but linearity of expectation just adds their individual means. (Same trick gives the expected number of edges in a random graph, the expected longest increasing subsequence to within constants, etc.)
In ML: linearity is why the gradient of a sum of losses is the sum of the gradients. ∇E[L(θ; x)] = E[∇L(θ; x)] — gradient and expectation commute because expectation is linear. This is the math identity that makes stochastic gradient descent legitimate: the expected SGD direction equals the true gradient of the average loss.
Variance — the spread around the centre
The variance of X is the average squared distance from the mean:
The variance is always non-negative, and it’s zero exactly when X is constant. Its square root, standard deviation σ = √Var(X), is more interpretable because it has the same units as X (variance of a metres-valued RV is in m²; standard deviation in m).
Operational rules:
- Scaling: Var(aX) = a² Var(X). (The a² rather than |a| is because variance is a squared quantity.) Equivalently, the standard deviation just scales: σ(aX) = |a| · σ(X).
- Sum of independents: If X and Y are independent, Var(X + Y) = Var(X) + Var(Y). Variances add for independents. This is the variance partner of linearity of expectation, but it requires independence.
- Centring: Var(X + c) = Var(X) for any constant c. Shifting doesn’t change spread.
Estimating E[X] and Var(X) from samples — the √N law
In practice we usually don’t know E[X] and Var(X) analytically. We have N samples x_1, …, x_N drawn independently from the distribution. The standard estimators:
The N − 1 in the denominator (rather than N) is Bessel’s correction — it makes the estimator unbiased, accounting for the fact that x̄ is itself an estimate. The intuition: you “spent” one degree of freedom estimating the mean, leaving only N−1 independent deviations to estimate the variance from.
The classical Law of Large Numbers says x̄ → E[X] as N → ∞. The interesting part is how fast. The standard error of the sample mean is:
This √N law is everywhere in ML:
- Monte Carlo integration: error decreases as 1/√N regardless of dimension. (That dimension-independence is why Monte Carlo is preferred over deterministic quadrature in high-D.)
- Stochastic gradient descent: the noise on each step has standard error σ_∇/√B for batch size B. Quadrupling the batch halves the gradient noise.
- Confidence intervals around any reported metric: loss, accuracy, BLEU, perplexity all carry a ±σ/√N uncertainty that depends on the eval-set size.
To get one extra decimal digit of precision, you need to multiply N by 100. That’s the iron rule.
Drag the distribution; click +10k or play. Watch how slowly the sample mean (rust line) converges on the true mean (orange dashed line). At N = 1,000,000, the error is still ~10⁻³. To halve that error, you need 4 million samples. The viz is the √N law made tactile.
Now make it run — Welford’s streaming estimator
Computing variance from samples has a classic numerical hazard. The “naïve” formula Var = E[X²] − E[X]² works fine in math but in float arithmetic it subtracts two nearly-equal large numbers when the data is concentrated. Catastrophic cancellation (Ch.3 §1) destroys the variance estimate.
Welford’s algorithm is the standard fix. Each new sample updates a running mean and a running “sum of squared deviations from the current mean”; no cancellation, one pass, O(1) memory:
double delta = x - w->mean;
w->mean += delta / w->n;
double delta2 = x - w->mean;
w->M2 += delta * delta2;
}
static double sample_variance(const Welford* w) {
return w->n > 1 ? w->M2 / (w->n - 1) : 0.0;
}
int main(void) {
Welford w = { 0, 0.0, 0.0 };
The kernel runs the estimator on 1,000,000 standard-normal samples and prints the mean/variance estimate at N = 10, 100, 1,000, …, 1,000,000 alongside the predicted standard error 1/√n:
Welford streaming estimator on N(0, 1)
N sample mean sample var |mean err| 1/√N
10 0.194984 0.899576 0.194984 0.316228
100 -0.072243 0.867276 0.072243 0.100000
1000 0.017647 1.030986 0.017647 0.031623
10000 -0.005975 1.029293 0.005975 0.010000
100000 -0.002343 1.000488 0.002343 0.003162
1000000 -0.000398 0.998595 0.000398 0.001000
the ratio |mean err| / (1/√N) is O(1) — the standard-error scaling.
doubling N halves the error by only √2; quadruple N for half error.
Each row’s mean error is within a factor of ~1 of the predicted 1/√N. That ratio being O(1) is the operational statement of the standard-error rule.
Welford is also what PyTorch’s BatchNorm and LayerNorm use under the hood for the running statistics they track during inference. It’s the same kernel: one pass, no cancellation, O(1) memory. Worth recognising in production code.
Gradient noise’s standard error scales as σ/√B for batch size B. To halve the standard error, you need B to grow by a factor of 4 — so new batch size is 128.
Doubling batch size to 64 only reduces the noise by √2 ≈ 1.41×. This is why “just throw more compute at it” has diminishing returns for noise reduction: each halving of noise costs 4× the compute. ML practitioners feel this constantly — going from batch 1024 to batch 4096 cuts gradient noise by 2× but quadruples step cost.
The √N law is also why ML papers report results averaged over many seeds: any single training run’s final loss has standard error ~σ/√(eval set size) baked into it, so improvements smaller than 2σ should be checked across multiple seeds before being believed.
The biased estimator s̃² = (1/N) Σ(xᵢ − x̄)² underestimates the true variance because x̄ is itself estimated from the same data — it gets pulled toward the sample, making the deviations smaller than they “really” are.
Formally: E[Σ(xᵢ − x̄)²] = (N − 1) σ². Each term involves x̄, which has covariance with xᵢ. When you expand the sum and apply linearity of expectation through it, the N² in the denominator and a Σ correction collide to give the (N − 1) coefficient. Hence s² = Σ(xᵢ − x̄)² / (N − 1) has E[s²] = σ² — unbiased exactly.
Why call this “Bessel’s correction”: F.W. Bessel pointed it out in 1825. The N vs N−1 distinction is invisible at N = 10⁶ but is a real bias at N = 10. ML monitoring code that reports running variance over short windows should care.
END OF CH.5 §1 — Random variables, expectation, variance.
Built: SampleConverge viz (pick a distribution, watch the running mean converge slowly); welford.c demonstrates the √N error law on 1M N(0,1) samples and uses the numerically stable Welford update — the same kernel inside BatchNorm. Three recall items on linearity, the √N law, and Bessel’s correction.
Coming next: §5.2 — The Gaussian and the Central Limit Theorem.