Random variables, expectation, variance

Section 5.1

Random variables, expectation, variance

Every quantity in a deployed ML system is a random variable. The loss is random because the input batch is sampled. The gradient is random because the loss is. A neural network’s activations are random because the weights were trained on random data. The convergence rate of training is random; whether a prompt produces a particular completion is random; whether an attention head fires on a token is random. To reason about any of it usefully, you need to talk about distributions — what values appear with what frequency — and to summarise those distributions with two numbers: expectation (where the mass is) and variance (how spread out it is). That’s all this section establishes. Everything else in Part II builds on it.

A random variable is a number with uncertain value

Formally, a random variable core term A function from a sample space (the set of possible outcomes of some random process) to ℝ. Notation: a capital letter like X. Examples: X = a die roll, X = the loss of one training batch, X = ‖∇L‖ at step t. The 'random' part is that we don't know which outcome will occur; the 'variable' is that we can describe it by its distribution. Then → now: same definition since Kolmogorov, but constantly invoked in modern ML. PyTorch's torch.distributions module is essentially a library of random variables; sampling, log-likelihoods, KL divergence, etc. is a number-valued function on a probability space. Operationally, that mathematical scaffolding is irrelevant — a random variable is just a number whose value you don’t know in advance, characterized entirely by its distribution — for each possible value, the probability (or density) of seeing it.

The most important distributions for ML:

	density / pmf	E[X]	Var(X)
Bernoulli(p)	P(X=1) = p, P(X=0) = 1−p	p	p(1−p)
Uniform(a, b)	1/(b−a) on [a, b]	(a+b)/2	(b−a)²/12
Normal N(μ, σ²)	(1/√(2πσ²)) exp(−(x−μ)²/2σ²)	μ	σ²
Exponential(λ)	λ e^(−λx) on x ≥ 0	1/λ	1/λ²

We’ll come back to the normal in §5.2 because it’s special. For now, every one of these is “a number whose value follows this rule.” Write X ∼ N(0, 1) to say “X is distributed as a standard normal.”

Expectation — the centre of mass

The expectation of a random variable is its long-run average:

E[X] = ∫ x · p(x) dx (continuous) = Σ x · P(X = x) (discrete)

Geometrically, E[X] is the centre of mass of the distribution — if you imagine the density as a physical object, this is where it balances. The notation is suggestive: think “expected value of X.”

Two facts about expectation worth keeping close:

Linearity is unconditional. E[aX + bY] = aE[X] + bE[Y] for any random variables X and Y, with no assumption about independence. This is the single most useful operational property — you can pull expectations through any linear combination, every time.
Functions of X are not linear in X. E[f(X)] ≠ f(E[X]) in general. For instance, E[X²] ≠ (E[X])²; the difference between them is exactly the variance (more on that in a moment).

— think, then check —

The integral / sum that defines expectation is linear in its argument: ∫ (x + y) · p(x, y) dx dy = ∫ x · p(x, y) dx dy + ∫ y · p(x, y) dx dy = E[X] + E[Y]. No factorisation of p(x, y) was needed — joint probabilities just get summed over, regardless of whether X and Y are correlated.

Application: the expected number of “fixed points” in a random permutation of n items is n · (1/n) = 1, regardless of n. The number of fixed points is a sum of n highly-correlated indicator variables (the i-th fixed-point indicator); their joint distribution is messy, but linearity of expectation just adds their individual means. (Same trick gives the expected number of edges in a random graph, the expected longest increasing subsequence to within constants, etc.)

In ML: linearity is why the gradient of a sum of losses is the sum of the gradients. ∇E[L(θ; x)] = E[∇L(θ; x)] — gradient and expectation commute because expectation is linear. This is the math identity that makes stochastic gradient descent legitimate: the expected SGD direction equals the true gradient of the average loss.

↳ §5.1 expectation

Variance — the spread around the centre

The variance of X is the average squared distance from the mean:

Var(X) = E[ (X − E[X])² ] = E[X²] − (E[X])² ← equivalent algebraic form, useful in proofs

The variance is always non-negative, and it’s zero exactly when X is constant. Its square root, standard deviation σ = √Var(X), is more interpretable because it has the same units as X (variance of a metres-valued RV is in m²; standard deviation in m).

Operational rules:

Scaling: Var(aX) = a² Var(X). (The a² rather than |a| is because variance is a squared quantity.) Equivalently, the standard deviation just scales: σ(aX) = |a| · σ(X).
Sum of independents: If X and Y are independent, Var(X + Y) = Var(X) + Var(Y). Variances add for independents. This is the variance partner of linearity of expectation, but it requires independence.
Centring: Var(X + c) = Var(X) for any constant c. Shifting doesn’t change spread.

Estimating E[X] and Var(X) from samples — the √N law

In practice we usually don’t know E[X] and Var(X) analytically. We have N samples x_1, …, x_N drawn independently from the distribution. The standard estimators:

x̄ = (1/N) Σᵢ xᵢ ← sample mean s² = (1/(N − 1)) Σᵢ (xᵢ − x̄)² ← unbiased sample variance

The N − 1 in the denominator (rather than N) is Bessel’s correction — it makes the estimator unbiased, accounting for the fact that x̄ is itself an estimate. The intuition: you “spent” one degree of freedom estimating the mean, leaving only N−1 independent deviations to estimate the variance from.

The classical Law of Large Numbers says x̄ → E[X] as N → ∞. The interesting part is how fast. The standard error of the sample mean is:

SE(x̄) = σ / √N ⇒ |x̄ − E[X]| is typically O(1/√N)

This √N law is everywhere in ML:

Monte Carlo integration: error decreases as 1/√N regardless of dimension. (That dimension-independence is why Monte Carlo is preferred over deterministic quadrature in high-D.)
Stochastic gradient descent: the noise on each step has standard error σ_∇/√B for batch size B. Quadrupling the batch halves the gradient noise.
Confidence intervals around any reported metric: loss, accuracy, BLEU, perplexity all carry a ±σ/√N uncertainty that depends on the eval-set size.

To get one extra decimal digit of precision, you need to multiply N by 100. That’s the iron rule.

N 0

x̄ (sample mean) —

true E[X] 0.5000

mean error —

SE = σ/√N —

s² (sample var) —

true Var(X) 0.0833

The orange dashed line is the true E[X]; the rust line is the running sample mean. The two converge as N grows — and the typical error |x̄ − E[X]| shrinks like σ/√N. Slow as a rock past N ~ 10⁴: to halve the error you must quadruple the samples. That √N law shows up everywhere in ML — gradient noise, MCMC chains, confidence intervals.

Pick a distribution; click +1 or hit play. The histogram of seen samples builds toward the true density; the running sample mean and variance drift to the true values. The interesting thing isn't that they converge — it's how slowly.

Drag the distribution; click +10k or play. Watch how slowly the sample mean (rust line) converges on the true mean (orange dashed line). At N = 1,000,000, the error is still ~10⁻³. To halve that error, you need 4 million samples. The viz is the √N law made tactile.

Now make it run — Welford’s streaming estimator

Computing variance from samples has a classic numerical hazard. The “naïve” formula Var = E[X²] − E[X]² works fine in math but in float arithmetic it subtracts two nearly-equal large numbers when the data is concentrated. Catastrophic cancellation (Ch.3 §1) destroys the variance estimate.

Welford’s algorithm standard practice A one-pass streaming algorithm for sample mean and sample variance that avoids catastrophic cancellation. Updates running mean and running 'sum of squared deviations from mean' as each new sample arrives. Numerically stable; widely used in monitoring, statistics libraries, GPU kernels for running normalization stats. Then → now: the algorithm is from Welford 1962. What changed is that it's now the standard implementation in every numerical library — PyTorch's running_mean / running_var in BatchNorm and LayerNorm use exactly this update, kernel-level. is the standard fix. Each new sample updates a running mean and a running “sum of squared deviations from the current mean”; no cancellation, one pass, O(1) memory:

welford.c (loop body) C · Welford streaming update

    double delta = x - w->mean;
    w->mean += delta / w->n;
    double delta2 = x - w->mean;
    w->M2 += delta * delta2;
}
static double sample_variance(const Welford* w) {
    return w->n > 1 ? w->M2 / (w->n - 1) : 0.0;
}

int main(void) {
    Welford w = { 0, 0.0, 0.0 };

The kernel runs the estimator on 1,000,000 standard-normal samples and prints the mean/variance estimate at N = 10, 100, 1,000, …, 1,000,000 alongside the predicted standard error 1/√n:

Welford streaming estimator on N(0, 1)
N          sample mean  sample var   |mean err|     1/√N
10         0.194984     0.899576     0.194984       0.316228
100        -0.072243    0.867276     0.072243       0.100000
1000       0.017647     1.030986     0.017647       0.031623
10000      -0.005975    1.029293     0.005975       0.010000
100000     -0.002343    1.000488     0.002343       0.003162
1000000    -0.000398    0.998595     0.000398       0.001000

the ratio |mean err| / (1/√N) is O(1) — the standard-error scaling.
doubling N halves the error by only √2; quadruple N for half error.

Each row’s mean error is within a factor of ~1 of the predicted 1/√N. That ratio being O(1) is the operational statement of the standard-error rule.

Welford is also what PyTorch’s BatchNorm and LayerNorm use under the hood for the running statistics they track during inference. It’s the same kernel: one pass, no cancellation, O(1) memory. Worth recognising in production code.

— think, then check —

Gradient noise’s standard error scales as σ/√B for batch size B. To halve the standard error, you need B to grow by a factor of 4 — so new batch size is 128.

Doubling batch size to 64 only reduces the noise by √2 ≈ 1.41×. This is why “just throw more compute at it” has diminishing returns for noise reduction: each halving of noise costs 4× the compute. ML practitioners feel this constantly — going from batch 1024 to batch 4096 cuts gradient noise by 2× but quadruples step cost.

The √N law is also why ML papers report results averaged over many seeds: any single training run’s final loss has standard error ~σ/√(eval set size) baked into it, so improvements smaller than 2σ should be checked across multiple seeds before being believed.

↳ §5.1 √N law

— think, then check —

The biased estimator s̃² = (1/N) Σ(xᵢ − x̄)² underestimates the true variance because x̄ is itself estimated from the same data — it gets pulled toward the sample, making the deviations smaller than they “really” are.

Formally: E[Σ(xᵢ − x̄)²] = (N − 1) σ². Each term involves x̄, which has covariance with xᵢ. When you expand the sum and apply linearity of expectation through it, the N² in the denominator and a Σ correction collide to give the (N − 1) coefficient. Hence s² = Σ(xᵢ − x̄)² / (N − 1) has E[s²] = σ² — unbiased exactly.

Why call this “Bessel’s correction”: F.W. Bessel pointed it out in 1825. The N vs N−1 distinction is invisible at N = 10⁶ but is a real bias at N = 10. ML monitoring code that reports running variance over short windows should care.

↳ §5.1 sample variance + linearity

END OF CH.5 §1 — Random variables, expectation, variance.
Built: SampleConverge viz (pick a distribution, watch the running mean converge slowly); welford.c demonstrates the √N error law on 1M N(0,1) samples and uses the numerically stable Welford update — the same kernel inside BatchNorm. Three recall items on linearity, the √N law, and Bessel’s correction.
Coming next: §5.2 — The Gaussian and the Central Limit Theorem.