PRETRAINING
Section 16.3

Chinchilla scaling laws — the compute-optimal frontier

Suppose you have $10M of compute and want to train the best possible LLM. Should you train a 200B model on 200B tokens, or a 30B model on 1.5T tokens? Both use roughly the same compute. The answer was unclear until 2022, when DeepMind’s Chinchilla paper (Hoffmann 2022) ran 400+ training runs at varying (N, D) and fit a clean empirical law. The result: N and D should grow TOGETHER, and for compute-optimal training, D ≈ 20 · N. This invalidated the prior Kaplan 2020 prescription (“more params, fewer tokens”) and reshaped every subsequent model release. This section derives the scaling law, runs a kernel that fits it to synthetic data and solves the compute-optimal frontier, and explains why modern releases (Llama 3, Qwen 3) intentionally over-train past Chinchilla optimum.

The scaling law

The empirical Chinchilla fit is:

    L(N, D) = E + A / N^α + B / D^β

Empirical Hoffmann 2022 parameters (fit on 400+ training runs):

    E = 1.69    (irreducible loss; ≈ entropy of natural text)
    A = 406.4   α = 0.34
    B = 410.7   β = 0.28

Interpretation:

  • Each term shrinks as N or D grows: increasing either reduces loss.
  • The exponents α (≈0.34) and β (≈0.28) are small: diminishing returns to BOTH.
  • α and β are CLOSE to each other (~0.06 apart): N and D contribute almost symmetrically. This is why compute-optimal training balances N and D.

The scaling law is one of the most-cited empirical results in modern ML. It says: both model size and dataset size are diminishing-returns resources, and they should be increased together. Hoarding one while ignoring the other is wasteful.

Compute-optimal frontier

For a fixed compute budget C, what’s the best (N, D)?

Compute model:

    C = 6 · N · D    (6 FLOPs per parameter per token: 2 forward + 4 backward)

Optimisation: minimise L(N, D) subject to 6 · N · D = C.

Direct substitution: D = C / (6N), so

    L(N) = E + A / N^α + B · (6N/C)^β
         = E + A · N^{-α} + B · (6/C)^β · N^β

    dL/dN = -α · A · N^{-(α+1)} + β · B · (6/C)^β · N^{β-1} = 0

Solving:

    α · A / N^{α+1} = β · B · (6/C)^β · N^{β-1}
    N^{α+β} = (α · A) / (β · B · (6/C)^β)
    N* = [α · A / (β · B)]^{1/(α+β)} · (C/6)^{β/(α+β)}

And the corresponding D* = C / (6 · N*):

    D* = [β · B / (α · A)]^{1/(α+β)} · (C/6)^{α/(α+β)}

Dividing the two:

    D*/N* = [β · B / (α · A)]^{2/(α+β)} · (C/6)^{(α-β)/(α+β)}

The exponent on C is (α-β)/(α+β) ≈ 0.097 with Hoffmann's values, so the ratio drifts only slowly with budget (it is exactly constant when α = β). Hoffmann's headline recommendation is D*/N* ≈ 20.

The result is striking. The compute-optimal D/N ratio depends on the budget only through the weak factor (C/6)^((α-β)/(α+β)), about C^0.097 for Chinchilla’s α=0.34 and β=0.28; with α = β it would not depend on the budget at all. Everything else is set by the scaling-law parameters. Hoffmann’s headline recommendation is ~20×: optimal training feeds about 20 tokens per parameter.

Now make it run

The kernel below generates synthetic (N, D, L) data from the empirical law plus noise, fits the law’s five parameters (E, A, α, B, β) by gradient descent on the squared residual, and computes the compute-optimal frontier across 8 budgets from 10¹⁸ to 10²⁵ FLOPs:

chinchilla.c — compute_optimal C · Chinchilla law fit + compute-optimal sweep
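    /* Gradient descent on the mean squared residual (predicted - observed loss). */
    int steps = 200000;
    for (int s = 0; s < steps; s++) {
        /* Gradient accumulators for E, A, alpha, B, beta. */
        double gE = 0, gA = 0, gAl = 0, gB = 0, gBe = 0;
        for (int i = 0; i < n; i++) {
            double Nv = N_arr[i], Dv = D_arr[i], Lt = L_arr[i];
            double Na = pow(Nv, p->alpha), Db = pow(Dv, p->beta);
            double Lp = p->E + p->A / Na + p->B / Db;   /* predicted loss */
            double r = Lp - Lt;                         /* residual */
            gE  += r;                                   /* dLp/dE     = 1 */
            gA  += r / Na;                              /* dLp/dA     = N^-alpha */
            gAl += r * (-p->A) / Na * log(Nv);          /* dLp/dalpha = -A N^-alpha ln N */
            gB  += r / Db;                              /* dLp/dB     = D^-beta */
            gBe += r * (-p->B) / Db * log(Dv);          /* dLp/dbeta  = -B D^-beta ln D */
        }
        p->E     -= lr      * gE  / n;
        p->A     -= lr * 50 * gA  / n;                  /* A takes a 50x larger step than E */
        /* excerpt ends here; the updates for alpha, B, beta are not shown */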

Output:

Compute-optimal frontier  (FLOPs C = 6 · N · D)
    C (FLOPs)       N*           D*          D*/N*    L(N*,D*)
    1e+18         1.20e+08    1.38e+09    11.5     3.393
    1e+19         3.52e+08    4.73e+09    13.4     2.940
    1e+20         1.03e+09    1.62e+10    15.7     2.611
    1e+21         3.01e+09    5.53e+10    18.3     2.371
    1e+22         8.40e+09    1.98e+11    23.6     2.197
    1e+23         2.46e+10    6.79e+11    27.6     2.070
    1e+24         7.19e+10    2.32e+12    32.3     1.978
    1e+25         2.10e+11    7.93e+12    37.7     1.911

The D*/N* ratio ranges from ~12 (small models) to ~38 (large models), bracketing the empirical “~20×” Chinchilla number. The drift away from a strictly constant ratio comes from the asymmetry in α vs β: the ratio scales as C^((α-β)/(α+β)), and because α > β that exponent is positive, so the optimum shifts toward more tokens per parameter at higher budgets.

— think, then check —

Term meanings:

  • E: the irreducible loss. The entropy of natural language: even an infinitely large model trained on infinite data couldn’t get below this. Empirically ~1.69 nats per token (perplexity e^1.69 ≈ 5.4).
  • A / N^α: the model-capacity term. As N → ∞, this term vanishes; the model becomes as expressive as it needs to be. The exponent α ≈ 0.34 means doubling N reduces this contribution by ~21%.
  • B / D^β: the data-coverage term. As D → ∞, this term vanishes; the model has seen enough data to learn the distribution. The exponent β ≈ 0.28 means doubling D reduces this contribution by ~18%.

70B at 1.4T tokens (Chinchilla-optimal):

C = 6 · 70e9 · 1.4e12 = 5.9 × 10²³ FLOPs

L ≈ 1.69 + 406.4 / (70e9)^0.34 + 410.7 / (1.4e12)^0.28 ≈ 1.69 + 0.083 + 0.163 ≈ 1.94

175B at 560B tokens (similar compute, GPT-3-style):

C = 6 · 175e9 · 560e9 = 5.9 × 10²³ FLOPs (same)

L ≈ 1.69 + 406.4 / (175e9)^0.34 + 410.7 / (560e9)^0.28 ≈ 1.69 + 0.061 + 0.211 ≈ 1.96

The verdict: the 175B+560B model is SLIGHTLY worse than 70B+1.4T at the same compute. The 175B model “wastes” compute on extra parameters without enough data to leverage them.

For real GPT-3 (175B at 300B tokens), L ≈ 1.69 + 0.061 + 0.251 ≈ 2.00, noticeably worse than either of the above, though at roughly half the compute (6 · 175e9 · 300e9 ≈ 3.2 × 10²³ FLOPs). This is the empirical case that broke the old Kaplan scaling prescription.

Why GPT-3 was under-trained

The gap between Kaplan’s and Hoffmann’s prescriptions came down to a methodological subtlety. Kaplan’s experiments used fixed learning-rate schedules at all scales. Hoffmann’s used carefully tuned learning-rate schedules calibrated per (N, D) pair. The tuned schedules revealed that the data-axis exponent β was much higher than Kaplan’s untuned fit suggested. The “correct” compute-optimal recipe shifted from “parameters dominate” to “balanced N and D.”

Llama 2 (2023) trained its 7B, 13B, 34B, and 70B models on ~2T tokens each, which puts the 70B (D/N ≈ 29) close to the Chinchilla ratio and the smaller models already well beyond it. Llama 3 (2024) went further: 8B and 70B trained on ~15T tokens, well PAST Chinchilla optimum.

Why modern training intentionally over-trains

If Chinchilla is right, why train past D ≈ 20·N? Because inference cost matters too.

    Total lifecycle cost = training compute + inference compute · (lifetime requests)

Training compute is paid ONCE. Inference compute is paid PER REQUEST, across billions of requests over the model's lifetime.

For a 70B model serving 1B requests/day for 2 years:

    Training cost:   ~$5M (1 month on 2K H100s)
    Inference cost:  ~$0.001/req × 1B/day × 730 days = $730M

Inference dominates by ~150×.

Optimisation lesson: for a frontier deployed model, you want the SMALLEST model that hits a given quality target, even if it costs more compute to train.

  • Chinchilla-optimal: for compute C, pick the N* that minimises training loss.
  • Inference-optimal: for required quality L_target, pick the SMALLEST N that achieves it, by training that smaller model on MORE tokens.

The implication: train smaller models for LONGER than Chinchilla suggests. Llama 3 70B trained on 15T tokens (D/N ≈ 215, vs Chinchilla's ~20): roughly 10× past the Chinchilla optimum. At request volumes like the one above, the extra training compute is recovered in inference savings within weeks to months of deployment.
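The lifecycle arithmetic above fits in two small helpers. This is an illustrative sketch only: the function names are hypothetical, and every dollar figure you pass in is an input assumption, not a measured cost.

```c
#include <assert.h>

/* Inference spend in dollars over a deployment window. */
double inference_cost(double usd_per_request, double requests_per_day, double days) {
    assert(usd_per_request >= 0.0 && requests_per_day >= 0.0 && days >= 0.0);
    return usd_per_request * requests_per_day * days;
}

/* Total lifecycle cost: one-time training plus cumulative inference. */
double lifecycle_cost(double training_usd, double usd_per_request,
                      double requests_per_day, double days) {
    assert(training_usd >= 0.0);
    return training_usd + inference_cost(usd_per_request, requests_per_day, days);
}

/* How many times larger the inference bill is than the training bill. */
double inference_to_training_ratio(double training_usd, double usd_per_request,
                                   double requests_per_day, double days) {
    assert(training_usd > 0.0);
    return inference_cost(usd_per_request, requests_per_day, days) / training_usd;
}
```

With the figures from the example (5e6, 0.001, 1e9, 730), the total comes to $735M and the ratio to 146, which is the "~150×" quoted above. Varying `days` shows how quickly the balance tips: the inference bill passes the training bill after only five days at that volume.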

Over-training is the dominant recipe for modern deployed models. The “Chinchilla optimum” is what you’d choose if training compute were the only resource. In the real world, inference cost is a much larger budget item, so you trade extra training compute for a smaller deployed model.

— think, then check —

Setup: minimize L(N, D) = E + A/N^α + B/D^β subject to C = 6·N·D.

Substitute D = C/(6N), express L as function of N:

L(N) = E + A·N^(-α) + B·(6/C)^β · N^β

Take derivative dL/dN and set to zero:

-α·A·N^(-(α+1)) + β·B·(6/C)^β · N^(β-1) = 0

α·A / N^(α+1) = β·B·(6/C)^β · N^(β-1)

N^(α+β) = (α·A) / (β·B·(6/C)^β)

N* = [α·A / (β·B)]^(1/(α+β)) · (C/6)^(β/(α+β))

By symmetric derivation (or just substituting back into D = C/(6N)):

D* = [β·B / (α·A)]^(1/(α+β)) · (C/6)^(α/(α+β))

The ratio:

D*/N* = { [β·B / (α·A)]^(1/(α+β)) · (C/6)^(α/(α+β)) } / { [α·A / (β·B)]^(1/(α+β)) · (C/6)^(β/(α+β)) }

= [β·B / (α·A)]^(2/(α+β)) · (C/6)^((α-β)/(α+β))

Key observation: the exponent (α-β)/(α+β) on (C/6) is SMALL when α ≈ β (Chinchilla: α=0.34, β=0.28, so (α-β)/(α+β) = 0.06/0.62 ≈ 0.097), and exactly zero when α = β.

So D*/N* factors into a constant prefactor and a slowly varying budget term:

D*/N* = [β·B / (α·A)]^(2/(α+β)) · (C/6)^0.097

Computing the prefactor with Chinchilla numbers: β·B / (α·A) = (0.28 · 410.7) / (0.34 · 406.4) = 115.0 / 138.2 = 0.832, and 0.832^(2/0.62) = 0.832^3.226 ≈ 0.55.

The prefactor alone is below 1, but it is not the ratio. A small exponent on a huge base is not negligible: at C = 10²², (C/6)^0.097 = (1.67 × 10²¹)^0.097 ≈ 113, so with these rounded constants the closed form gives D*/N* ≈ 0.55 · 113 ≈ 63. Each 10× of compute multiplies the ratio by 10^0.097 ≈ 1.25.

Hoffmann’s ~20× rule of thumb comes from the paper’s other two estimation approaches, which find N* and D* each scaling close to C^0.5; the parametric fit used here leans further toward data. The kernel’s table above differs again, presumably because it evaluates the frontier with the parameters its own fit recovered from noisy synthetic data rather than with these rounded constants.

The qualitative lesson stands: the ratio is much more sensitive to the fitted constants and exponents than to C. Slightly different fits (which different papers report) move the optimal ratio by a factor of a few, while a 10× change in budget moves it by only ~25%.

— think, then check —

The economic case:

Total cost = (one-time training cost) + (per-request inference cost) × (lifetime requests).

Inference cost per token is proportional to model size N (roughly 2N FLOPs per generated token). A smaller model serves cheaper FOREVER.

If we hold model size fixed (70B) and increase D from 2T to 15T:

  • Training cost: 7.5× higher (linear in D).
  • Inference cost: unchanged.
  • Loss: marginally better (extra 13T tokens reduce loss only slightly past Chinchilla optimum).

Alternative: train a SMALLER (e.g., 30B) model on the same compute as Llama 3 70B’s 15T-token run. By Chinchilla, the smaller model would have similar loss to a Chinchilla-optimal 70B (because the extra data compensates). Result: lower inference cost forever.

The breakeven calc (toy version):

Suppose a near-Chinchilla 70B (2T tokens) costs $X to train, so over-training to 15T costs 7.5X: an extra 6.5X.

In FLOPs: the 2T-token run is 6 · 70e9 · 2e12 = 8.4e23 FLOPs, so the extra training is 6.5 · 8.4e23 ≈ 5.5e24 FLOPs.

Inference per token: the 70B model uses 2·70e9 = 1.4e11 FLOPs/token.

If the over-trained 70B achieves the same quality as a 100B Chinchilla-optimal model (2e11 FLOPs/token), each served token saves 2·(100−70)e9 = 6e10 FLOPs.

Roughly: pricing training and inference FLOPs equally, the extra training cost is recovered once the model has served 5.5e24 / 6e10 ≈ 9e13 tokens (about 90T). At 1B requests/day and an assumed ~1,000 tokens per request, that is 1T tokens/day. Breakeven: about three months.
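The breakeven point can be computed directly in FLOP terms. The helper below is a toy sketch with hypothetical names; it assumes a training FLOP and an inference FLOP cost the same, and that inference costs 2 FLOPs per parameter per served token, as stated earlier in this section.

```c
#include <assert.h>

/* Training FLOPs under the C = 6*N*D compute model. */
double training_flops(double N, double D) {
    assert(N > 0.0 && D > 0.0);
    return 6.0 * N * D;
}

/* Served tokens needed before a smaller model's inference savings repay
   the extra training FLOPs spent on it.
   Assumes 2 FLOPs per parameter per served token and equal cost per FLOP
   for training and inference. */
double breakeven_tokens(double extra_training_flops, double N_small, double N_big) {
    assert(extra_training_flops >= 0.0);
    assert(N_small > 0.0 && N_big > N_small);
    double saved_per_token = 2.0 * (N_big - N_small);
    return extra_training_flops / saved_per_token;
}
```

For the toy case in the text, the extra training is `training_flops(70e9, 15e12) - training_flops(70e9, 2e12)` and the two model sizes are 70e9 and 100e9. Dividing the result by an assumed daily token volume turns it into days to breakeven; real deployments would also need to account for inference running at different hardware utilisation than training.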

When over-training stops paying:

  • Diminishing returns on data exponent. Past D ≈ 50·N, the loss curve flattens dramatically. Each additional doubling of D buys less.
  • Data quality ceiling. You run out of “high-quality” tokens. Adding lower-quality data starts to hurt model quality.
  • Model lifetime ≪ inference horizon. If you’re going to retrain in 6 months, the inference savings only have 6 months to compound. Less benefit from over-training.

Empirically: Llama 3’s 15T (D/N ≈ 215) is past the sweet spot but close. Llama 4-class will likely land around 20-50T tokens for 70B-class — settling at the breakeven of compute cost vs lifetime inference savings.

END OF CH.16 — Pretraining.
§1 (the loop: NTL, mixed precision, gradient accumulation, all-reduce) · §2 (the data pipeline: Common Crawl → FineWeb, MinHash dedup, why quality matters) · §3 (Chinchilla scaling laws + the over-training rationale).

Next: Ch.17 — Mixture of Experts. Sparse activation, why a 47B-param model can be 13B compute, the load-balancing problem.