Chinchilla scaling laws — the compute-optimal frontier
Suppose you have $10M of compute and want to train the best possible LLM. Should you train a 200B model on 200B tokens, or a 30B model on 1.5T tokens? Both use roughly the same compute. The answer was unclear until 2022, when DeepMind’s Chinchilla paper (Hoffmann 2022) ran 400+ training runs at varying (N, D) and fit a clean empirical law. The result: N and D should grow TOGETHER, and for compute-optimal training, D ≈ 20 · N. This invalidated the prior Kaplan 2020 prescription (“more params, fewer tokens”) and reshaped every subsequent model release. This section derives the scaling law, runs a kernel that fits it to synthetic data and solves the compute-optimal frontier, and explains why modern releases (Llama 3, Qwen 3) intentionally over-train past Chinchilla optimum.
The scaling law
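The empirical Chinchilla fit is:
L(N, D) = E + A / N^α + B / D^β
with E = 1.69, A = 406.4, B = 410.7, α = 0.34, β = 0.28, where N is the parameter count, D is the number of training tokens, and L is the loss in nats per token.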
The scaling law is one of the most-cited empirical results in modern ML. It says that both model size and dataset size are diminishing-returns resources, and that they should be increased together. Hoarding one while ignoring the other is wasteful.
Compute-optimal frontier
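For a fixed compute budget C ≈ 6·N·D, what’s the best (N, D)? Minimizing L(N, D) under that constraint gives a closed form:
N* = [α·A / (β·B)]^(1/(α+β)) · (C/6)^(β/(α+β))
D* = [β·B / (α·A)]^(1/(α+β)) · (C/6)^(α/(α+β))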
The result is striking. The compute-optimal D/N ratio changes only slowly with budget: it scales as (C/6)^((α-β)/(α+β)), and with Chinchilla’s α=0.34 and β=0.28 that exponent is only about 0.097. Hoffmann’s headline figure at the budgets they studied is ~20×: optimal training feeds about 20 tokens per parameter.
Now make it run
The kernel below generates synthetic (N, D, L) data from the empirical law plus noise, fits the law’s parameters (E, A, α, B, β) by gradient descent on the squared residual, and computes the compute-optimal frontier across 8 budgets from 10¹⁸ to 10²⁵ FLOPs:
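    int steps = 200000;
    for (int s = 0; s < steps; s++) {
        /* Gradient accumulators of 0.5 * r^2 for E, A, alpha, B, beta. */
        double gE = 0, gA = 0, gAl = 0, gB = 0, gBe = 0;
        for (int i = 0; i < n; i++) {
            double Nv = N_arr[i], Dv = D_arr[i], Lt = L_arr[i];
            double Na = pow(Nv, p->alpha), Db = pow(Dv, p->beta);
            double Lp = p->E + p->A / Na + p->B / Db;  /* predicted loss */
            double r = Lp - Lt;                        /* residual */
            gE += r;                                   /* dLp/dE = 1 */
            gA += r / Na;                              /* dLp/dA = N^-alpha */
            gAl += r * (-p->A) / Na * log(Nv);         /* dLp/dalpha = -A * N^-alpha * ln N */
            gB += r / Db;                              /* dLp/dB = D^-beta */
            gBe += r * (-p->B) / Db * log(Dv);         /* dLp/dbeta = -B * D^-beta * ln D */
        }
        p->E -= lr * gE / n;
        p->A -= lr * 50 * gA / n;
        /* The listing is truncated here in the source: the updates for
           alpha, B, and beta (using gAl, gB, gBe) are not shown. */
    }
Output: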
Compute-optimal frontier (FLOPs C = 6 · N · D)

    C (FLOPs)   N*         D*         D*/N*   L(N*,D*)
    1e+18       1.20e+08   1.38e+09   11.5    3.393
    1e+19       3.52e+08   4.73e+09   13.4    2.940
    1e+20       1.03e+09   1.62e+10   15.7    2.611
    1e+21       3.01e+09   5.53e+10   18.3    2.371
    1e+22       8.40e+09   1.98e+11   23.6    2.197
    1e+23       2.46e+10   6.79e+11   27.6    2.070
    1e+24       7.19e+10   2.32e+12   32.3    1.978
    1e+25       2.10e+11   7.93e+12   37.7    1.911
The D*/N* ratio ranges from ~12 (small budgets) to ~38 (large budgets), passing through the empirical “~20×” Chinchilla number around 10²¹ to 10²² FLOPs. The deviation from a strictly constant ratio comes from the asymmetry between α and β: because α > β, the ratio grows as (C/6)^((α-β)/(α+β)), which pushes the optimum toward more tokens per parameter at higher budgets.
Term meanings:
- E: the irreducible loss, i.e. the entropy of natural language. Even an infinitely large model trained on infinite data couldn’t get below this. Empirically ~1.69 nats per token (~2.44 bits per token, or a perplexity of about 5.4).
- A / N^α: the model-capacity term. As N → ∞, this term vanishes; the model becomes as expressive as it needs to be. The exponent α ≈ 0.34 means doubling N reduces this contribution by ~21%.
- B / D^β: the data-coverage term. As D → ∞, this term vanishes; the model has seen enough data to learn the distribution. The exponent β ≈ 0.28 means doubling D reduces this contribution by ~18%.
70B at 1.4T tokens (Chinchilla-optimal):
C = 6 · 70e9 · 1.4e12 ≈ 5.9 × 10²³ FLOPs
L ≈ 1.69 + 406.4 / (70e9)^0.34 + 410.7 / (1.4e12)^0.28 ≈ 1.69 + 0.083 + 0.163 ≈ 1.94
175B at 560B tokens (same compute, GPT-3-style):
C = 6 · 175e9 · 560e9 ≈ 5.9 × 10²³ FLOPs (same)
L ≈ 1.69 + 406.4 / (175e9)^0.34 + 410.7 / (560e9)^0.28 ≈ 1.69 + 0.061 + 0.211 ≈ 1.96
The verdict: the 175B+560B model is SLIGHTLY worse than 70B+1.4T at the same compute. The 175B model “wastes” compute on extra parameters without enough data to leverage them.
For real GPT-3 (175B at 300B tokens, about 3.2 × 10²³ FLOPs), L ≈ 1.69 + 0.061 + 0.251 ≈ 2.00, worse than either of the above. This is the empirical case that broke the old Kaplan scaling prescription.
Why GPT-3 was under-trained
The difference between Kaplan’s and Hoffmann’s conclusions came down to a methodological subtlety. Kaplan’s experiments used a fixed learning-rate schedule at all scales. Hoffmann’s used learning-rate schedules calibrated per (N, D) pair. The calibrated schedules revealed that the data-axis exponent β was much higher than Kaplan’s fit suggested. The “correct” compute-optimal recipe shifted from “parameters dominate” to “balanced N and D.”
Llama 2 (2023) trained its 7B, 13B, 34B, and 70B models on ~2T tokens each, which puts the 70B model at D/N ≈ 29, close to Chinchilla’s ratio, and the smaller models well past it. Llama 3 (2024) went further: 8B and 70B trained on ~15T tokens, well PAST Chinchilla optimum.
Why modern training intentionally over-trains
If Chinchilla is right, why train past D ≈ 20·N? Because inference cost matters too.
Over-training is the dominant recipe for modern deployed models. The “Chinchilla optimum” is what you’d choose if training compute were the only cost. In the real world, inference is often the larger budget item over a model’s lifetime, so you trade extra training compute for a smaller deployed model.
Setup: minimize L(N, D) = E + A/N^α + B/D^β subject to C = 6·N·D.
Substitute D = C/(6N) and express L as a function of N:
L(N) = E + A·N^(-α) + B·(6/C)^β · N^β
Take the derivative dL/dN and set it to zero:
-α·A·N^(-(α+1)) + β·B·(6/C)^β · N^(β-1) = 0
α·A / N^(α+1) = β·B·(6/C)^β · N^(β-1)
N^(α+β) = (α·A) / (β·B·(6/C)^β)
N* = [α·A / (β·B)]^(1/(α+β)) · (C/6)^(β/(α+β))
By symmetric derivation (or just substituting back into D = C/(6N)):
D* = [β·B / (α·A)]^(1/(α+β)) · (C/6)^(α/(α+β))
The ratio:
D*/N* = [β·B / (α·A)]^(1/(α+β)) / [α·A / (β·B)]^(1/(α+β)) · (C/6)^((α-β)/(α+β))
= [β·B / (α·A)]^(2/(α+β)) · (C/6)^((α-β)/(α+β))
Key observation: the exponent (α-β)/(α+β) on (C/6) is SMALL when α ≈ β (Chinchilla: α=0.34, β=0.28, so (α-β)/(α+β) = 0.06/0.62 ≈ 0.097). The ratio therefore drifts slowly with budget, by a factor of 10^0.097 ≈ 1.25 per 10× of compute.
Small is not the same as negligible, though, because C/6 is enormous. Dropping the (C/6) factor leaves only the prefactor:
[β·B / (α·A)]^(2/(α+β))
Computing with Chinchilla numbers: β·B / (α·A) = (0.28 · 410.7) / (0.34 · 406.4) = 115.0 / 138.2 = 0.832, and 0.832^(2/0.62) = 0.832^3.226 ≈ 0.55. A ratio below 1 is clearly not the answer, so the budget factor has to be kept.
At C = 10²² FLOPs, (C/6)^0.097 ≈ 113, so D*/N* ≈ 0.55 · 113 ≈ 63 with these constants. That is higher than Hoffmann’s headline ~20×, which is closer to what the paper’s other estimation approaches give; the parametric fit quoted here favours smaller models trained on more tokens.
The qualitative lesson stands: the ratio is far more sensitive to the fitted constants than to C. Refitting (α, β, A, B) can move the ratio at a given budget by a factor of several, while a 10× change in budget moves it by only ~25%. It’s a constant-ish number, not a strongly budget-dependent one.
The economic case:
Total cost = (one-time training cost) + (per-request inference cost) × (lifetime requests).
Inference cost per token is proportional to model size N (roughly 2N FLOPs per generated token). A smaller model serves cheaper FOREVER.
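The cost formula can be sketched in FLOPs rather than dollars, which sidesteps hardware pricing. This is a simplification: it counts served tokens instead of requests, and it assumes 6·N·D for training and 2·N per served token for inference, as in the text. `lifetime_flops` is a hypothetical name.

```c
#include <assert.h>

/* Lifetime compute in FLOPs: one-time training plus per-token inference. */
double lifetime_flops(double N, double D, double served_tokens) {
    assert(N > 0.0 && D > 0.0 && served_tokens >= 0.0);
    double training = 6.0 * N * D;               /* paid once */
    double inference = 2.0 * N * served_tokens;  /* grows with usage */
    return training + inference;
}
```

Under these assumptions inference compute equals training compute once the model has served 3·D tokens, since 2·N·(3·D) = 6·N·D. A model deployed well beyond that point spends most of its lifetime compute on inference.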
If we hold model size fixed (70B) and increase D from 2T to 15T:
- Training cost: 7.5× higher (linear in D).
- Inference cost: unchanged.
- Loss: marginally better (extra 13T tokens reduce loss only slightly past Chinchilla optimum).
Alternative: spend the extra compute on over-training a SMALLER model (e.g., 30B) instead of a larger one. By the scaling law, the extra data can compensate for the missing parameters, so the smaller model can reach a loss similar to a Chinchilla-optimal 70B. Result: lower inference cost forever.
The breakeven calc (toy version):
Measure cost in FLOPs and assume a training FLOP and an inference FLOP cost the same. A 70B model at 2T tokens costs 6 · 70e9 · 2e12 ≈ 8.4 × 10²³ FLOPs to train; call that X. Over-training to 15T costs 7.5X, i.e. 6.5X ≈ 5.5 × 10²⁴ FLOPs extra.
Inference per token: the 70B model uses 2 · 70e9 = 1.4e11 FLOPs/token.
If the over-trained 70B achieves the same quality as a 100B Chinchilla-optimal model, each served token saves 2 · (100e9 − 70e9) = 6e10 FLOPs, which is 30% of the 100B model’s inference cost.
Breakeven: 5.5 × 10²⁴ / 6e10 ≈ 9 × 10¹³ served tokens. Frontier-deployed models serve trillions of tokens per month, so under these assumptions the extra training cost is recovered over months to years of serving rather than immediately. The payback is faster the larger the model that the over-trained one replaces.
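The toy breakeven can be sketched as one function in FLOP units. It assumes, as a simplification, that a training FLOP and an inference FLOP cost the same, with 6·N·D for training and 2·N per served token. `breakeven_tokens` and its parameter names are hypothetical.

```c
#include <assert.h>

/* Served tokens needed before per-token inference savings repay the
   extra training compute spent on over-training the smaller model.
   N_small: deployed model size; D_base, D_over: baseline and over-trained
   token counts; N_large: size of the model it replaces at equal quality. */
double breakeven_tokens(double N_small, double D_base, double D_over, double N_large) {
    assert(N_small > 0.0 && N_large > N_small);
    assert(D_base > 0.0 && D_over > D_base);
    double extra_training = 6.0 * N_small * (D_over - D_base);
    double saving_per_token = 2.0 * (N_large - N_small);
    return extra_training / saving_per_token;
}
```

The result is most sensitive to `N_large`: the bigger the model that the over-trained one can stand in for, the larger the saving per token and the sooner the extra training is repaid.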
When over-training stops paying:
- Diminishing returns on the data exponent. With β ≈ 0.28, each doubling of D shrinks the data term by only ~18%, so once that term is already small, each additional doubling buys very little absolute loss.
- Data quality ceiling. You run out of “high-quality” tokens. Adding lower-quality data starts to hurt model quality.
- Model lifetime ≪ inference horizon. If you’re going to retrain in 6 months, the inference savings only have 6 months to compound. Less benefit from over-training.
Empirically: Llama 3’s 15T (D/N ≈ 215 for the 70B model) is past the sweet spot but close. Llama 4-class will likely land around 20-50T tokens for 70B-class, settling at the breakeven of compute cost vs lifetime inference savings.
END OF CH.16 — Pretraining.
§1 (the loop: NTL, mixed precision, gradient accumulation, all-reduce) ·
§2 (the data pipeline: Common Crawl → FineWeb, MinHash dedup, why quality matters) ·
§3 (Chinchilla scaling laws + the over-training rationale).
Next: Ch.17 — Mixture of Experts. Sparse activation, why a 47B-param model can be 13B compute, the load-balancing problem.