Train a transformer and its loss falls in a smooth, predictable power law as you add parameters, data, and compute. That predictability is the whole product: you fit a line on cheap small models and forecast a model 100× bigger before spending the money — which is exactly why labs were willing to bet nine figures on a single run.
Three knobs: N parameters, D training tokens, C total compute. One accounting identity, C ≈ 6·N·D, binds them, and loss falls along a clean straight line on log–log axes. Chinchilla tells you how to split the budget: about 20 tokens per parameter.
You're baking bread with a fixed amount of money. Spend it on a bigger oven (more model parameters — capacity to represent patterns) or on more flour and time (more training tokens — examples to learn from). A scaling law is the recipe that splits the money for the best loaf — and lets you taste it before you bake.
N, parameters. The model's size and capacity. More parameters can memorize and represent more patterns, but they cost compute on every single token, forward and backward.
C, compute. Total training FLOPs, your fixed budget. Bound to the other two by C ≈ 6·N·D. The only question scaling laws answer: how to spend it.
D, tokens. How much data the model sees. More tokens lower loss too, but most GPT-3-era models were starved: too big for the data they got.
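The identity C ≈ 6·N·D is easy to check by hand. The sketch below is only illustrative: the function names are made up, and the GPT-3 figures (175B parameters, roughly 300B tokens) are the commonly cited ones rather than exact accounting.

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute from the identity C ~ 6 * N * D."""
    return 6.0 * n_params * n_tokens


def tokens_per_param(n_params: float, n_tokens: float) -> float:
    """How many training tokens the model sees per parameter."""
    return n_tokens / n_params


# Commonly cited GPT-3 figures (approximate): 175B parameters, ~300B tokens.
gpt3_n, gpt3_d = 175e9, 300e9
gpt3_c = training_flops(gpt3_n, gpt3_d)
gpt3_ratio = tokens_per_param(gpt3_n, gpt3_d)

print(f"GPT-3: C ~ {gpt3_c:.2e} FLOPs, {gpt3_ratio:.1f} tokens per parameter")
```

Under these assumptions the run costs about 3 × 10²³ FLOPs and sees fewer than 2 tokens per parameter, which is what "starved" means here.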
For most of deep learning you couldn't say in advance how good a bigger model would be — you trained it and hoped. Scaling laws ended the guessing. Across more than seven orders of magnitude of compute, test loss behaves like a clean function of N, D, and C. Straight lines extrapolate. That is the magic.
Plot loss against compute on log–log axes and you get a nearly straight line. Drag the exponent and watch how steeply the curve bends — and why the real exponents (~0.05–0.1) mean brutal diminishing returns.
Read the exponents physically. Kaplan's fits put them near 0.076 for N and 0.095 for D. Multiplying N by 10 multiplies loss by 10⁻⁰·⁰⁷⁶ ≈ 0.84, a drop of about 16%; multiplying D by 10 multiplies it by 10⁻⁰·⁰⁹⁵ ≈ 0.80. Small exponents are diminishing returns baked into the math.
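To make those factors concrete, here is a small sketch that turns an exponent into a per-decade loss multiplier and, going the other way, into the scale-up needed to halve the loss. It assumes the pure power-law form with no irreducible-loss floor; the exponents are Kaplan et al.'s reported values, and the helper names are hypothetical.

```python
# Exponents as reported by Kaplan et al. (2020); treat them as approximate.
ALPHA_N = 0.076  # parameters
ALPHA_D = 0.095  # training tokens


def loss_multiplier(scale_up: float, alpha: float) -> float:
    """Factor the loss is multiplied by when the resource grows scale_up times."""
    return scale_up ** (-alpha)


def scale_needed(target_multiplier: float, alpha: float) -> float:
    """How many times more of the resource is needed to reach a loss multiplier."""
    return target_multiplier ** (-1.0 / alpha)


per_decade_n = loss_multiplier(10, ALPHA_N)
per_decade_d = loss_multiplier(10, ALPHA_D)
halving_n = scale_needed(0.5, ALPHA_N)

print(f"10x parameters -> loss x {per_decade_n:.3f}")
print(f"10x tokens     -> loss x {per_decade_d:.3f}")
print(f"halving loss via parameters alone needs ~{halving_n:,.0f}x more of them")
```

Under this simplified form, halving the loss through parameters alone takes a scale-up in the thousands, which is the brutal part of "diminishing returns".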
You never get linear improvement. You get a straight line with a shallow slope on log–log axes, which on linear axes is a curve that keeps flattening. Kaplan also found that shape barely matters within a wide band (width vs depth, aspect ratio): the law is about scale, not architecture.
This is the whole reason the field could justify nine-figure runs. Click to place a few cheap "anchor" models on the log–log plot. The line fits itself, then extrapolates to forecast a model far off the right edge — before a single GPU spins up on it.
Each dot is a small model you actually trained — cheap. The straight line through them is the scaling law for your data pipeline. Extend it rightward and the ★ forecast tells you the loss of the run you can't afford to gamble on.
In a real run, the fitted L is your predicted loss. If the live curve drifts off the line, that's rarely a discovery; far more often it's a bug in your training run.
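The fit-then-extrapolate step can be sketched in a few lines: take logs, fit a straight line by least squares, and read the line off at a larger budget. The anchor points below are synthetic, generated from a made-up law L = 30·C⁻⁰·⁰⁵, not measurements, and the pure power law ignores any irreducible-loss floor.

```python
import math


def fit_power_law(compute, loss):
    """Least-squares line in log-log space: log10(L) = intercept + slope * log10(C)."""
    xs = [math.log10(c) for c in compute]
    ys = [math.log10(v) for v in loss]
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept


def predict_loss(c, slope, intercept):
    """Evaluate the fitted line at compute budget c."""
    return 10 ** (intercept + slope * math.log10(c))


# Synthetic anchors from L = 30 * C^-0.05 (made-up constants, for illustration).
anchor_compute = [1e17, 1e18, 1e19, 1e20]
anchor_loss = [30.0 * c ** -0.05 for c in anchor_compute]

slope, intercept = fit_power_law(anchor_compute, anchor_loss)
forecast = predict_loss(1e22, slope, intercept)  # 100x beyond the largest anchor

print(f"fitted exponent: {-slope:.3f}")
print(f"forecast loss at 1e22 FLOPs: {forecast:.3f}")
```

With real anchors the points carry noise, so the forecast is only as good as the fit; a live run that departs from it is a prompt to check the pipeline first.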
Fix the compute budget so C = 6·N·D is constant. Now slide N up and D must come down to pay for it. Drag the split — the loss curve below is the real Chinchilla function, and the dashed line marks the compute-optimal minimum at ~20 tokens per parameter.
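The same trade-off can be explored offline. The sketch below uses the parametric form L(N, D) = E + A/Nᵅ + B/Dᵝ and grid-searches N at a fixed budget, with D forced by C = 6·N·D. The coefficients are assumptions, rounded to be close to the values reported in the 2024 Epoch AI re-fit; the budget is likewise illustrative. Substitute your own fit before trusting any number.

```python
# Assumed coefficients for L(N, D) = E + A / N^alpha + B / D^beta.
# Rounded, illustrative values; replace them with your own fit.
E, A, B, ALPHA, BETA = 1.82, 482.0, 2085.0, 0.348, 0.366


def loss(n, d):
    """Parametric loss as a function of parameters n and training tokens d."""
    return E + A / n**ALPHA + B / d**BETA


def best_split(compute, n_min=1e8, n_max=1e13, steps=2000):
    """Grid-search N on a log scale; D is forced by C = 6 * N * D."""
    best = None
    for i in range(steps + 1):
        n = n_min * (n_max / n_min) ** (i / steps)
        d = compute / (6.0 * n)
        candidate = (loss(n, d), n, d)
        if best is None or candidate < best:
            best = candidate
    return best


budget = 5.76e23  # FLOPs; an illustrative frontier-scale budget (assumption)
best_loss, best_n, best_d = best_split(budget)

print(f"N ~ {best_n:.2e}, D ~ {best_d:.2e}, loss ~ {best_loss:.3f}")
print(f"tokens per parameter ~ {best_d / best_n:.1f}")
```

With these assumed coefficients the minimum lands in the neighborhood of 20 tokens per parameter; different coefficients move it, which is why the ratio is a rule of thumb rather than a constant.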
Kaplan (2020) said: build a huge model, feed it modest data, stop before convergence — and that produced GPT-3 (175B params on only ~300B tokens). Chinchilla (2022) re-ran it carefully and said: scale model and data equally. Toggle the recipe and watch the budget re-allocate.
Most GPT-3-era models were too big for their data. Chinchilla's 70B, trained on 4× more data than the 280B Gopher at equal compute, beat Gopher, GPT-3 and the 530B Megatron-Turing.
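If you take the ~20:1 rule at face value, the split has a closed form: substituting D = 20·N into C = 6·N·D gives N = √(C/120). The sketch below applies it to Gopher's budget. The token counts (about 300B for Gopher, about 1.4T for Chinchilla) are the published figures as best I know them and should be read as assumptions; the function name is hypothetical.

```python
import math

TOKENS_PER_PARAM = 20.0  # rule of thumb from the text, not an exact constant


def compute_optimal_split(compute, ratio=TOKENS_PER_PARAM):
    """Solve C = 6 * N * D with D = ratio * N, i.e. N = sqrt(C / (6 * ratio))."""
    n = math.sqrt(compute / (6.0 * ratio))
    return n, ratio * n


gopher_c = 6 * 280e9 * 300e9      # Gopher: 280B params, ~300B tokens (assumed)
chinchilla_c = 6 * 70e9 * 1.4e12  # Chinchilla: 70B params, ~1.4T tokens (assumed)

n_opt, d_opt = compute_optimal_split(gopher_c)

print(f"Gopher budget     ~ {gopher_c:.2e} FLOPs")
print(f"Chinchilla budget ~ {chinchilla_c:.2e} FLOPs")
print(f"20:1 split of Gopher's budget: N ~ {n_opt:.2e}, D ~ {d_opt:.2e}")
```

Under these figures the two budgets are within about 20% of each other, and the 20:1 split of Gopher's budget comes out near 65B parameters and 1.3T tokens, close to what Chinchilla actually used.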
The disagreement was protocol, not philosophy. Kaplan used a fixed cosine-decay length across runs (which under-decays short runs) and counted embedding parameters differently. Fix both and the two laws converge.
A 70B model beating a 530B one means a model roughly 7.6× smaller wins, and it is far cheaper to fine-tune and serve. Chinchilla reframed the entire cost equation of frontier AI.
Loss falls smoothly — but do capabilities appear suddenly? Wei et al. (2022) found tasks where models score near-random, then sharply jump past some scale. Schaeffer et al. (2023) pushed back: that cliff may be an artifact of the metric. Flip the metric on the same models and watch the cliff melt into a ramp.
A harsh, all-or-nothing metric (exact-match on a long answer) manufactures a sharp cliff: get one token wrong, score zero. A smooth metric (token-level log-likelihood, edit distance) on the same models reveals steady, gradual improvement. Swap the metric and the "emergence" can evaporate.
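A toy calculation shows how the cliff can come from the metric alone. Assume per-token accuracy rises smoothly with scale (a made-up logistic curve here) and that errors on each token are independent; exact-match on a 10-token answer is then the per-token accuracy raised to the 10th power. Every number below is hypothetical.

```python
import math

ANSWER_LENGTH = 10  # tokens in the target answer (hypothetical)


def per_token_accuracy(log10_params):
    """Made-up smooth competence curve: a logistic in log10(parameter count)."""
    return 1.0 / (1.0 + math.exp(-1.5 * (log10_params - 9.0)))


def exact_match(p_token, length=ANSWER_LENGTH):
    """Every token must be right, assuming independent per-token errors."""
    return p_token**length


scales = [8.0, 9.0, 10.0, 11.0, 12.0]  # log10 of parameter count
smooth = [per_token_accuracy(s) for s in scales]
harsh = [exact_match(p) for p in smooth]

for s, p, em in zip(scales, smooth, harsh):
    print(f"1e{s:.0f} params  per-token {p:.3f}  exact-match {em:.3f}")
```

The per-token column climbs steadily at every scale, while the exact-match column sits near zero through 10⁹ parameters and then shoots up: same models, different metric.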
Schaeffer et al. even conjured fake emergence in vision tasks by choosing a nonlinear metric. But it's not settled: some abilities still look abrupt under smooth metrics, and "smooth underlying competence" doesn't make the practical threshold — below ~60B it just can't do this task — any less real to a user.
These are empirical curve fits — only as trustworthy as the protocol underneath. Chinchilla itself used three estimation routes that should agree, and a 2024 replication found the original's precision was overstated. The ~20:1 conclusion survived; treat the exact coefficients as soft.
| Estimation method | How it works | Result | Status |
|---|---|---|---|
| 1 · Training curves | fix model sizes, vary tokens, read each curve's minimum | ~20 tokens/param | broadly agrees |
| 2 · IsoFLOP profiles | fix compute, sweep model size, find the loss-minimizing N per FLOP level | ~20 tokens/param | broadly agrees |
| 3 · Parametric L(N,D) | fit E + A/Nᵅ + B/Dᵝ jointly to all runs | ~20 tokens/param | CIs were too narrow* |
* Epoch AI (Besiroglu et al. 2024) found the parametric method's confidence intervals implausibly narrow — "intervals this narrow would require over 600,000 experiments" when the authors likely ran fewer than 500. Their re-fit reconciled it with methods 1 and 2. The conclusion holds; the original precision did not.
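Method 2 can be sketched in miniature. In a real study each point on an IsoFLOP profile is a trained model; here the losses come from an assumed parametric formula so the example is self-contained. As a simplification, the minimum is refined with a parabola in log₁₀ N through the lowest point and its two neighbours rather than a least-squares fit over the whole profile. Coefficients, budget, and model sizes are all illustrative assumptions.

```python
import math

# Assumed L(N, D) coefficients (illustrative; substitute your own fit).
E, A, B, ALPHA, BETA = 1.82, 482.0, 2085.0, 0.348, 0.366


def loss(n, d):
    """Stand-in for a trained model's final loss."""
    return E + A / n**ALPHA + B / d**BETA


def isoflop_minimum(compute, model_sizes):
    """Sweep N at fixed C, then refine the minimum with a three-point parabola."""
    xs = [math.log10(n) for n in model_sizes]
    ys = [loss(n, compute / (6.0 * n)) for n in model_sizes]
    i = min(range(1, len(xs) - 1), key=lambda k: ys[k])  # lowest interior point
    x1, x2, x3 = xs[i - 1], xs[i], xs[i + 1]
    y1, y2, y3 = ys[i - 1], ys[i], ys[i + 1]
    # Vertex of the parabola through the three points.
    num = (x2 - x1) ** 2 * (y2 - y3) - (x2 - x3) ** 2 * (y2 - y1)
    den = (x2 - x1) * (y2 - y3) - (x2 - x3) * (y2 - y1)
    return 10 ** (x2 - 0.5 * num / den)


budget = 5.76e23  # FLOPs (illustrative)
sizes = [1e10, 2e10, 4e10, 8e10, 1.6e11, 3.2e11]
n_star = isoflop_minimum(budget, sizes)
ratio = budget / (6.0 * n_star) / n_star

print(f"loss-minimizing N ~ {n_star:.2e}, tokens per parameter ~ {ratio:.1f}")
```

Repeating this at several budgets and fitting a line through the minima in log–log space is what turns the profiles into a scaling rule.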
Chinchilla minimizes loss for a fixed training budget. But you train once and serve forever. If inference dominates lifetime cost — it usually does — you should deliberately overtrain a smaller model far past 20:1, Llama-style, to get something cheaper to serve. Chinchilla answers the training question, not the total-cost question.
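A rough lifetime-cost comparison makes the overtraining argument concrete. The sketch assumes inference costs about 2·N FLOPs per served token (a common approximation that is not from this article), reuses an assumed parametric loss with illustrative coefficients, and asks how many tokens a half-size model needs to match a 70B model's loss. All names and numbers are hypothetical.

```python
# Assumed L(N, D) coefficients (illustrative; substitute your own fit).
E, A, B, ALPHA, BETA = 1.82, 482.0, 2085.0, 0.348, 0.366


def loss(n, d):
    return E + A / n**ALPHA + B / d**BETA


def tokens_for_loss(n, target):
    """Training tokens a model of size n needs to reach the target loss."""
    headroom = target - E - A / n**ALPHA
    if headroom <= 0:
        raise ValueError("model too small to reach this loss at any data size")
    return (B / headroom) ** (1.0 / BETA)


def lifetime_flops(n, d_train, d_served):
    """Training at ~6*N FLOPs per token plus inference at ~2*N per token (assumed)."""
    return 6.0 * n * d_train + 2.0 * n * d_served


big_n, big_d = 70e9, 1.4e12          # roughly compute-optimal reference model
target = loss(big_n, big_d)
small_n = 35e9                       # half the size, trained far past 20:1
small_d = tokens_for_loss(small_n, target)

served = 1e13                        # lifetime tokens served (hypothetical)
big_total = lifetime_flops(big_n, big_d, served)
small_total = lifetime_flops(small_n, small_d, served)

print(f"small model needs ~{small_d / small_n:.0f} tokens per parameter")
print(f"lifetime FLOPs: big ~ {big_total:.2e}, small ~ {small_total:.2e}")
```

Under these assumptions the smaller model costs more to train but less over its lifetime once serving volume is large enough; where the crossover falls depends entirely on the fitted coefficients and the traffic you expect.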
And it's a law of loss, not of intelligence. Lower cross-entropy correlates with better behavior but does not equal reasoning, factuality, or alignment. The curve doesn't optimize those.
~20:1 assumes you have the tokens. Frontier models now want trillions of unique, high-quality tokens — and the open web is finite. That hard wall drives synthetic data, multi-epoch training (which breaks the single-epoch assumption), and the pivot to test-time-compute reasoning: buy capability with inference compute instead of more pretraining data.
Architecture-independence has limits: MoE, state-space models, and attention variants each have their own scaling constants. The dense-transformer law is not universal. The exponents held 2020–2024 — but a power law carries no guarantee three orders of magnitude further out. It could bend.
That covers the power law, the C = 6ND identity, the Chinchilla loss, the ~20:1 rule, the correction to Kaplan, the emergence debate, and the honest caveats. You can now look at a compute budget and say where the money goes, and why predictability, not any single number, is the real product.
Scaling laws make dense compute expensive. MoE decouples total parameters from active FLOPs — more capacity, same cost per token. The first great cheat.