openalicelabs / academy
COURSE ARCH-01 · LESSON 01 · 06 · TOPIC SCALING LAWS · EST. READ ~12 MIN

Predict the model before you build it.

Train a transformer and its loss falls in a smooth, predictable power law as you add parameters, data, and compute. That predictability is the whole product: you fit a line on cheap small models and forecast a model 100× bigger before spending the money — which is exactly why labs were willing to bet nine figures on a single run.

FIG.00 — LOSS vs COMPUTE
FIG.0A — THE ENGINE · compute C = 6·N·D · loss L(N,D) = E + A/Nᵅ + B/Dᵝ

Three knobs: N parameters, D training tokens, C total compute. They are bound by one accounting identity, C ≈ 6·N·D, and loss falls along a nearly straight line on log–log axes. Chinchilla tells you how to split the budget: about 20 training tokens per parameter.

INPUTS: N parameters · D training tokens · C compute (FLOPs)
OUTPUT: predicted test loss L, before you train
THE IDENTITY: C ≈ 6 · N · D
CHINCHILLA RULE: ~20 training tokens per parameter
WHO & WHEN: Kaplan 2020 (OpenAI) · Hoffmann 2022 (DeepMind)
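The identity and the 20:1 rule combine into a two-line calculation. Here is a minimal sketch, assuming C ≈ 6·N·D holds exactly and the tokens-per-parameter ratio is fixed at 20. The function name `compute_optimal_split` and the example budget are illustrative, not taken from any library or paper.

```python
import math

def compute_optimal_split(compute_flops, tokens_per_param=20.0):
    """Split a FLOP budget into (N, D) using C = 6*N*D and D = r*N.

    Substituting D = r*N gives C = 6*r*N**2, so N = sqrt(C / (6*r)).
    Both the identity and the ratio r are approximations.
    """
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Hypothetical budget of 1e21 FLOPs.
n, d = compute_optimal_split(1e21)
print(f"N ~ {n:.3g} parameters, D ~ {d:.3g} tokens")
```

Changing `tokens_per_param` lets you explore splits away from 20:1 under the same budget.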
01 / 07
Intuition first · a fixed budget

Bigger oven, or more flour?

You're baking bread with a fixed amount of money. Spend it on a bigger oven (more model parameters — capacity to represent patterns) or on more flour and time (more training tokens — examples to learn from). A scaling law is the recipe that splits the money for the best loaf — and lets you taste it before you bake.

N — THE OVEN

parameters

The model's size and capacity. More parameters can memorize and represent more patterns — but cost compute on every single token, forward and backward.

C — THE MONEY ★

compute

Total training FLOPs, your fixed budget. Bound to the other two by C ≈ 6·N·D. The only question scaling laws answer: how to spend it.

D — THE FLOUR

training tokens

How much data the model sees. More tokens lower loss too — but most GPT-3-era models were starved: too big for the data they got.

For most of deep learning you couldn't say in advance how good a bigger model would be — you trained it and hoped. Scaling laws ended the guessing. Across more than seven orders of magnitude of compute, test loss behaves like a clean function of N, D, and C. Straight lines extrapolate. That is the magic.

02 / 07
The core shape · a straight line in log–log

Loss falls in a power law.

Plot loss against compute on log–log axes and you get a nearly straight line. Drag the exponent and watch how steeply the curve bends — and why the real exponents (~0.05–0.1) mean brutal diminishing returns.

FIG.02 — L(N) = (Nᴄ / N)ᵅ · LINEAR vs LOG–LOG
Slider range: α from 0.02 to 0.40
L(N) ≈ (Nᴄ / N)^α_N, with α_N ≈ 0.076
L(D) ≈ (Dᴄ / D)^α_D, with α_D ≈ 0.095

Read the exponents physically: multiplying N by 10 multiplies loss by ≈ 0.84 (10⁻⁰·⁰⁷⁶), a drop of about 16%; multiplying D by 10 multiplies it by ≈ 0.80, a drop of about 20%. The exponents are small, so diminishing returns are baked into the math.
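To check that arithmetic yourself, here is a small sketch. It assumes the pure power-law form L ∝ N^(−α) with no irreducible-loss floor; the helper names are made up for illustration.

```python
ALPHA_N = 0.076  # parameter exponent quoted above
ALPHA_D = 0.095  # data exponent quoted above

def loss_multiplier(scale_factor, exponent):
    """Factor by which loss changes when N (or D) is multiplied by scale_factor."""
    return scale_factor ** (-exponent)

def scale_needed(target_multiplier, exponent):
    """How much N (or D) must grow to multiply loss by target_multiplier (< 1)."""
    return target_multiplier ** (-1.0 / exponent)

print(round(loss_multiplier(10, ALPHA_N), 2))  # 0.84
print(round(loss_multiplier(10, ALPHA_D), 2))  # 0.8
# Halving the loss through parameters alone takes roughly 9,000x more of them.
print(f"{scale_needed(0.5, ALPHA_N):.1e}")
```

The last line is the diminishing-returns point in one number: under this idealized form, a 2× better loss costs almost four orders of magnitude in model size.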

You never get linear improvement. You get a slowly descending straight line in log–log. Kaplan also found that model shape (width vs. depth, aspect ratio) barely matters within a wide band: the law is about scale, not architectural detail.

03 / 07
The product is forecastability · fit small, predict large

Train tiny. Forecast the giant.

This is the whole reason the field could justify nine-figure runs. Click to place a few cheap "anchor" models on the log–log plot. The line fits itself, then extrapolates to forecast a model far off the right edge — before a single GPU spins up on it.

FIG.03 — CLICK TO ADD ANCHOR RUNS · FIT · EXTRAPOLATE
Readouts: anchor runs placed · fitted exponent α · fit quality (R²) · forecast at 1000× compute

Each dot is a small model you actually trained — cheap. The straight line through them is the scaling law for your data pipeline. Extend it rightward and the ★ forecast tells you the loss of the run you can't afford to gamble on.

In a real run, the fitted L is your predicted loss. If the live curve drifts off the line, suspect a bug in your training run before you suspect a discovery.
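The fit-then-extrapolate step can be sketched in a few lines. This is a minimal sketch, not a production fitting procedure: it assumes a pure power law in compute (no irreducible-loss term), uses an ordinary least-squares line in log–log space, and runs on synthetic anchor points generated from made-up constants. `fit_power_law` and `forecast` are hypothetical helper names.

```python
import numpy as np

def fit_power_law(compute, loss):
    """Least-squares line in log10-log10 space: log L = intercept - alpha * log C."""
    slope, intercept = np.polyfit(np.log10(compute), np.log10(loss), deg=1)
    return -slope, intercept

def forecast(compute, alpha, intercept):
    """Evaluate the fitted line at a new compute budget."""
    return 10.0 ** (intercept - alpha * np.log10(compute))

# Synthetic anchor runs (made-up numbers: a known law plus small noise).
rng = np.random.default_rng(0)
anchor_compute = np.array([1e17, 3e17, 1e18, 3e18, 1e19])
true_alpha, true_intercept = 0.05, 1.2
anchor_loss = forecast(anchor_compute, true_alpha, true_intercept)
anchor_loss *= np.exp(rng.normal(0.0, 0.002, size=anchor_loss.shape))

alpha, intercept = fit_power_law(anchor_compute, anchor_loss)
target = anchor_compute.max() * 1000  # 1000x beyond the largest anchor
print(f"fitted alpha = {alpha:.3f}")
print(f"forecast loss at {target:.0e} FLOPs = {forecast(target, alpha, intercept):.3f}")
```

Small errors in the fitted slope are amplified by the length of the extrapolation, which is one reason to spread anchor runs across as many decades of compute as you can afford.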

04 / 07
The Chinchilla optimization · split a fixed budget

One budget. Where does it go?

Fix the compute budget so C = 6·N·D is constant. Now slide N up and D must come down to pay for it. Drag the split — the loss curve below is the real Chinchilla function, and the dashed line marks the compute-optimal minimum at ~20 tokens per parameter.

FIG.04 — ISOFLOP · LOSS L(N,D) ALONG A FIXED-COMPUTE SLICE
Slider: tiny model, huge data ↔ huge model, tiny data
Readouts: compute budget C (fixed) · your split in tokens/param (default 20) · model size N · training tokens D · predicted loss L · gap to optimum
Chinchilla fitted loss: L(N,D) = E + A/Nᵅ + B/Dᵝ, with E = 1.69, A = 406.4, α = 0.34, B = 410.7, β = 0.28
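You can reproduce an IsoFLOP slice numerically. The sketch below assumes the parametric form L(N,D) = E + A/N^α + B/D^β with the rounded coefficients quoted above, a hypothetical budget of 1e21 FLOPs, and a simple grid search over N. With rounded coefficients, the minimum of the slice may not land exactly at 20 tokens per parameter, so treat the printed ratio as a property of this particular fit rather than a universal constant.

```python
import numpy as np

# Rounded coefficients as quoted in the figure note above.
E, A, ALPHA, B, BETA = 1.69, 406.4, 0.34, 410.7, 0.28

def chinchilla_loss(n_params, n_tokens):
    """Parametric loss L(N, D) = E + A / N**alpha + B / D**beta."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA

def isoflop_slice(compute_flops, n_grid):
    """Loss along a fixed-compute slice, with D forced by C = 6*N*D."""
    d_grid = compute_flops / (6.0 * n_grid)
    return d_grid, chinchilla_loss(n_grid, d_grid)

compute = 1e21                     # hypothetical budget in FLOPs
n_grid = np.logspace(7, 11, 2001)  # candidate model sizes, 10M to 100B
d_grid, losses = isoflop_slice(compute, n_grid)
best = int(np.argmin(losses))

print(f"N* ~ {n_grid[best]:.3g}, D* ~ {d_grid[best]:.3g}")
print(f"tokens per parameter at the minimum ~ {d_grid[best] / n_grid[best]:.1f}")
print(f"loss at the minimum ~ {losses[best]:.3f}")
```

Plotting `losses` against `n_grid` on a log x-axis gives the U-shaped curve the figure describes: too small a model on one side, too little data on the other.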
05 / 07
Two papers · the famous disagreement

The recipe Chinchilla fixed.

Kaplan (2020) said: build a huge model, feed it modest data, and stop before convergence. That recipe produced GPT-3 (175B parameters on only ~300B tokens). Chinchilla (2022) re-ran the experiments more carefully and said: scale model size and data in equal proportion. Toggle the recipe and watch the budget re-allocate.

FIG.05 — SAME COMPUTE BUDGET · TWO ALLOCATION RECIPES
Toggle: Chinchilla 2022 | Kaplan 2020
Shown for the Chinchilla recipe:
N_opt scales as C^0.50
D_opt scales as C^0.50
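To see what the two recipes do to a growing budget, here is a short sketch. The Chinchilla exponents (0.50 and 0.50) come from the figure above. The Kaplan-style exponents (about 0.73 for N and 0.27 for D) are an assumption: they are the values commonly quoted from the 2020 paper, but this lesson does not state them. `scale_allocation` is a hypothetical helper.

```python
def scale_allocation(budget_multiplier, n_exponent, d_exponent):
    """Growth factors for N and D when compute grows by budget_multiplier."""
    return budget_multiplier ** n_exponent, budget_multiplier ** d_exponent

# Chinchilla recipe from the figure: N_opt and D_opt both scale as C^0.50.
chin_n, chin_d = scale_allocation(100, 0.50, 0.50)

# Kaplan-style recipe: ASSUMED exponents (about 0.73 for N, 0.27 for D),
# commonly quoted from the 2020 paper but not stated in this lesson.
kap_n, kap_d = scale_allocation(100, 0.73, 0.27)

print(f"Chinchilla, 100x compute: N x{chin_n:.1f}, D x{chin_d:.1f}")
print(f"Kaplan,     100x compute: N x{kap_n:.1f}, D x{kap_d:.1f}")
```

In both cases the exponents sum to 1, so the growth factors multiply back to the 100× budget, as C = 6·N·D requires. The recipes differ only in which knob gets the larger share.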
THE BIG CORRECTION

undertrained giants

Most GPT-3-era models were too big for their data. Chinchilla's 70B, trained on 4× more data than the 280B Gopher at equal compute, beat Gopher, GPT-3 and the 530B Megatron-Turing.
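A quick way to see the "undertrained giants" is to compute tokens per parameter directly. In this sketch the GPT-3 figures come from the text above; the Gopher (~300B) and Chinchilla (~1.4T) token counts are commonly reported values that this lesson does not state, so treat them as assumptions. Training compute uses the approximate C = 6·N·D identity.

```python
# (params, training tokens). GPT-3 figures are from the text above; the Gopher
# and Chinchilla token counts are commonly reported values, ASSUMED here.
MODELS = {
    "GPT-3": (175e9, 300e9),
    "Gopher": (280e9, 300e9),
    "Chinchilla": (70e9, 1.4e12),
}

def tokens_per_param(n_params, n_tokens):
    """Training tokens seen per model parameter."""
    return n_tokens / n_params

def training_flops(n_params, n_tokens):
    """Approximate training compute from C = 6*N*D."""
    return 6.0 * n_params * n_tokens

for name, (n, d) in MODELS.items():
    print(f"{name:<10} {tokens_per_param(n, d):5.1f} tokens/param, "
          f"~{training_flops(n, d):.2e} FLOPs")
```

Under these assumed figures, Gopher and Chinchilla come out at roughly the same training compute while differing by about 20× in tokens per parameter.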

WHY THEY DIFFERED

experimental hygiene

Not philosophy — protocol. Kaplan used a fixed cosine-decay length across runs (under-decaying short runs) and counted embedding params differently. Fix both and the laws converge.

SMALLER, CHEAPER

a 7.5× win

A 70B beating a 530B is a 7.5×-smaller model winning — and far cheaper to fine-tune and serve. Chinchilla reframed the entire cost equation of frontier AI.

06 / 07
The loudest debate · cliffs or curves?

Emergence, or a mirage?

Loss falls smoothly — but do capabilities appear suddenly? Wei et al. (2022) found tasks where models score near-random, then sharply jump past some scale. Schaeffer et al. (2023) pushed back: that cliff may be an artifact of the metric. Flip the metric on the same models and watch the cliff melt into a ramp.

FIG.06 — SAME MODELS · TWO METRICS · ONE OF THEM LIES
Toggle: Harsh metric (exact-match) | Smooth metric (log-likelihood)
THE MIRAGE MECHANISM

A harsh, all-or-nothing metric (exact-match on a long answer) manufactures a sharp cliff: get one token wrong, score zero. A smooth metric (token-level log-likelihood, edit distance) on the same models reveals steady, gradual improvement. Swap the metric and the "emergence" can evaporate.
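The mechanism is easy to simulate. This toy sketch assumes each answer token is correct independently with the same probability, so exact-match on an answer of length k is p^k. The per-token accuracies and the answer length are made-up numbers, not measurements from any model.

```python
import numpy as np

def exact_match_rate(per_token_accuracy, answer_length):
    """Probability that every token of the answer is right (independence assumed)."""
    return per_token_accuracy ** answer_length

# Hypothetical models whose per-token accuracy improves smoothly with scale.
per_token_accuracy = np.linspace(0.50, 0.99, 8)
answer_length = 20  # hypothetical answer length in tokens

exact = exact_match_rate(per_token_accuracy, answer_length)
log_likelihood = answer_length * np.log(per_token_accuracy)  # smooth metric

for p, em, ll in zip(per_token_accuracy, exact, log_likelihood):
    print(f"token acc {p:.2f} | exact-match {em:.3f} | log-likelihood {ll:7.2f}")
```

The log-likelihood column climbs steadily across all eight models, while most of the exact-match gain arrives in the final step. Same models, same underlying improvement, two very different-looking curves.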

Schaeffer et al. even conjured fake emergence in vision tasks by choosing a nonlinear metric. But it's not settled: some abilities still look abrupt under smooth metrics, and "smooth underlying competence" doesn't make the practical threshold — below ~60B it just can't do this task — any less real to a user.

07 / 07
Honest caveats · the map is not the territory

What scaling laws don't promise.

These are empirical curve fits — only as trustworthy as the protocol underneath. Chinchilla itself used three estimation routes that should agree, and a 2024 replication found the original's precision was overstated. The ~20:1 conclusion survived; treat the exact coefficients as soft.

Estimation method | How it works | Result | Status
1 · Training curves | fix model sizes, vary tokens, read each curve's minimum | ~20 tokens/param | broadly agrees
2 · IsoFLOP profiles | fix compute, sweep model size, find the loss-minimizing N per FLOP level | ~20 tokens/param | broadly agrees
3 · Parametric L(N,D) | fit E + A/Nᵅ + B/Dᵝ jointly to all runs | ~20 tokens/param | CIs were too narrow*

* Epoch AI (Besiroglu et al. 2024) found the parametric method's confidence intervals implausibly narrow — "intervals this narrow would require over 600,000 experiments" when the authors likely ran fewer than 500. Their re-fit reconciled it with methods 1 and 2. The conclusion holds; the original precision did not.

COMPUTE-OPTIMAL ≠ DEPLOYMENT-OPTIMAL

Chinchilla minimizes loss for a fixed training budget. But you train once and serve forever. If inference dominates lifetime cost — it usually does — you should deliberately overtrain a smaller model far past 20:1, Llama-style, to get something cheaper to serve. Chinchilla answers the training question, not the total-cost question.
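The training-versus-lifetime trade-off can be sketched with the parametric loss from earlier in the lesson. Several things here are assumptions, not facts from the source: inference is costed at roughly 2·N FLOPs per token, the budget and the served-token volume are hypothetical, and the "overtrained" model is simply half the compute-optimal size, trained until the fitted loss matches.

```python
E, A, ALPHA, B, BETA = 1.69, 406.4, 0.34, 410.7, 0.28  # coefficients quoted in the lesson

def loss(n, d):
    """Parametric loss L(N, D) = E + A / N**alpha + B / D**beta."""
    return E + A / n**ALPHA + B / d**BETA

def optimal_n(compute):
    """Closed-form minimizer of loss along C = 6*N*D for this parametric fit."""
    g = (ALPHA * A / (BETA * B)) ** (1.0 / (ALPHA + BETA))
    return g * (compute / 6.0) ** (BETA / (ALPHA + BETA))

def tokens_to_match(n, target_loss):
    """Tokens D such that loss(n, D) equals target_loss, if reachable."""
    gap = target_loss - E - A / n**ALPHA
    if gap <= 0:
        raise ValueError("model too small to reach the target loss at any D")
    return (B / gap) ** (1.0 / BETA)

def lifetime_flops(n, d, tokens_served):
    # Training uses C = 6*N*D; inference at ~2*N FLOPs per token is an ASSUMPTION.
    return 6.0 * n * d + 2.0 * n * tokens_served

compute = 1e21        # hypothetical training budget
tokens_served = 1e12  # hypothetical lifetime inference volume

n_opt = optimal_n(compute)
d_opt = compute / (6.0 * n_opt)
n_small = n_opt / 2.0
d_small = tokens_to_match(n_small, loss(n_opt, d_opt))

for label, n, d in [("compute-optimal", n_opt, d_opt), ("overtrained", n_small, d_small)]:
    print(f"{label:<16} N={n:.3g} D={d:.3g} train={6 * n * d:.3g} "
          f"lifetime={lifetime_flops(n, d, tokens_served):.3g}")
```

Under these assumptions the smaller model costs more to train but less over its lifetime once enough tokens are served. Lower `tokens_served` far enough and the ordering flips, which is the whole point: the right answer depends on how much inference you expect.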

And it's a law of loss, not of intelligence. Lower cross-entropy correlates with better behavior but does not equal reasoning, factuality, or alignment. The curve doesn't optimize those.

THE DATA WALL

~20:1 assumes you have the tokens. Frontier models now want trillions of unique, high-quality tokens — and the open web is finite. That hard wall drives synthetic data, multi-epoch training (which breaks the single-epoch assumption), and the pivot to test-time-compute reasoning: buy capability with inference compute instead of more pretraining data.

Architecture-independence has limits: MoE, state-space models, and attention variants each have their own scaling constants. The dense-transformer law is not universal. The exponents held 2020–2024 — but a power law carries no guarantee three orders of magnitude further out. It could bend.

01 · 06 — you made it

You can budget a frontier run.

You have the power law, the C = 6ND identity, the Chinchilla loss, the ~20:1 rule, Chinchilla's correction of Kaplan, the emergence debate, and the honest caveats. You can now look at a compute budget and say where the money goes, and why predictability, not any single number, is the real product.

01·02 Tokenization · text → integers · BPE, byte-level, the split ✓ done
01·05 Attention & transformers · the architecture C = 6ND counts ✓ done
01·06 Scaling laws · power law, Chinchilla, ~20 tokens/param ✓ complete
01·07 Mixture-of-Experts · cheating the dense FLOP cost scaling imposes next
Next · 01 · 07

Mixture-of-Experts →

Scaling laws make dense compute expensive. MoE decouples total parameters from active FLOPs — more capacity, same cost per token. The first great cheat.
