openalicelabs / academy
COURSE: TRAIN-02 · LESSON: 02 · 05 · TOPIC: TEST-TIME COMPUTE · EST. READ: ~14 MIN

Pay for the model to think.

There are two ways to make a model answer better: make it bigger (train-time compute), or let it think longer (test-time compute). Reasoning models like o1 and DeepSeek-R1 take the second road: they write a long internal chain of thought, sometimes branch and verify it, and accuracy climbs with the compute you spend, up to a saturation point.

FIG.00 — SHORT vs LONG ANSWER
FIG.0A — TWO AXES OF COMPUTE · train bigger ↑ · think longer →

A question flows in. Instead of one instant answer, the model spends tokens reasoning — it can branch into several attempts, check itself, backtrack, and only then commit. Hard questions get more thinking; easy ones get less.

INPUT: a hard, ideally verifiable question (math, code, logic)
EXTRA SPEND: inference tokens (a chain of thought, paid per query)
HEADLINE FACT: accuracy rises ~log-linearly with test-time compute
CANONICAL PROOF: DeepSeek-R1 · fully published · trained with GRPO
THE CATCH: thinking amplifies latent skill; it can't create it
01 / 07
The reframing · a third scaling axis

Train bigger, or think longer.

For years, "scaling" meant one thing: more parameters, more data, more GPU-hours. Test-time compute adds a second, independent knob you can turn after the model ships — per query, on demand.

TRAIN-TIME

spend once, up front

Bigger model, more tokens, more compute. Expensive, slow, and fixed at ship — every query, trivial or brutal, runs the identical forward pass.

TEST-TIME ★

spend per question

Same weights, but let the model do more work per prompt: reason step by step, try several routes, check itself. Hard gets more; easy gets less.

THE SUBSTITUTION

compute can buy params

Snell et al.: on problems a small model sometimes solves, test-time compute can beat a 14× larger model under matched FLOPs. Inference compute substitutes for scale.

// classic scaling
loss ↓ with params · data · train-FLOPs
// the new axis
acc ↑ with log(test-time compute)   // per-query, after ship

OpenAI's o1 post showed two log-linear lines at once: accuracy rises with the log of train-time RL compute and with the log of test-time thinking. Two knobs, two clean curves — the headline empirical fact of the whole field.

THE HUMAN ANALOGY

A person who studied more (a bigger model) answers an olympiad problem better. But anyone answers it better given ten minutes of scratch paper instead of an instant reply. Test-time compute is the scratch paper — and a "reasoning model" is one trained to use it well.

As far as has been published, the architecture is unchanged: still an ordinary autoregressive transformer. o1, o3, R1, Gemini-thinking, and QwQ differ from ordinary chat models only in behaviour: they are post-trained to produce long chains of thought that actually help, including self-correction.

02 / 07
Eliciting reasoning · the free lunch

Tokens are a serial scratchpad.

A transformer can only do a bounded amount of computation per token. So writing more reasoning tokens literally buys more compute and more intermediate state. Raise the thinking budget and a wrong instant answer becomes a right, worked one.

FIG.02 — SAME QUESTION · MORE THINKING TOKENS
QUESTION — a clock loses 3 min/hour; after 5 hours, how far behind?

[Interactive figure: a thinking-token slider running from BLURT (0 tokens, instant guess) to DELIBERATE, with readouts for mode and answer.]

With zero thinking tokens the model pattern-matches and blurts — often the plausible-but-wrong number. Give it room to lay out the steps and the same weights reach the right answer, because each written step carries computation into the next.

This is chain-of-thought: prompt "think step by step," and the reasoning tokens act as working memory the forward pass would otherwise lack. No retraining — the cheapest lever there is.
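A minimal sketch of the two prompting modes for the clock question above, assuming a plain text-completion interface. `build_prompt`, its wording, and `clock_lag_minutes` are hypothetical names for illustration, not part of any specific API. The second function is simply the worked chain the reasoning tokens need to carry.

```python
def build_prompt(question: str, think: bool) -> str:
    """Return a direct prompt or a chain-of-thought prompt (hypothetical wording)."""
    if think:
        # The only change is the nudge to write intermediate steps.
        return f"Q: {question}\nA: Let's think step by step."
    return f"Q: {question}\nA:"


def clock_lag_minutes(loss_per_hour: float, hours: float) -> float:
    """The worked chain for the example: lag = loss rate x elapsed time."""
    return loss_per_hour * hours


QUESTION = "A clock loses 3 min/hour; after 5 hours, how far behind is it?"
direct_prompt = build_prompt(QUESTION, think=False)
cot_prompt = build_prompt(QUESTION, think=True)
lag = clock_lag_minutes(3, 5)  # 3 min/hour * 5 hours
```

The weights never change between the two prompts; only the room the model has to write out the multiplication does.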

03 / 07
Parallel compute · marginalise the reasoning

Sample many chains. Vote.

One chain of thought can go wrong. Self-consistency samples k independent chains at temperature > 0 and takes the majority vote on the final answer. Right answers concentrate; wrong answers scatter. As samples accumulate, the correct answer pulls ahead.

FIG.03 — SELF-CONSISTENCY · k SAMPLED CHAINS → MAJORITY VOTE
PROBLEM — "Janet's ducks lay 16 eggs/day; she eats 3, bakes with 4, sells the rest at $2. Daily $?"

// reasoning chain r is a latent variable
answer* = argmax_a Σ_i 1[ ans(chain_i) == a ]
        ≈ argmax_a P(a | question)   // marginalising over paths r

The intuition: a hard problem has many valid routes to one right answer, but many different wrong answers. So correct votes pile up while errors spread thin. Pure sampling — no extra training, no verifier.
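The vote itself is only a few lines. The sketch below assumes a hypothetical `sample_answer` callable standing in for "run one chain at temperature > 0 and extract its final answer". The toy sampler and its 40% accuracy are made-up illustration values, not measurements; the correct answer to the duck problem is (16 − 3 − 4) × 2 = 18.

```python
import random
from collections import Counter
from typing import Callable, Optional


def self_consistency(sample_answer: Callable[[], Optional[str]], k: int) -> Optional[str]:
    """Sample k chains and return the most common final answer."""
    votes: Counter = Counter()
    for _ in range(k):
        answer = sample_answer()
        if answer is not None:  # skip chains whose answer could not be extracted
            votes[answer] += 1
    if not votes:
        return None
    return votes.most_common(1)[0][0]


def toy_sampler(rng: random.Random) -> Callable[[], str]:
    """Toy model: right 40% of the time, wrong answers scattered (assumed numbers)."""
    wrong = ["32", "26", "14", "24", "9", "50"]

    def sample() -> str:
        return "18" if rng.random() < 0.4 else rng.choice(wrong)

    return sample


winner = self_consistency(toy_sampler(random.Random(0)), k=200)
```

Even though the toy model is wrong more often than right on any single chain, no individual wrong answer collects as many votes as the right one.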

Reported gains over greedy CoT: +17.9% GSM8K, +11.0% SVAMP, +12.2% AQuA. This is the temporal cousin of an LLM council — a vote across samples of one model instead of across many models.

04 / 07
Guided search · score the steps, prune the bad ones

A verifier beats a vote.

Majority vote is a weak aggregator. A verifier scores candidate solutions so you can pick the best, or steer a search. An outcome model scores only the final answer; a process model scores every step, letting you kill bad branches early.

FIG.04 — PRM-GUIDED SEARCH · prune branches whose steps score low
ORM — OUTCOME REWARD MODEL

Scores only the final answer — "is this right?" Cheap to label (you just need the gold answer) but gives no credit for partial work, and can reward a right answer reached by a lucky wrong path.

PRM — PROCESS REWARD MODEL

Scores each step — "is this step valid?" OpenAI's Let's Verify Step by Step (PRM800K, ~800k human step labels) showed PRMs beat ORMs at selecting correct MATH solutions, and enable guided search: prune branches whose intermediate steps look bad.
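Guided search with a step scorer can be sketched as a small beam search. This assumes two hypothetical callables: `expand` proposes next steps for a partial solution, and `score_step` plays the role of the PRM. Scoring a path by the minimum of its step scores is one possible aggregation, chosen here for simplicity; the product of step scores is another.

```python
from typing import Callable, List, Sequence, Tuple

Path = Tuple[str, ...]


def prm_beam_search(
    expand: Callable[[Path], Sequence[str]],
    score_step: Callable[[Path, str], float],
    depth: int,
    beam_width: int,
) -> List[Tuple[float, Path]]:
    """Keep the beam_width best partial solutions at every depth."""
    beam: List[Tuple[float, Path]] = [(1.0, ())]
    for _ in range(depth):
        candidates = []
        for path_score, path in beam:
            for step in expand(path):
                # One bad step sinks the whole branch (min aggregation).
                score = min(path_score, score_step(path, step))
                candidates.append((score, path + (step,)))
        if not candidates:
            break
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = candidates[:beam_width]  # prune the low-scoring branches
    return beam


# Toy problem: every step is either valid or flawed, and the scorer can tell.
toy_expand = lambda path: ["good", "bad"]
toy_score = lambda path, step: 0.9 if step == "good" else 0.2
best_score, best_path = prm_beam_search(toy_expand, toy_score, depth=3, beam_width=2)[0]
```

An outcome-only scorer would have to finish all 2³ = 8 paths before ranking them; the step scorer discards most of them while they are still partial.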

Compute-optimal: the best way to spend a fixed budget depends on difficulty, with sequential revision on easy problems and parallel PRM search on hard ones. In Snell et al., choosing adaptively beat naïve best-of-N by >4× in efficiency.

05 / 07
The headline plot · log-x, almost a straight line

Accuracy climbs with log-compute.

The clean empirical signature: plot accuracy against the log of test-time compute and you get a near-straight rising line. Push the budget far enough and the line saturates, because more tokens are not monotonically better.

FIG.05 — TEST-TIME SCALING · accuracy vs log(compute) · illustrative shape
[Interactive figure: a compute-budget slider up to 1000×, with readouts for estimated accuracy and region.]

The shape is log-linear then saturating. Early compute buys big jumps; late compute buys little, then nothing — and can even degrade as the model overthinks and thrashes. The curve here is an illustration of the shape, not a measured benchmark.
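The shape described above can be written as a toy function. Every number here (`base`, `gain_per_decade`, `ceiling`) is an invented illustration value, not fitted to any benchmark, and the sketch models only saturation, not the late degradation from overthinking.

```python
import math


def illustrative_accuracy(
    compute: float,
    base: float = 0.20,             # assumed accuracy at the 1x baseline
    gain_per_decade: float = 0.20,  # assumed gain per 10x more compute
    ceiling: float = 0.80,          # assumed saturation level
) -> float:
    """Toy curve: log-linear in compute, then flat at a ceiling."""
    if compute < 1:
        raise ValueError("compute is measured relative to a 1x baseline")
    return min(ceiling, base + gain_per_decade * math.log10(compute))


# Each 10x step buys the same gain until the ceiling is reached.
curve = [illustrative_accuracy(10.0 ** e) for e in range(5)]  # 1x .. 10000x
```

On this toy curve the step from 1× to 10× buys as much as the step from 100× to 1000×, and the step to 10000× buys nothing.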

Concrete frontier moves it produced: GPT-4o solved ~12–13% of AIME 2024; o1 jumped to ~74% pass@1. DeepSeek-R1 reports 79.8% on AIME 2024, 97.3% on MATH-500, and a Codeforces Elo of 2029 (96.3rd percentile of human competitors).

Read these numbers carefully: small test sets, high variance, and pass@1 vs pass@k vs cons@k tell different stories. The shape is the lesson; the exact heights are benchmark-specific.
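To see why the metric matters, here is a sketch of two of them computed on the same samples. `pass_at_k` uses the standard unbiased estimator 1 − C(n−c, k)/C(n, k) for n samples with c correct; `cons_at_k` is a plain majority vote. The ten sample answers are invented for illustration.

```python
from collections import Counter
from math import comb
from typing import Sequence


def pass_at_k(n: int, c: int, k: int) -> float:
    """Chance that at least one of k draws (from n samples, c correct) is right."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)


def cons_at_k(answers: Sequence[str], gold: str) -> float:
    """1.0 if the majority answer among the samples equals gold, else 0.0."""
    majority, _ = Counter(answers).most_common(1)[0]
    return 1.0 if majority == gold else 0.0


# Invented example: 10 samples, 3 of them correct ("18").
samples = ["18", "18", "18", "9", "32", "26", "14", "24", "50", "9"]
p1 = pass_at_k(n=10, c=3, k=1)       # single-sample accuracy
p5 = pass_at_k(n=10, c=3, k=5)       # any of 5 samples correct
cons = cons_at_k(samples, gold="18")  # majority vote over all 10
```

The same ten samples give 0.3 for pass@1, roughly 0.92 for pass@5, and 1.0 for cons@10, which is why a headline number means little without its metric.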

06 / 07
Folding reasoning into the weights · DeepSeek-R1

Reward correct. Watch reasoning emerge.

The big leap of o1/R1: instead of needing many samples plus a verifier at inference, train the model so one long chain is usually right. DeepSeek-R1 (fully published) does this with GRPO and a rule-based reward. The figure below samples one group rollout and shows how the advantage is computed.

FIG.06 — GRPO · sample a GROUP, score each, baseline = the group's own mean
PROMPT — "compute 17 × 24" · gold = 408 · reward: 1 if boxed answer correct, else 0
Each roll samples 6 completions and computes their group-relative advantages.
// group-relative advantage (no critic!)
A_i = ( r_i − mean(r) ) / std(r)
// clipped PPO surrogate + KL leash
J = E[ (1/G) Σ min( ρ_i·A_i, clip(ρ_i, 1−ε, 1+ε)·A_i ) ] − β · KL(π_θ ‖ π_ref)

GRPO is PPO with the value network deleted. Instead of a learned critic baseline, sample a group of G answers to the same prompt and use the group's own mean as the baseline. With no critic to store or train (typically a model about as large as the policy), it needs far less memory than PPO, and it gives a clean low-variance signal when you can verify correctness.
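The two formulas above can be checked by hand on a group of six 0/1 rewards. This is a sketch under stated assumptions: population standard deviation, a small `eps` to avoid dividing by zero, and `clip_eps = 0.2` are all illustrative choices, and the KL term is left out.

```python
from statistics import fmean, pstdev
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """A_i = (r_i - mean(r)) / std(r), computed within one group of rollouts."""
    mu = fmean(rewards)
    sigma = pstdev(rewards)  # assumption: population std over the group
    # If every rollout scored the same, all advantages are 0: no signal.
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_term(ratio: float, advantage: float, clip_eps: float = 0.2) -> float:
    """One sample's term: min(rho * A, clip(rho, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


# Six completions for "compute 17 x 24": two boxed 408, four did not.
rewards = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
advantages = group_advantages(rewards)
```

With two correct answers out of six, the mean reward is 1/3, so correct rollouts get an advantage of about +1.41 and incorrect ones about −0.71. The clip then stops any single update from moving the policy ratio far outside [1 − ε, 1 + ε].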

THE "AHA MOMENT"

R1-Zero started from the DeepSeek-V3 base model with pure RL and no supervised reasoning data at all, and chains of thought, self-verification, and backtracking emerged on their own. The paper quotes the model writing "Wait, wait. Wait... let's reevaluate this step-by-step." Nobody trained that in; being correct made self-correction instrumentally useful.

Reward is rule-based, not neural: math answer matches gold, code passes tests, plus a format reward for wrapping reasoning in <think>…</think>. That makes it far harder to game than a neural reward model, though not immune to gaming. R1's traces then distil into small dense models cheaply.
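A rule-based reward of this kind fits in a few lines. The sketch below is an assumption-laden illustration, not DeepSeek's code: the `\boxed{}` answer convention, exact string matching, and the `format_weight` of 0.1 are all hypothetical choices.

```python
import re

THINK_RE = re.compile(r"^<think>.+?</think>", re.DOTALL)
BOXED_RE = re.compile(r"\\boxed\{([^{}]*)\}")


def rule_based_reward(completion: str, gold: str, format_weight: float = 0.1) -> float:
    """Accuracy reward (last boxed answer equals gold) plus a small format reward."""
    text = completion.strip()
    # Format: the completion must open with a non-empty <think> block.
    format_ok = THINK_RE.match(text) is not None
    # Accuracy: compare the last boxed answer with the gold answer.
    boxed = BOXED_RE.findall(text)
    correct = bool(boxed) and boxed[-1].strip() == gold.strip()
    return (1.0 if correct else 0.0) + (format_weight if format_ok else 0.0)


good = r"<think>17*24 = 17*20 + 17*4 = 340 + 68</think> \boxed{408}"
bad = r"The answer is \boxed{418}"
good_reward = rule_based_reward(good, "408")
bad_reward = rule_based_reward(bad, "408")
```

Note that a completion with a well-formed `<think>` block and a wrong answer still earns the format reward, which is exactly the kind of narrow signal a policy can learn to exploit.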

07 / 07
Honest caveats · what's published vs inferred

What this isn't.

Test-time compute is real and load-bearing — but it is hyped, and several of its claims are subtle. The discipline is to separate published fact from informed speculation.

AMPLIFIES, ≠ CREATES

thinking has a floor

Test-time compute beats a bigger model only when the base already has non-trivial success. On problems it never gets, no amount of thinking helps.

VERIFIABILITY GATES IT

math & code, not taste

Gains shine where correctness is checkable (gold answer, unit tests). On essays, strategy, and taste there's no oracle — gains shrink and get hard to measure.

o1 IS CLOSED

don't state its internals

Whether o1 runs explicit search or a PRM at inference, and its reward design, are not disclosed. R1 is the load-bearing published evidence the recipe works.

OVER- & UNDER-THINKING

More tokens are not monotonically better. o1-like models sometimes thrash — switching strategies prematurely or padding with empty "thinking" that adds no accuracy. Budget-forcing tricks (suppress the stop token, append "Wait") work, but they also saturate and can degrade.
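Budget forcing can be sketched as a loop around generation. Everything named here is hypothetical: `generate` stands in for a model call that continues the text and stops at an end-of-thinking marker, `END_THINK` is an assumed delimiter, and word count is used as a crude proxy for tokens.

```python
from typing import Callable

END_THINK = "</think>"  # assumed end-of-thinking delimiter


def budget_forced_thinking(
    generate: Callable[[str], str],
    prompt: str,
    min_words: int,
    max_extensions: int = 2,
) -> str:
    """Extend the thinking when the model tries to stop before min_words."""
    thinking = generate(prompt)
    for _ in range(max_extensions):
        ended = thinking.endswith(END_THINK)
        body = thinking[: -len(END_THINK)] if ended else thinking
        if len(body.split()) >= min_words:
            break
        # Drop the stop marker and nudge the model to keep going.
        thinking = body.rstrip() + " Wait,"
        thinking += generate(prompt + thinking)
    if not thinking.endswith(END_THINK):
        thinking += END_THINK
    return thinking


def stub_generate(text_so_far: str) -> str:
    """Stand-in for a model: always emits the same short thought, then stops."""
    return " 3 min/hour times 5 hours is 15 minutes." + END_THINK


forced = budget_forced_thinking(stub_generate, "Q: clock question\n<think>", min_words=12)
unforced = budget_forced_thinking(stub_generate, "Q: clock question\n<think>", min_words=1)
```

The stub shows the failure mode as well as the mechanism: forcing a longer trace only repeats the same thought, which is the padding the paragraph above warns about. The cap on extensions keeps the loop from running unbounded.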

And "reasoning" is a loaded word. The chains are useful serial computation and self-correction emerges — but whether that's reasoning in a human sense, or sophisticated pattern completion that looks like it, is genuinely open. The written CoT is also not guaranteed faithful to the true cause of the answer.

READ SCALING CURVES SKEPTICALLY

2025 critiques argue some reported test-time scaling is fragile — sensitive to the verifier, the benchmark, and the sample budget — and doesn't always reflect a true internal scaling capability. AIME/MATH are small sets, prone to contamination and variance. Always ask which metric, and demand FLOPs-matched comparisons.

Reward hacking still lurks: rule-based rewards dodge neural-RM exploits but invite their own (gaming the answer format without real reasoning); R1-Zero's language-mixing and readability collapse are symptoms of optimising a narrow reward.

02 · 05 — you made it

You bought the model time.

Two knobs. The scratchpad. Self-consistency votes, verifier-guided search, the log-compute curve, and GRPO — the rule-based RL that made reasoning emerge. You now know why o1 and R1 got good at math and code, and exactly where the magic stops being magic. Spend compute where it pays.
