There are two ways to make a model answer better: make it bigger (train-time compute), or let it think longer (test-time compute). Reasoning models like o1 and DeepSeek-R1 take the second road — they write a long internal chain of thought, sometimes branch and verify it, and accuracy keeps climbing with the compute you spend.
A question flows in. Instead of one instant answer, the model spends tokens reasoning — it can branch into several attempts, check itself, backtrack, and only then commit. Hard questions get more thinking; easy ones get less.
For years, "scaling" meant one thing: more parameters, more data, more GPU-hours. Test-time compute adds a second, independent knob you can turn after the model ships — per query, on demand.
Train-time compute: a bigger model, more tokens, more compute. Expensive, slow, and fixed at ship, so every query, trivial or brutal, runs the identical forward pass.
Test-time compute: the same weights, but the model does more work per prompt. It reasons step by step, tries several routes, and checks itself. Hard problems get more; easy ones get less.
Snell et al.: on problems a small model sometimes solves, test-time compute can beat a 14× larger model under matched FLOPs. Inference compute substitutes for scale.
OpenAI's o1 post showed two log-linear lines at once: accuracy rises with the log of train-time RL compute and with the log of test-time thinking. Two knobs, two clean curves — the headline empirical fact of the whole field.
A person who studied more (a bigger model) answers an olympiad problem better. But anyone answers it better given ten minutes of scratch paper instead of an instant reply. Test-time compute is the scratch paper — and a "reasoning model" is one trained to use it well.
The architecture is unchanged: still an ordinary autoregressive transformer, as far as published details show. o1, o3, R1, Gemini-thinking, and QwQ differ from standard chat models in behaviour, not structure: they are post-trained to produce long chains of thought that actually help, including self-correction.
A transformer can only do a bounded amount of computation per token. So writing more reasoning tokens literally buys more compute and more intermediate state. Raise the thinking budget and a wrong instant answer can become a right, worked one.
With zero thinking tokens the model pattern-matches and blurts — often the plausible-but-wrong number. Give it room to lay out the steps and the same weights reach the right answer, because each written step carries computation into the next.
This is chain-of-thought: prompt "think step by step," and the reasoning tokens act as working memory the forward pass would otherwise lack. No retraining — the cheapest lever there is.
One chain of thought can go wrong. Self-consistency samples k independent chains at temperature > 0 and takes the majority vote on the final answer. Right answers concentrate; wrong answers scatter. As samples are added, the correct answer pulls ahead.
The intuition: a hard problem has many valid routes to one right answer, but many different wrong answers. So correct votes pile up while errors spread thin. Pure sampling — no extra training, no verifier.
Reported gains over greedy CoT: +17.9% GSM8K, +11.0% SVAMP, +12.2% AQuA. This is the temporal cousin of an LLM council — a vote across samples of one model instead of across many models.
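The voting step itself is only a few lines. The sketch below is a minimal illustration, not the paper's code: `sample_answer` is an assumed callable that runs one sampled chain and returns its final answer (or `None` if none could be parsed), and `toy_sampler` is a made-up stand-in for a model that is right 40% of the time and otherwise spreads its errors over six wrong answers.

```python
import random
from collections import Counter
from typing import Callable, Optional


def self_consistency(
    sample_answer: Callable[[], Optional[str]], k: int
) -> tuple[Optional[str], float]:
    """Draw k independent chains and return (majority answer, vote share)."""
    answers = [sample_answer() for _ in range(k)]
    # Chains with no parseable final answer cast no vote.
    votes = Counter(a for a in answers if a is not None)
    if not votes:
        return None, 0.0
    winner, count = votes.most_common(1)[0]
    return winner, count / k


def toy_sampler(rng: random.Random) -> Callable[[], str]:
    """Hypothetical model: correct ("42") 40% of the time, else one of six wrong answers."""
    wrong = ["24", "36", "41", "43", "44", "48"]

    def sample() -> str:
        return "42" if rng.random() < 0.4 else rng.choice(wrong)

    return sample


if __name__ == "__main__":
    rng = random.Random(0)
    for k in (1, 5, 40):
        answer, share = self_consistency(toy_sampler(rng), k)
        print(f"k={k:>2}  majority={answer}  share={share:.2f}")
```

Even though the toy model is wrong more often than right on any single chain, the correct answer wins the vote at large k because no single wrong answer collects more than about 10% of the samples.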
Majority vote is a weak aggregator. A verifier scores candidate solutions so you can pick the best, or steer a search. An outcome model scores only the final answer; a process model scores every step, letting you kill bad branches early.
An outcome reward model (ORM) scores only the final answer: "is this right?" It is cheap to label (you just need the gold answer) but gives no credit for partial work, and can reward a right answer reached by a lucky wrong path.
A process reward model (PRM) scores each step: "is this step valid?" OpenAI's Let's Verify Step by Step (PRM800K, ~800k human step labels) showed PRMs beat ORMs at selecting correct MATH solutions, and they enable guided search: prune branches whose intermediate steps look bad.
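The difference between the two verifiers shows up on a "lucky wrong path". The sketch below is a toy under stated assumptions: the scorers are hand-written arithmetic checks standing in for learned ORM and PRM models, and scoring a chain by its weakest step is one common aggregation choice, not necessarily the one any given paper uses.

```python
import re
from typing import Callable, Sequence

Chain = Sequence[str]


def best_of_n_outcome(candidates: Sequence[Chain], score_final: Callable[[str], float]) -> Chain:
    """ORM-style selection: rank chains by the score of the final step only."""
    return max(candidates, key=lambda chain: score_final(chain[-1]))


def best_of_n_process(candidates: Sequence[Chain], score_step: Callable[[str], float]) -> Chain:
    """PRM-style selection: a chain is scored by its weakest step (an assumed aggregation)."""
    return max(candidates, key=lambda chain: min(score_step(s) for s in chain))


def prune(partials: Sequence[Chain], score_step: Callable[[str], float], threshold: float) -> list[Chain]:
    """Guided-search step: drop partial chains whose latest step scores below threshold."""
    return [p for p in partials if score_step(p[-1]) >= threshold]


def toy_step_score(step: str) -> float:
    """Stand-in for a PRM: checks steps of the form 'a+b=c' or 'a*b=c'."""
    m = re.fullmatch(r"(\d+)([+*])(\d+)=(\d+)", step)
    if m is None:
        return 1.0  # steps that are not arithmetic are not checked
    a, op, b, c = int(m[1]), m[2], int(m[3]), int(m[4])
    return 1.0 if (a + b if op == "+" else a * b) == c else 0.0


def toy_final_score(step: str) -> float:
    """Stand-in for an ORM on a problem whose gold answer is 10."""
    return 1.0 if step == "answer: 10" else 0.0


lucky = ["2+3=6", "6*2=10", "answer: 10"]   # right answer, invalid steps
valid = ["2+3=5", "5*2=10", "answer: 10"]   # right answer, valid steps
wrong = ["2+3=5", "5*2=12", "answer: 12"]   # wrong answer
candidates = [lucky, valid, wrong]

print(best_of_n_outcome(candidates, toy_final_score))  # cannot tell lucky from valid
print(best_of_n_process(candidates, toy_step_score))   # prefers the valid chain
```

The outcome scorer ties `lucky` and `valid` and returns whichever comes first; the process scorer separates them, and `prune` can discard the bad branch after its first step.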
Compute-optimal scaling (Snell et al.): the best way to spend a fixed budget depends on difficulty, with sequential revision on easy problems and parallel PRM search on hard ones. Choosing adaptively beat naïve best-of-N by >4× in efficiency.
The clean empirical signature: plot accuracy against the log of test-time compute and you get a near-straight rising line. Walk far enough along the curve and it saturates, because more tokens are not monotonically better.
The shape is log-linear then saturating. Early compute buys big jumps; late compute buys little, then nothing — and can even degrade as the model overthinks and thrashes. The curve here is an illustration of the shape, not a measured benchmark.
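One way to make that shape concrete is a clipped log-linear function. Everything numeric below is invented for illustration (the base accuracy, the gain per doubling, the ceiling, and the starting budget are assumptions, not measurements), and the sketch models only saturation, not the degradation from overthinking.

```python
import math


def illustrative_accuracy(
    tokens: int,
    base: float = 0.30,               # assumed accuracy at the starting budget
    gain_per_doubling: float = 0.08,  # assumed slope of the log-linear regime
    ceiling: float = 0.80,            # assumed saturation level
    start: int = 256,                 # assumed budget where the line begins
) -> float:
    """Toy curve: linear in log2(tokens) above `start`, clipped at `ceiling`."""
    if tokens <= start:
        return base
    return min(ceiling, base + gain_per_doubling * math.log2(tokens / start))


if __name__ == "__main__":
    for tokens in (256, 512, 1024, 4096, 16384, 65536, 262144):
        print(f"{tokens:>7} tokens -> {illustrative_accuracy(tokens):.2f}")
```

In the log-linear regime every doubling of the budget buys the same fixed increment; once the ceiling is reached, further doublings buy nothing.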
Concrete frontier moves it produced: GPT-4o solved ~12–13% of AIME 2024; o1 jumped to ~74% pass@1. DeepSeek-R1 reports 79.8% on AIME 2024, 97.3% on MATH-500, and a Codeforces Elo of 2029 (96.3rd percentile of human competitors).
Read these numbers carefully: small test sets, high variance, and pass@1 vs pass@k vs cons@k tell different stories. The shape is the lesson; the exact heights are benchmark-specific.
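To see how far the metrics can diverge on the same samples, here is a small sketch. `pass_at_k` uses the standard unbiased estimator 1 - C(n-c, k)/C(n, k) for n samples of which c are correct; `cons_at_k` is read here as "does the majority answer match gold", and the sample answers are made up.

```python
from collections import Counter
from math import comb
from typing import Sequence


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn (without replacement)
    from n samples, c of them correct, is correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)


def cons_at_k(answers: Sequence[str], gold: str) -> float:
    """1.0 if the majority answer among the samples equals gold, else 0.0."""
    winner, _ = Counter(answers).most_common(1)[0]
    return 1.0 if winner == gold else 0.0


# Made-up example: 16 samples on one problem, 4 correct ("70"),
# 5 agreeing on the same wrong answer ("72"), 7 scattered.
answers = ["70"] * 4 + ["72"] * 5 + ["11", "12", "13", "14", "15", "16", "17"]

print(f"pass@1  = {pass_at_k(16, 4, 1):.3f}")
print(f"pass@4  = {pass_at_k(16, 4, 4):.3f}")
print(f"cons@16 = {cons_at_k(answers, '70'):.1f}")
```

The same 16 samples give a modest pass@1, a much higher pass@4, and a failing cons@16, which is why a headline number means little without its metric.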
The big leap of o1/R1: instead of needing many samples plus a verifier at inference, train the model so one long chain is usually right. DeepSeek-R1 (fully published) does this with GRPO and a rule-based reward.
GRPO is PPO with the value network deleted. Instead of a learned critic baseline, it samples a group of G answers to the same prompt and uses the group's own mean reward as the baseline. Dropping the critic, typically a model about as large as the policy, cuts training memory substantially, and the signal is clean and low-variance when you can verify correctness.
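The group baseline can be sketched in a few lines. This shows only the advantage computation, normalised by the group's standard deviation as in the published GRPO formulation; the clipped policy-gradient objective and KL term are omitted, and the use of the population standard deviation plus a small `eps` guard are assumptions of this sketch.

```python
from statistics import mean, pstdev
from typing import Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> list[float]:
    """Advantage of each sampled answer relative to its own group.

    A_i = (r_i - mean(r)) / (std(r) + eps), with no learned critic involved.
    """
    baseline = mean(rewards)
    scale = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - baseline) / (scale + eps) for r in rewards]


# A group of G = 4 answers to one prompt, scored 1.0 if verified correct.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
# A group where every answer is wrong carries no learning signal.
print(group_relative_advantages([0.0, 0.0, 0.0, 0.0]))
```

Correct answers in a mixed group get positive advantages and incorrect ones negative; a group that is all right or all wrong yields zeros, so the prompts that teach the most are the ones the model solves only sometimes.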
R1-Zero started from base V3 with pure RL, no reasoning data at all — and chains of thought, self-verification, and backtracking emerged on their own. The paper quotes the model writing "Wait, wait. Wait... let's reevaluate this step-by-step." Nobody trained that in; being correct made self-correction instrumentally useful.
Reward is rule-based, not neural: math answer matches gold, code passes tests, plus a format reward for wrapping reasoning in <think>…</think>. That makes it far harder to game than a neural critic, which a policy under RL pressure learns to hack, though not impossible. R1's traces then distil into small dense models cheaply.
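A rule-based reward of this kind can be written as plain string checks. The sketch below is a guess at the general shape, not DeepSeek's implementation: the exact-match normalisation, the `format_weight` value, and the assumption that the final answer is whatever follows the closing tag are all illustrative choices.

```python
import re
from typing import Optional

# Reasoning wrapped in <think>...</think>, followed by the final answer.
THINK_BLOCK = re.compile(r"<think>(.*?)</think>(.*)", re.DOTALL)


def split_completion(completion: str) -> Optional[tuple[str, str]]:
    """Return (reasoning, answer) if the completion follows the format, else None."""
    m = THINK_BLOCK.fullmatch(completion.strip())
    return (m.group(1), m.group(2).strip()) if m else None


def normalize(answer: str) -> str:
    """Assumed normalisation: trim whitespace, drop a trailing period, lowercase."""
    return answer.strip().rstrip(".").lower()


def rule_based_reward(completion: str, gold: str, format_weight: float = 0.5) -> float:
    """Accuracy reward (answer matches gold) plus a weighted format reward."""
    parsed = split_completion(completion)
    format_ok = 1.0 if parsed else 0.0
    answer = parsed[1] if parsed else completion
    accuracy = 1.0 if normalize(answer) == normalize(gold) else 0.0
    return accuracy + format_weight * format_ok


print(rule_based_reward("<think>6 * 7 = 42</think> 42", "42"))  # right answer, right format
print(rule_based_reward("42", "42"))                            # right answer, no format
print(rule_based_reward("<think>6 * 7 = 41</think> 41", "42"))  # wrong answer, right format
```

Nothing here inspects the reasoning itself: an empty `<think></think>` block followed by the right answer earns full reward, so the checks constrain the output, not the thinking.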
Test-time compute is real and load-bearing — but it is hyped, and several of its claims are subtle. The discipline is to separate published fact from informed speculation.
Test-time compute beats a bigger model only when the base already has non-trivial success. On problems it never gets, no amount of thinking helps.
Gains shine where correctness is checkable (gold answer, unit tests). On essays, strategy, and taste there's no oracle — gains shrink and get hard to measure.
Whether o1 runs explicit search or a PRM at inference, and its reward design, are not disclosed. R1 is the load-bearing published evidence the recipe works.
More tokens are not monotonically better. o1-like models sometimes thrash — switching strategies prematurely or padding with empty "thinking" that adds no accuracy. Budget-forcing tricks (suppress the stop token, append "Wait") work, but they also saturate and can degrade.
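Budget forcing amounts to a small control loop around generation. In the sketch below, `generate` is a hypothetical callable (not any real library's API) that continues a text for up to a given number of tokens and reports whether the model tried to stop; counting tokens by whitespace and capping the number of extensions are simplifying assumptions.

```python
from typing import Callable

# Hypothetical interface: generate(text_so_far, max_new_tokens) -> (new_text, hit_stop)
Generate = Callable[[str, int], tuple[str, bool]]


def budget_forced_thinking(
    generate: Generate,
    prompt: str,
    min_tokens: int,
    max_tokens: int,
    count_tokens: Callable[[str], int] = lambda s: len(s.split()),
    max_extensions: int = 4,
) -> str:
    """Keep the model thinking until min_tokens, and cut it off at max_tokens."""
    thinking = ""
    extensions = 0
    while True:
        remaining = max_tokens - count_tokens(thinking)
        if remaining <= 0:
            break  # upper budget reached: force the end of thinking
        chunk, hit_stop = generate(prompt + thinking, remaining)
        thinking += chunk
        if not hit_stop:
            break  # generation ran out of budget mid-thought
        if count_tokens(thinking) >= min_tokens or extensions >= max_extensions:
            break  # the model stopped and has thought long enough
        # The model tried to stop early: suppress the stop and nudge it onward.
        thinking += " Wait,"
        extensions += 1
    return thinking


def fake_generate(text: str, max_new_tokens: int) -> tuple[str, bool]:
    """Toy stand-in for a model that always stops after four words."""
    return " the answer is 4.", True


print(budget_forced_thinking(fake_generate, "Q: 2+2?", min_tokens=10, max_tokens=100))
```

The toy model here just repeats itself after each "Wait,", which is the failure mode the paragraph above describes: the extra tokens are spent without adding anything.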
And "reasoning" is a loaded word. The chains are useful serial computation and self-correction emerges — but whether that's reasoning in a human sense, or sophisticated pattern completion that looks like it, is genuinely open. The written CoT is also not guaranteed faithful to the true cause of the answer.
2025 critiques argue some reported test-time scaling is fragile — sensitive to the verifier, the benchmark, and the sample budget — and doesn't always reflect a true internal scaling capability. AIME/MATH are small sets, prone to contamination and variance. Always ask which metric, and demand FLOPs-matched comparisons.
Reward hacking still lurks: rule-based rewards dodge neural-RM exploits but invite their own (gaming the answer format without real reasoning); R1-Zero's language-mixing and readability collapse are symptoms of optimising a narrow reward.
Two knobs. The scratchpad. Self-consistency votes, verifier-guided search, the log-compute curve, and GRPO — the rule-based RL that made reasoning emerge. You now know why o1 and R1 got good at math and code, and exactly where the magic stops being magic. Spend compute where it pays.
You met R1 the reasoner. Now meet the model underneath it — Multi-head Latent Attention, MoE, FP8, and the GRPO recipe in its full architectural context.