Not every question needs your most expensive model. Query difficulty is heavily skewed — most real traffic is easy, and a cheap model answers it indistinguishably. Model routing is the engineering of paying the frontier price only on the hard tail.
A stream of requests arrives. A lightweight router looks at each one and decides: is the cheap model enough, or does this need the frontier model? Get that decision right often enough and you pay the high price only where it matters.
The economic premise is empirical and robust: a large fraction of real traffic is easy — chit-chat, formatting, lookups, simple rewrites — and a cheap model answers it just as well. The expensive model only earns its price on the hard tail. Drag the difficulty mix and watch the bill move.
A system that always calls the frontier model pays the tail price for every token. Routing pays it only on the tail. The further right you drag — the more your real traffic is easy — the bigger the prize.
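The arithmetic behind that claim can be sketched in a few lines. This is a minimal illustration, not a pricing model: the function names and the 20x price ratio are hypothetical, and it assumes an idealized router that sends every easy request to the cheap model at no cost of its own.

```python
def blended_cost(easy_fraction: float, cheap_price: float, frontier_price: float) -> float:
    """Average price per request when easy traffic goes to the cheap model."""
    if not 0.0 <= easy_fraction <= 1.0:
        raise ValueError("easy_fraction must be in [0, 1]")
    return easy_fraction * cheap_price + (1.0 - easy_fraction) * frontier_price


def savings_ratio(easy_fraction: float, cheap_price: float, frontier_price: float) -> float:
    """How many times cheaper routing is than always calling the frontier model."""
    return frontier_price / blended_cost(easy_fraction, cheap_price, frontier_price)


# Hypothetical prices: the frontier model costs 20x the cheap one per request.
for easy in (0.0, 0.5, 0.8, 0.95):
    print(f"easy={easy:.2f}  savings={savings_ratio(easy, 1.0, 20.0):.2f}x")
```

With no easy traffic the ratio is exactly 1, which is the uniformly-hard case: there is nothing to offload. A real router also misroutes some requests and has a cost of its own, so measured savings will sit below this idealized curve.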
This is why routing shines on conversational benchmarks (chat is mostly easy) and barely moves on uniformly-hard math: there's no easy traffic to offload. Measure your distribution before believing any headline number.
Routing, cascading, and speculative decoding get lumped together as "cost optimization", but they're orthogonal and stack. They differ in when the decision is made: before generating, after generating, or per token.
A learned router reads the query and picks one model before any generation. Cheapest to run — but blind, it bets without seeing output.
Generate with the cheap model first, score the answer, and escalate only if unsure. Self-correcting — but on hard queries you pay twice.
A small model drafts tokens; the big model verifies them in one pass. 2–3× faster, identical output — provably the same distribution.
| Layer | Decision | When | Cost saved | Quality risk |
|---|---|---|---|---|
| Routing | which model gets the query? | before generation | up to ~3.7× | router misjudges → wrong model |
| Cascading | good enough, or escalate? | after each generation | up to ~98% | latency + double-spend on hard queries |
| Speculative | accept these draft tokens? | per token, in one generation | 2–3× wall-clock speedup | none (provably exact) |
A learned router scores each query's win-probability for the strong model, then routes by a single threshold α. Drag α — the cost dial — and watch traffic flow toward the cheap or the frontier model, with quality recovered (PGR) and cost tracking live.
The best RouteLLM router uses matrix factorization borrowed from recommender systems — models × queries, exactly like Netflix's users × movies. It's trained on human preference votes from Chatbot Arena, not hand-coded difficulty rules.
α is the only knob. You calibrate it on a validation set to hit a target like "I'll send 30% of traffic to the frontier model." The headline: ≈3.66× cheaper on MT-Bench at 95% quality, only ≈1.4× on hard math — and routers transferred to an unseen model pair without retraining.
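One way to picture that calibration step is as a quantile lookup over the router's scores on the validation set. The sketch below is an assumption-laden illustration, not RouteLLM's code: `calibrate_threshold` and `route` are hypothetical names, the validation scores are synthetic, and it assumes the convention that a query goes to the strong model when its win-probability is at least α.

```python
import numpy as np


def calibrate_threshold(win_probs, target_strong_fraction: float) -> float:
    """Pick alpha so roughly target_strong_fraction of scores are >= alpha."""
    scores = np.asarray(win_probs, dtype=float)
    # The (1 - f) quantile leaves about a fraction f of the scores at or above it.
    return float(np.quantile(scores, 1.0 - target_strong_fraction))


def route(win_prob: float, alpha: float) -> str:
    """Assumed convention: high win-probability for the strong model means 'strong'."""
    return "strong" if win_prob >= alpha else "weak"


# Synthetic stand-in for router outputs on a validation set (not real data).
rng = np.random.default_rng(0)
val_scores = rng.beta(2.0, 5.0, size=10_000)

alpha = calibrate_threshold(val_scores, 0.30)
share = float(np.mean([route(s, alpha) == "strong" for s in val_scores]))
print(f"alpha={alpha:.3f}  strong share={share:.3f}")
```

The threshold only guarantees the traffic split on data that looks like the validation set. If live traffic shifts, the same α will send a different share to the frontier model.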
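PGR (Performance Gap Recovered) = (router − weak) / (strong − weak). 0.0 = weak, 1.0 = strong, 0.5 = halfway to frontier quality. CPT(x%), the call-performance threshold, is the smallest share of traffic that must go to the frontier model to reach a PGR of x%. CPT(50%) ≈ 37% on MT-Bench: half the quality gap recovered while sending only 37% of traffic to the frontier model.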
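Both metrics are simple to compute once you have benchmark scores. The following is a small sketch with made-up numbers: `pgr` follows the formula above directly, while `cpt` is a hypothetical helper that reads the threshold off a list of measured (strong-traffic share, PGR) points rather than interpolating.

```python
def pgr(router_score: float, weak_score: float, strong_score: float) -> float:
    """Performance Gap Recovered: 0.0 matches the weak model, 1.0 the strong one."""
    gap = strong_score - weak_score
    if gap <= 0:
        raise ValueError("strong model must outscore the weak model")
    return (router_score - weak_score) / gap


def cpt(curve, target_pgr: float):
    """Smallest measured strong-traffic share whose PGR reaches target_pgr.

    curve: iterable of (strong_fraction, pgr_value) pairs from an alpha sweep.
    Returns None if no measured point reaches the target.
    """
    reached = [frac for frac, value in curve if value >= target_pgr]
    return min(reached) if reached else None


# Illustrative scores only, not benchmark results.
halfway = pgr(router_score=7.5, weak_score=6.0, strong_score=9.0)
sweep = [(0.10, 0.20), (0.40, 0.55), (0.70, 0.85)]
print(halfway, cpt(sweep, 0.50), cpt(sweep, 0.95))
```

Sweeping α and recording one (share, PGR) pair per setting gives the curve that `cpt` reads from.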
A cascade flips the order: hit the cheapest model, score the answer's reliability, and escalate to a stronger model only if it fails the bar. Press Run → on queries of different difficulty and watch the cascade climb — paying as it goes.
FrugalGPT reports matching GPT-4 at up to 98% lower cost on favourable benchmarks, because the easy majority never leaves the cheap tier. But on a hard query you pay for the weak model, the strong model, and the scorer: net more expensive than going straight to the frontier model.
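The double-spend is easy to see in a toy cascade with explicit cost accounting. Everything here is a stand-in: `Tier`, `run_cascade`, the prices, and the lambda "models" are hypothetical, and the toy scorer is an oracle, which is exactly the part a real system does not get for free.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Tier:
    name: str
    price: float                     # hypothetical cost per call
    generate: Callable[[str], str]   # stand-in for a model call


def run_cascade(
    query: str,
    tiers: List[Tier],
    scorer: Callable[[str, str], float],
    scorer_price: float,
    bar: float,
) -> Tuple[str, float]:
    """Try tiers cheapest-first; stop at the first answer whose score clears the bar."""
    spent = 0.0
    answer = ""
    for i, tier in enumerate(tiers):
        answer = tier.generate(query)
        spent += tier.price
        if i == len(tiers) - 1:
            break                    # last tier: nothing left to escalate to
        spent += scorer_price        # every non-final answer gets scored
        if scorer(query, answer) >= bar:
            break
    return answer, spent


cheap = Tier("cheap", 1.0, lambda q: "ok" if "easy" in q else "?")
frontier = Tier("frontier", 20.0, lambda q: "ok")
toy_scorer = lambda q, a: 0.9 if a == "ok" else 0.1  # oracle-like, unrealistic

easy_answer, easy_cost = run_cascade("easy question", [cheap, frontier], toy_scorer, 0.1, 0.5)
hard_answer, hard_cost = run_cascade("hard question", [cheap, frontier], toy_scorer, 0.1, 0.5)
print(easy_cost, hard_cost)
```

The easy query stops at the cheap tier, while the hard one pays for the cheap call, the scorer, and the frontier call, ending above the frontier price alone.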
The hard, unsolved part is the deferral estimate — knowing the cheap answer is wrong without an oracle. Trained hidden-state probes give the most reliable confidence; verbalized "I'm 90% sure" is poorly calibrated. A cascade is only as good as its ability to notice its own mistakes.
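Whether a confidence signal is calibrated is something you can measure on your own labelled data. Expected calibration error (ECE) is one common metric for this. The sketch below is a plain equal-width-bin version; the function name and the toy inputs are illustrative, and it is not taken from any of the systems discussed here.

```python
import numpy as np


def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    """Weighted average gap between stated confidence and observed accuracy per bin."""
    conf = np.asarray(confidences, dtype=float)
    hit = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Assign each confidence to one of n_bins equal-width bins (indices 0..n_bins-1).
    idx = np.clip(np.digitize(conf, edges[1:-1]), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            # mask.mean() is the share of samples that fall in this bin.
            ece += mask.mean() * abs(hit[mask].mean() - conf[mask].mean())
    return float(ece)


# Toy case: the model says "90% sure" every time but is right 6 times out of 10.
overconfident = expected_calibration_error([0.9] * 10, [1] * 6 + [0] * 4)
print(overconfident)
```

A score near zero means stated confidence tracks accuracy. A large score means the escalation bar is being set against numbers that do not mean what they say.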
The one technique with zero quality cost. A small draft model guesses several tokens; the big model checks them all in a single parallel forward pass. Accepted tokens come free. Press Decode → and watch the draft-verify-accept loop run, token by token.
The trick exploits that inference is memory-bandwidth-bound: verifying K tokens in one forward pass costs barely more than generating one. The modified rejection-sampling scheme makes the output provably identical to sampling the target model alone — same temperature, same everything. You reproduce the big model exactly, just faster.
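The accept-or-resample rule can be written down for a single token position. This is a sketch over explicit probability vectors rather than real model outputs: `p` stands for the target model's next-token distribution and `q` for the draft's, and both vectors are made up. The second function computes the exact distribution of the returned token, so the equivalence can be checked without sampling.

```python
import numpy as np


def speculative_step(p: np.ndarray, q: np.ndarray, rng: np.random.Generator):
    """One draft-then-verify step. Returns (token, accepted)."""
    x = rng.choice(len(q), p=q)                  # draft proposes a token
    if rng.random() < min(1.0, p[x] / q[x]):     # accept with prob min(1, p/q)
        return int(x), True
    residual = np.maximum(p - q, 0.0)            # on rejection, resample from the
    residual /= residual.sum()                   # normalized leftover mass max(p - q, 0)
    return int(rng.choice(len(p), p=residual)), False


def output_distribution(p: np.ndarray, q: np.ndarray) -> np.ndarray:
    """Exact distribution of the token returned by speculative_step."""
    accept = np.minimum(p, q)                    # q(x) * min(1, p(x) / q(x))
    reject_mass = 1.0 - accept.sum()
    residual = np.maximum(p - q, 0.0)
    if residual.sum() > 0:
        residual = residual / residual.sum()
    return accept + reject_mass * residual


p = np.array([0.5, 0.3, 0.2])   # made-up target distribution
q = np.array([0.2, 0.2, 0.6])   # made-up draft distribution
print(output_distribution(p, q))
token, accepted = speculative_step(p, q, np.random.default_rng(0))
```

The draft distribution only affects how often tokens are accepted, not what comes out: the accepted mass plus the resampled mass adds back up to `p`.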
Speedup is governed by the acceptance rate α (the draft's per-token agreement with the target, unrelated to the routing threshold α above): more agreement means more tokens per target pass and fewer expensive passes. Reported speedups are 2–3× wall-clock with identical outputs. The sweet spot is a same-family draft ~10–20× smaller. The catch: you need your own weights and the memory for two models, so it does nothing for a closed API.
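The link between acceptance rate and tokens per pass has a standard closed form under a simplifying assumption: if each draft token is accepted independently with probability α and K tokens are drafted, the expected number of tokens emitted per target pass is (1 − α^(K+1)) / (1 − α). The sketch below just evaluates that expression. It ignores the draft model's own running time, so it is an upper bound on tokens per pass, not a wall-clock speedup.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens emitted per target forward pass with k drafted tokens.

    Assumes each draft token is accepted independently with probability alpha.
    """
    if not 0.0 <= alpha <= 1.0 or k < 0:
        raise ValueError("need 0 <= alpha <= 1 and k >= 0")
    if alpha == 1.0:
        return float(k + 1)          # every draft accepted, plus one token from the target
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


for a in (0.5, 0.7, 0.9):
    print(a, round(expected_tokens_per_pass(a, 4), 3))
```

Even at α = 0 the target pass still yields one token, so the floor is ordinary decoding plus the wasted draft work.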
OpenAlice practices this philosophy under a different name. The rule — spend the cheapest resource that still answers well — generalizes from model choice to the whole improvement loop. A change enters at the top and falls as far down as it can before any token is spent.
A RouteLLM-style router could send easy companion-chat turns to a cheaper model, reserving the frontier model for hard reasoning and coding turns.
Distilling a hot-path LLM call (e.g. mood detection) into a cheap classifier is textbook "LLM approximation" — but soul-adjacent and gated on behavioral-envelope tests.
"Try the token-free path; if it can't decide, escalate to a live probe" is a cascade with a deferral judge — OpenAlice's α is the change scope, not a sigmoid.
Every saving here is a bet, and the bets can lose. Before you ship a router, internalize the failure modes — they're the difference between cutting cost and silently shipping bad answers.
"98% cheaper" / "3.66×" come from favourable distributions (chat, MT-Bench). On uniformly-hard or out-of-distribution traffic the gains shrink toward 1× — and a mis-calibrated router can be worse than always-strong, shipping wrong answers at the weak price.
Confidence estimation is unsolved. Cascades live or die on knowing when the cheap answer is wrong. Verbalized confidence is poorly calibrated; probes need training data and degrade out-of-domain. There is no cheap, general "is this answer good?" oracle yet — that's open research, not a config flag.
A router is trained against specific models. Swap the weak model, change a system prompt, or let the API update silently, and its win-probability estimates rot. Treat the router as a model with a maintenance burden, not a static rule.
"Same quality" hides distribution shift: aggregate quality can match while the failure modes change — a router that offloads to a weak model may be fine on average but systematically worse on a sub-population (e.g. a non-English cohort) the aggregate metric hides. And non-determinism means you need fixed-seed golden traces to measure routers honestly at all.
| Technique | Free lunch? | Needs own weights? | Biggest risk |
|---|---|---|---|
| Routing | no — trades quality for cost | no | blind misjudgement → bad answer at weak price |
| Cascading | no — pays twice on the tail | no | mis-calibrated confidence + P99 latency |
| Speculative | yes, provably exact | yes (+ memory for 2 models) | speedup collapses if the draft diverges from the target distribution |
The skew that makes it pay. Routing before, cascading after, speculation per token. One dial sliding cost against quality, and a router that is itself a model you must maintain. The cheapest resource that still answers well — that's the whole craft.
Routing decides which model. Scaling laws decide how good any given model can be for a fixed compute budget — the curve the whole fleet sits on.