openalicelabs / academy
COURSE SYS-03 LESSON 03 · 04 TOPIC MODEL ROUTING EST. READ ~13 MIN
OPENALICE LABORATORIES · SYSTEMS PATH · SERVING 03 · 04

Send each query to the cheapest model that still wins.

Not every question needs your most expensive model. Query difficulty is heavily skewed — most real traffic is easy, and a cheap model answers it indistinguishably. Model routing is the engineering of paying the frontier price only on the hard tail.

FIG.00 — ONE QUERY · WHICH MODEL?
FIG.0A — THE PREMISE · easy queries → cheap model · hard tail → frontier model

A stream of requests arrives. A lightweight router looks at each one and decides: is the cheap model enough, or does this need the frontier model? Get that decision right often enough and you pay the high price only where it matters.

ROUTING: pick the model BEFORE generating (predictive)
CASCADING: generate cheap, then escalate if unsure (reactive)
SPECULATIVE DECODING: small model drafts, big model verifies (per-token)
THE ONE DIAL: cost ↔ quality, slid along a Pareto frontier
RULE OF THUMB: routing ≈ up to 3.7× cheaper on chat traffic
01 / 07
The business case · difficulty is skewed

Most queries are easy. That's the whole edge.

The economic premise is empirical: a large fraction of real traffic is easy (chit-chat, formatting, lookups, simple rewrites), and a cheap model answers it just as well. The expensive model only earns its price on the hard tail, so the bill moves with the difficulty mix.

FIG.01 — TRAFFIC HISTOGRAM · always-frontier vs routed cost

[Interactive figure: a slider runs from ALL HARD to ALL EASY. Readouts: easy share (→ cheap model), hard tail (→ frontier), always-frontier cost, routed cost, savings.]

A system that always calls the frontier model pays the tail price for every token. Routing pays it only on the tail. The larger the easy share of your real traffic, the bigger the prize.

This is why routing shines on conversational benchmarks (chat is mostly easy) and barely moves on uniformly-hard math: there's no easy traffic to offload. Measure your distribution before believing any headline number.
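The arithmetic behind the histogram can be sketched in a few lines. This is a minimal illustration, assuming a perfect router and two hypothetical per-query prices (`cheap_price`, `frontier_price`); none of these names or numbers come from the lesson.

```python
def routed_cost(easy_share: float, cheap_price: float, frontier_price: float) -> float:
    """Average cost per query when the easy share goes to the cheap model.

    Assumes an idealized router that never misclassifies a query.
    """
    if not 0.0 <= easy_share <= 1.0:
        raise ValueError("easy_share must be in [0, 1]")
    return easy_share * cheap_price + (1.0 - easy_share) * frontier_price


def savings_factor(easy_share: float, cheap_price: float, frontier_price: float) -> float:
    """How many times cheaper routing is than always calling the frontier model."""
    return frontier_price / routed_cost(easy_share, cheap_price, frontier_price)


# Hypothetical prices: the cheap model costs 1 unit per query, the frontier model 20.
mostly_easy = savings_factor(0.8, 1.0, 20.0)   # 80% of traffic is easy
all_hard = savings_factor(0.0, 1.0, 20.0)      # nothing to offload
```

With no easy traffic the factor is exactly 1, which is the uniformly-hard case described above. A real router makes mistakes, so measured savings would sit below this idealized bound.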

02 / 07
Three orthogonal mechanisms · they compose

Routing, cascading, and speculation.

These three are lumped together as "cost optimization" but they're orthogonal and stack. They differ in when the decision is made — before generating, after generating, or per token.

ROUTING · BEFORE

predictive · pick once

A learned router reads the query and picks one model before any generation. Cheapest to run — but blind, it bets without seeing output.

CASCADING · AFTER

reactive · escalate

Generate with the cheap model first, score the answer, and escalate only if unsure. Self-correcting — but on hard queries you pay twice.

SPECULATIVE · PER TOKEN

exact · free lunch

A small model drafts tokens; the big model verifies them in one pass. 2–3× faster, identical output — provably the same distribution.

| Layer | Decision | When | Cost saved | Quality risk |
|---|---|---|---|---|
| Routing | which model gets the query? | before generation | up to ~3.7× | router misjudges → wrong model |
| Cascading | good enough, or escalate? | after each generation | up to ~98% | latency + double-spend on hard queries |
| Speculative | accept these draft tokens? | per token, in one generation | 2–3× latency | none, provably exact |
03 / 07
RouteLLM · a win-probability model + one dial

The router is just a threshold.

A learned router scores each query's win-probability for the strong model, then routes by a single threshold α. α is the cost dial: moving it shifts traffic between the cheap and the frontier model, trading quality recovered (PGR) against cost.

FIG.03 — BINARY ROUTER · 24 QUERIES · α THRESHOLD
queries sorted by P(strong wins) — bar = the α cutoff
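[Interactive figure: a slider for the threshold α. With the rule "route to strong if P ≥ α", α=0 sends every query to the frontier model and α=1 sends almost every query to the cheap model. Readouts: α (threshold), frontier calls, cheap calls, PGR (quality recovered), cost vs always-frontier.]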
P(win_strong | q) = σ( s(M_strong, q) − s(M_weak, q) )   // a Bradley-Terry sigmoid over query q
route → strong if P ≥ α, else → weak
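The rule above can be sketched directly. This is a minimal illustration, assuming the per-model scores s(M, q) are already available as plain floats; in a real router they come from a trained model, and the function names here are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def p_strong_wins(score_strong: float, score_weak: float) -> float:
    """Bradley-Terry win probability of the strong model on one query."""
    return sigmoid(score_strong - score_weak)


def route(score_strong: float, score_weak: float, alpha: float) -> str:
    """Send the query to the strong model when P(win_strong | q) >= alpha."""
    return "strong" if p_strong_wins(score_strong, score_weak) >= alpha else "weak"


# Hypothetical scores for two queries, routed at alpha = 0.5.
hard_query = route(score_strong=2.0, score_weak=0.0, alpha=0.5)
easy_query = route(score_strong=0.0, score_weak=2.0, alpha=0.5)
```

Under this rule a lower α sends more traffic to the strong model, so α = 0 routes everything to it.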

The best RouteLLM router uses matrix factorization borrowed from recommender systems — models × queries, exactly like Netflix's users × movies. It's trained on human preference votes from Chatbot Arena, not hand-coded difficulty rules.

α is the only knob. You calibrate it on a validation set to hit a target like "I'll send 30% of traffic to the frontier model." The headline: ≈3.66× cheaper on MT-Bench at 95% quality, only ≈1.4× on hard math — and routers transferred to an unseen model pair without retraining.
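One way to calibrate α against a traffic target is to take a quantile of the router's scores on a validation set. The sketch below assumes you already have a list of win probabilities for validation queries; `calibrate_alpha` is a hypothetical helper, not a RouteLLM API.

```python
import math


def calibrate_alpha(win_probs: list[float], target_strong_share: float) -> float:
    """Pick alpha so that roughly target_strong_share of queries satisfy P >= alpha.

    Ties at the cutoff can push slightly more traffic to the strong model.
    """
    if not win_probs:
        raise ValueError("need at least one validation score")
    ranked = sorted(win_probs, reverse=True)
    k = round(target_strong_share * len(ranked))
    if k <= 0:
        # Threshold just above the highest score: nothing is routed to strong.
        return math.nextafter(ranked[0], math.inf)
    return ranked[min(k, len(ranked)) - 1]


# Hypothetical validation scores; target: 30% of traffic to the frontier model.
validation = [0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95]
alpha = calibrate_alpha(validation, 0.30)
strong_share = sum(p >= alpha for p in validation) / len(validation)
```

The share only holds on traffic that resembles the validation set, so the threshold needs recalibrating when the query mix shifts.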

PGR (Performance Gap Recovered) = (router − weak) / (strong − weak). 0.0 = weak, 1.0 = strong, 0.5 = halfway to frontier quality. CPT(x%), the call-performance threshold, is the smallest share of calls to the strong model needed to reach a PGR of x%. CPT(50%) ≈ 37% on MT-Bench: half the quality gap recovered while sending only 37% of traffic to the frontier model.
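Both metrics are short enough to compute by hand. The sketch below is illustrative: the benchmark scores and the (strong-call share, PGR) curve are made-up numbers, and the helper names are hypothetical.

```python
def pgr(router_score: float, weak_score: float, strong_score: float) -> float:
    """Performance Gap Recovered: 0.0 at weak-model quality, 1.0 at strong-model quality."""
    gap = strong_score - weak_score
    if gap == 0:
        raise ValueError("strong and weak scores are equal; PGR is undefined")
    return (router_score - weak_score) / gap


def cpt(curve: list[tuple[float, float]], target_pgr: float) -> float:
    """Smallest measured strong-call share whose PGR reaches target_pgr.

    curve holds (strong_call_share, pgr) points from sweeping the threshold.
    """
    eligible = [share for share, recovered in curve if recovered >= target_pgr]
    if not eligible:
        raise ValueError("no point on the curve reaches the target PGR")
    return min(eligible)


# Hypothetical benchmark scores: weak 6.0, strong 9.0, routed system 7.5.
halfway = pgr(7.5, 6.0, 9.0)

# Hypothetical sweep over thresholds: (share of calls to strong, PGR achieved).
sweep = [(0.0, 0.0), (0.2, 0.3), (0.4, 0.55), (0.6, 0.8), (1.0, 1.0)]
cpt_50 = cpt(sweep, 0.5)
```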

04 / 07
FrugalGPT · generate first, then decide

Or: try cheap, escalate if unsure.

A cascade flips the order: hit the cheapest model, score the answer's reliability, and escalate to a stronger model only if it fails the bar. Queries of different difficulty climb to different rungs, and the cascade pays for every rung it visits.

FIG.04 — LLM CASCADE · cheap → mid → frontier, with a stop-judge
[Interactive figure: three preset queries (easy: "what's 2+2?", medium: rewrite a paragraph, hard: prove a theorem). Readouts: models called, total cost paid, cost vs straight-to-frontier. Default state: the easy query, where confidence clears the cheap bar fast.]
// one cascade step
1. ask the cheapest remaining model
2. a scorer rates the answer's reliability
3. stop-judge: score ≥ threshold? yes → return · no → escalate ↑
repeat up the ladder

FrugalGPT reports matching GPT-4 at up to 98% lower cost on favourable benchmarks — because the easy majority never leaves the cheap tier. But on a hard query you pay the weak model and the strong one and the scorer: net more expensive than going straight to frontier.
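The cascade loop and its double-spend on hard queries can be sketched as follows. This is a toy under stated assumptions: the models are stub functions, the scorer is a hypothetical lookup, and all prices are invented; it is not FrugalGPT's implementation.

```python
from typing import Callable, List, Tuple

# (name, cost per call, generate function): all hypothetical.
Rung = Tuple[str, float, Callable[[str], str]]


def cascade(query: str, ladder: List[Rung], scorer: Callable[[str, str], float],
            threshold: float, scorer_cost: float = 0.0) -> Tuple[str, str, float]:
    """Climb the ladder cheapest-first; return (model used, answer, total cost paid)."""
    if not ladder:
        raise ValueError("ladder must contain at least one model")
    total, name, answer = 0.0, "", ""
    for i, (name, cost, generate) in enumerate(ladder):
        answer = generate(query)
        total += cost
        if i == len(ladder) - 1:
            break  # top rung: nothing left to escalate to, so skip scoring
        total += scorer_cost
        if scorer(query, answer) >= threshold:
            break  # stop-judge accepts the answer
    return name, answer, total


# Stub models and a stub scorer that trusts the cheap model only on the easy query.
ladder = [("cheap", 1.0, lambda q: "cheap answer"),
          ("frontier", 20.0, lambda q: "frontier answer")]
scorer = lambda q, a: 0.9 if q == "what's 2+2?" else 0.1

easy = cascade("what's 2+2?", ladder, scorer, threshold=0.5, scorer_cost=0.1)
hard = cascade("prove a theorem", ladder, scorer, threshold=0.5, scorer_cost=0.1)
```

In this toy the hard query costs 21.1 units against 20 for going straight to the frontier model, which is the double-spend described above.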

The hard, unsolved part is the deferral estimate — knowing the cheap answer is wrong without an oracle. Trained hidden-state probes give the most reliable confidence; verbalized "I'm 90% sure" is poorly calibrated. A cascade is only as good as its ability to notice its own mistakes.

05 / 07
Speculative decoding · routing at the token level

Draft fast, verify in one pass.

The one technique with zero quality cost. A small draft model guesses several tokens; the big model checks them all in a single parallel forward pass. Accepted tokens come almost free: the draft-verify-accept loop emits several tokens per expensive target pass.

FIG.05 — DRAFT (dashed) → VERIFY → ACCEPT (solid) / REJECT (struck)
target model output, built one verify-round at a time
[Interactive figure: a slider for the acceptance rate α, from low (tiny draft) to high (well-aligned draft). Readouts: acceptance rate α, tokens produced, target forward passes, effective speedup.]
draft q guesses γ tokens (fast, serial)
target p scores all γ in one pass
accept token x with prob min(1, p(x)/q(x))
on first reject → resample from the normalized residual max(0, p − q), discard the rest

The trick exploits that inference is memory-bandwidth-bound: verifying γ tokens in one forward pass costs barely more than generating one. The modified rejection-sampling scheme makes the output provably identical in distribution to sampling the target model alone: same temperature, same everything. You reproduce the big model exactly, just faster.
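One verify round of the accept/reject scheme can be sketched over toy distributions. The sketch assumes each distribution is a plain list of probabilities over a tiny vocabulary and that the target has already scored all γ + 1 positions in one pass; `speculative_round` is a hypothetical name, not a library call.

```python
import random
from typing import List


def speculative_round(draft_tokens: List[int], q_dists: List[List[float]],
                      p_dists: List[List[float]], rng: random.Random) -> List[int]:
    """Run one draft-verify round and return the tokens actually emitted.

    draft_tokens: gamma token ids sampled from the draft model.
    q_dists[i]:   draft distribution at position i (gamma entries).
    p_dists[i]:   target distribution at position i (gamma + 1 entries).
    """
    out: List[int] = []
    for i, x in enumerate(draft_tokens):
        p, q = p_dists[i], q_dists[i]
        # Accept the drafted token with probability min(1, p(x) / q(x)).
        if rng.random() < min(1.0, p[x] / q[x]):
            out.append(x)
            continue
        # First rejection: resample from the residual max(0, p - q), drop the rest.
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        out.append(rng.choices(range(len(p)), weights=residual)[0])
        return out
    # Every draft token accepted: take one extra token from the target's last position.
    last = p_dists[-1]
    out.append(rng.choices(range(len(last)), weights=last)[0])
    return out


rng = random.Random(0)
same = [0.5, 0.5]
# Draft and target agree exactly, so both drafted tokens are accepted.
agree = speculative_round([0, 1], [same, same], [same, same, same], rng)
# Draft is certain of token 0, target is certain of token 1: reject and resample.
clash = speculative_round([0], [[1.0, 0.0]], [[0.0, 1.0], same], rng)
```

Every round emits at least one token, so a poor draft slows generation down but never changes what the target model would have sampled.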

Speedup is governed by the acceptance rate α (a different α from the router threshold in 03): more agreement → more tokens per target pass → fewer expensive passes. Reported 2–3× wall-clock with identically distributed outputs. The sweet spot is a same-family draft ~10–20× smaller. The catch: you need your own weights and the memory for two models, so it does nothing for a closed API.
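The link between acceptance rate and tokens per target pass has a closed form if you assume each drafted token is accepted independently with probability α: the expected count is (1 − α^(γ+1)) / (1 − α). The sketch below just evaluates that expression; the independence assumption is a simplification, and the function name is hypothetical.

```python
def expected_tokens_per_pass(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target forward pass with gamma drafted tokens.

    Assumes each draft token is accepted independently with probability alpha.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if alpha == 1.0:
        return float(gamma + 1)  # every draft token accepted, plus the bonus token
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


useless_draft = expected_tokens_per_pass(0.0, 4)   # one token per pass, no gain
decent_draft = expected_tokens_per_pass(0.8, 4)
perfect_draft = expected_tokens_per_pass(1.0, 4)
```

This counts tokens, not wall-clock time: the real speedup is lower because the draft model's own passes are not free.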

06 / 07
OpenAlice's cost ladder · the same idea, generalized

Push every check as far down the ladder as it goes.

OpenAlice practices this philosophy under a different name. The rule — spend the cheapest resource that still answers well — generalizes from model choice to the whole improvement loop. A change enters at the top and falls as far down as it can before any token is spent.

FIG.06 — THE COST LADDER · cheaper rungs first · click a rung

PER-TURN SELECTION

= binary routing

A RouteLLM-style router could send easy companion-chat turns to a cheaper model, reserving the frontier model for hard reasoning and coding turns.

HEURISTIC DISTILL

= FrugalGPT approximation

Distilling a hot-path LLM call (e.g. mood detection) into a cheap classifier is textbook "LLM approximation" — but soul-adjacent and gated on behavioral-envelope tests.

PROBE GATE

= a cascade

"Try the token-free path; if it can't decide, escalate to a live probe" is a cascade with a deferral judge — OpenAlice's α is the change scope, not a sigmoid.

07 / 07
Honest limits · where the headline numbers break

The router is itself a model with error.

Every saving here is a bet, and the bets can lose. Before you ship a router, internalize the failure modes — they're the difference between cutting cost and silently shipping bad answers.

HEADLINES ARE BEST-CASE

"98% cheaper" / "3.66×" come from favourable distributions (chat, MT-Bench). On uniformly-hard or out-of-distribution traffic the gains shrink toward 1× — and a mis-calibrated router can be worse than always-strong, shipping wrong answers at the weak price.

Confidence estimation is unsolved. Cascades live or die on knowing when the cheap answer is wrong. Verbalized confidence is poorly calibrated; probes need training data and degrade out-of-domain. There is no cheap, general "is this answer good?" oracle yet — that's open research, not a config flag.

ROUTERS DRIFT

A router is trained against specific models. Swap the weak model, change a system prompt, or let the API update silently, and its win-probability estimates rot. Treat the router as a model with a maintenance burden, not a static rule.

"Same quality" hides distribution shift: aggregate quality can match while the failure modes change — a router that offloads to a weak model may be fine on average but systematically worse on a sub-population (e.g. a non-English cohort) the aggregate metric hides. And non-determinism means you need fixed-seed golden traces to measure routers honestly at all.
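A per-cohort breakdown is one way to surface what the aggregate hides. The sketch below assumes each evaluation record carries a cohort label and a correct/incorrect outcome for both the always-strong baseline and the routed system; the record layout and the data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Record = Tuple[str, bool, bool]  # (cohort, baseline correct?, routed correct?)


def quality_by_cohort(records: Iterable[Record]) -> Dict[str, Tuple[float, float]]:
    """Return {cohort: (baseline accuracy, routed accuracy)}."""
    totals = defaultdict(lambda: [0, 0, 0])  # count, baseline hits, routed hits
    for cohort, base_ok, routed_ok in records:
        t = totals[cohort]
        t[0] += 1
        t[1] += bool(base_ok)
        t[2] += bool(routed_ok)
    return {c: (b / n, r / n) for c, (n, b, r) in totals.items()}


# Made-up evaluation set: aggregate accuracy is 8/10 for both systems.
records = [("en", i < 6, i < 7) for i in range(8)] + \
          [("other", True, i < 1) for i in range(2)]

aggregate_baseline = sum(b for _, b, _ in records) / len(records)
aggregate_routed = sum(r for _, _, r in records) / len(records)
per_cohort = quality_by_cohort(records)
```

In this invented data both systems score 0.8 overall, while the routed system drops from 1.0 to 0.5 on the smaller cohort.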

| Technique | Free lunch? | Needs own weights? | Biggest risk |
|---|---|---|---|
| Routing | no, trades quality for cost | no | blind misjudgement → bad answer at weak price |
| Cascading | no, pays twice on the tail | no | mis-calibrated confidence + P99 latency |
| Speculative | yes, provably exact | yes (+ memory for 2 models) | speedup collapses if the draft diverges from the target distribution |
03 · 04 — you made it

You can route a fleet of models.

The skew that makes it pay. Routing before, cascading after, speculation per token. One dial sliding cost against quality, and a router that is itself a model you must maintain. The cheapest resource that still answers well — that's the whole craft.

Next · 03 · 05

Scaling laws →

Routing decides which model. Scaling laws decide how good any given model can be for a fixed compute budget — the curve the whole fleet sits on.
