OpenAlice Academy — 03 · 04 / Model Routing

01 / 07

The business case · difficulty is skewed

Most queries are easy. That's the whole edge.

The economic premise is empirical and robust: a large fraction of real traffic is easy — chit-chat, formatting, lookups, simple rewrites — and a cheap model answers it just as well. The expensive model only earns its price on the hard tail. Drag the difficulty mix and watch the bill move.

FIG.01 — TRAFFIC HISTOGRAM · always-frontier vs routed cost (interactive; the slider runs from ALL HARD to ALL EASY; readouts: easy share → cheap model, hard tail → frontier, always-frontier cost, routed cost, savings)

A system that always calls the frontier model pays the tail price for every token. Routing pays it only on the tail. The further right you drag — the more your real traffic is easy — the bigger the prize.

This is why routing shines on conversational benchmarks (chat is mostly easy) and barely moves on uniformly-hard math: there's no easy traffic to offload. Measure your distribution before believing any headline number.
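A back-of-envelope version of the figure's arithmetic, as a minimal sketch. It assumes a perfect router, equal token counts per query, and made-up per-query prices (`CHEAP`, `FRONTIER`); none of these numbers come from a real price list.

```python
def routed_cost(easy_share: float, cheap_price: float, frontier_price: float) -> float:
    """Average cost per query when the easy share goes to the cheap model."""
    if not 0.0 <= easy_share <= 1.0:
        raise ValueError("easy_share must be in [0, 1]")
    return easy_share * cheap_price + (1.0 - easy_share) * frontier_price


def savings_factor(easy_share: float, cheap_price: float, frontier_price: float) -> float:
    """How many times cheaper routing is than always calling the frontier model."""
    return frontier_price / routed_cost(easy_share, cheap_price, frontier_price)


# Hypothetical prices per query: the frontier model costs 20x the cheap one.
CHEAP, FRONTIER = 1.0, 20.0
for share in (0.0, 0.5, 0.8, 1.0):
    factor = savings_factor(share, CHEAP, FRONTIER)
    print(f"easy share {share:.0%}: {factor:.2f}x cheaper than always-frontier")
```

With no easy traffic the factor is exactly 1×, which is the "uniformly-hard math" case: nothing to offload, nothing to save.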

02 / 07

Three orthogonal mechanisms · they compose

Routing, cascading, and speculation.

These three are lumped together as "cost optimization" but they're orthogonal and stack. They differ in when the decision is made — before generating, after generating, or per token.

ROUTING · BEFORE

predictive · pick once

A learned router reads the query and picks one model before any generation. Cheapest to run — but blind, it bets without seeing output.

CASCADING · AFTER

reactive · escalate

Generate with the cheap model first, score the answer, and escalate only if unsure. Self-correcting — but on hard queries you pay twice.

SPECULATIVE · PER TOKEN

exact · free lunch

A small model drafts tokens; the big model verifies them in one pass. 2–3× faster, identical output — provably the same distribution.

Layer	Decision	When	Reported gain	Quality risk
Routing	which model gets the query?	before generation	up to ~3.7× cheaper	router misjudges → wrong model
Cascading	good enough, or escalate?	after each generation	up to ~98% lower cost	latency + double-spend on hard queries
Speculative	accept these draft tokens?	per token, within one generation	2–3× faster (latency, not cost)	none — provably exact

03 / 07

RouteLLM · a win-probability model + one dial

The router is just a threshold.

A learned router scores each query's win-probability for the strong model, then routes by a single threshold α. Drag α — the cost dial — and watch traffic flow toward the cheap or the frontier model, with quality recovered (PGR) and cost tracking live.

FIG.03 — BINARY ROUTER · 24 QUERIES · α THRESHOLD (interactive; queries sorted by P(strong wins), bar = the α cutoff; with the rule "strong if P ≥ α" the slider runs from α=0 · all frontier to α=1 · all cheap; readouts: α, frontier calls, cheap calls, PGR (quality recovered), cost vs always-frontier)

P(win_strong | q) = σ( s(M_strong, q) − s(M_weak, q) )    // a Bradley-Terry sigmoid over query q
route → strong if P ≥ α, else → weak
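The rule above fits in a few lines. This is a minimal sketch, not RouteLLM's code: `score_strong` and `score_weak` stand in for whatever a learned scorer outputs for s(M_strong, q) and s(M_weak, q), and the function names are made up for illustration.

```python
import math


def p_strong_wins(score_strong: float, score_weak: float) -> float:
    """Bradley-Terry win probability: a sigmoid of the score difference."""
    return 1.0 / (1.0 + math.exp(-(score_strong - score_weak)))


def route(score_strong: float, score_weak: float, alpha: float) -> str:
    """Send the query to the strong model only if P(strong wins) >= alpha."""
    return "strong" if p_strong_wins(score_strong, score_weak) >= alpha else "weak"


# Hypothetical scores for two queries, routed at the same threshold.
print(route(2.0, 0.0, alpha=0.7))  # large score gap: clears the threshold
print(route(0.2, 0.0, alpha=0.7))  # small score gap: stays on the weak model
```

Under this rule, raising `alpha` sends fewer queries to the strong model, so the threshold acts directly as the cost dial.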

The best RouteLLM router uses matrix factorization borrowed from recommender systems — models × queries, exactly like Netflix's users × movies. It's trained on human preference votes from Chatbot Arena, not hand-coded difficulty rules.

α is the only knob. You calibrate it on a validation set to hit a target like "I'll send 30% of traffic to the frontier model." The headline: ≈3.66× cheaper on MT-Bench at 95% quality, only ≈1.4× on hard math — and routers transferred to an unseen model pair without retraining.
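Calibrating to a traffic target is a quantile lookup. The sketch below assumes you already have the router's win-probabilities for a validation set (the list here is invented) and that the routing rule is "strong if P ≥ α"; ties at the cutoff can push the realized share slightly above the target.

```python
def calibrate_alpha(val_probs, target_strong_share):
    """Pick the threshold that routes about target_strong_share of queries to strong."""
    if not val_probs:
        raise ValueError("need at least one validation probability")
    if not 0.0 < target_strong_share <= 1.0:
        raise ValueError("target_strong_share must be in (0, 1]")
    ranked = sorted(val_probs, reverse=True)
    # The k-th highest probability becomes the cutoff, so the top k clear it.
    k = max(1, round(target_strong_share * len(ranked)))
    return ranked[k - 1]


def strong_share(probs, alpha):
    """Fraction of queries that would be routed to the strong model."""
    return sum(p >= alpha for p in probs) / len(probs)


# Invented validation win-probabilities, for illustration only.
val = [0.91, 0.15, 0.62, 0.33, 0.78, 0.05, 0.49, 0.27, 0.84, 0.40]
alpha = calibrate_alpha(val, target_strong_share=0.30)
print(alpha, strong_share(val, alpha))
```

The threshold only holds its meaning while production traffic looks like the validation set, so the realized share is worth monitoring after deployment.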

PGR (Performance Gap Recovered) = (router − weak) / (strong − weak). 0.0 = weak-model quality, 1.0 = strong-model quality, 0.5 = halfway between. CPT(x%) is the share of calls that must go to the strong model to reach a PGR of x%. CPT(50%) ≈ 37% on MT-Bench: half the quality gap recovered while sending only 37% of traffic to the frontier model.
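Both metrics are easy to compute from a threshold sweep. This sketch reads CPT(x%) as the smallest share of strong-model calls that reaches a PGR of x%; the benchmark scores and sweep points are invented for illustration and are not RouteLLM's measurements.

```python
def pgr(router_score: float, weak_score: float, strong_score: float) -> float:
    """Performance Gap Recovered: 0.0 at weak quality, 1.0 at strong quality."""
    if strong_score == weak_score:
        raise ValueError("strong and weak scores must differ")
    return (router_score - weak_score) / (strong_score - weak_score)


def cpt(sweep, weak_score, strong_score, target_pgr):
    """Smallest strong-call share in the sweep whose PGR reaches target_pgr."""
    reaching = [
        share
        for share, score in sweep
        if pgr(score, weak_score, strong_score) >= target_pgr
    ]
    return min(reaching) if reaching else None


# Invented sweep: (share of calls sent to strong, benchmark score of the routed system).
WEAK, STRONG = 6.0, 9.0
sweep = [(0.0, 6.0), (0.2, 7.0), (0.37, 7.5), (0.6, 8.4), (1.0, 9.0)]
print(pgr(7.5, WEAK, STRONG))        # halfway between weak and strong
print(cpt(sweep, WEAK, STRONG, 0.5)) # smallest share that recovers half the gap
```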

04 / 07

FrugalGPT · generate first, then decide

Or: try cheap, escalate if unsure.

A cascade flips the order: hit the cheapest model, score the answer's reliability, and escalate to a stronger model only if it fails the bar. Press Run → on queries of different difficulty and watch the cascade climb — paying as it goes.

FIG.04 — LLM CASCADE · cheap → mid → frontier, with a stop-judge (interactive; sample queries: easy "what's 2+2?", medium "rewrite a paragraph", hard "prove a theorem"; readouts: models called, total cost paid, vs straight-to-frontier)

// one cascade step
1. ask the cheapest remaining model
2. a scorer rates the answer's reliability
3. stop-judge: score ≥ threshold? yes → return · no → escalate
↑ repeat up the ladder
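The cascade step can be sketched as a loop over a price-ordered ladder. Everything concrete here is a stand-in: the three "models" are stub functions, `toy_scorer` is a placeholder for a real reliability scorer, and the costs are arbitrary units rather than FrugalGPT's numbers.

```python
from typing import Callable, List, Tuple

Model = Tuple[str, float, Callable[[str], str]]  # (name, cost per call, generate)


def cascade(query: str, ladder: List[Model], scorer, threshold: float,
            scorer_cost: float = 0.0):
    """Climb the ladder until an answer clears the threshold.

    Returns (answer, total cost paid, names of the models called).
    """
    if not ladder:
        raise ValueError("ladder must contain at least one model")
    total, called, answer = 0.0, [], ""
    for i, (name, cost, generate) in enumerate(ladder):
        answer = generate(query)
        total += cost
        called.append(name)
        if i == len(ladder) - 1:
            break  # top rung: nothing left to escalate to, so skip the scorer
        total += scorer_cost
        if scorer(query, answer) >= threshold:
            break  # stop-judge says the answer is good enough
    return answer, total, called


# Stub models and scorer, purely illustrative.
def cheap(q): return "4" if "2+2" in q else "unsure"
def mid(q): return "rewritten paragraph" if q.startswith("rewrite") else "unsure"
def frontier(q): return "a careful answer"
def toy_scorer(q, a): return 0.1 if a == "unsure" else 0.9


LADDER = [("cheap", 1.0, cheap), ("mid", 5.0, mid), ("frontier", 20.0, frontier)]
for q in ("what's 2+2?", "rewrite a paragraph", "prove a theorem"):
    _, cost, called = cascade(q, LADDER, toy_scorer, threshold=0.8, scorer_cost=0.1)
    print(f"{q!r}: called {called}, paid {cost:.1f}")
```

In this toy setup the hard query ends up paying for all three rungs plus two scorer calls, which is more than calling the frontier model directly.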

FrugalGPT reports matching GPT-4 at up to 98% lower cost on favourable benchmarks — because the easy majority never leaves the cheap tier. But on a hard query you pay the weak model and the strong one and the scorer: net more expensive than going straight to frontier.
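The double-spend can be put in numbers for a two-tier cascade. This sketch assumes fixed per-call costs (the values are hypothetical) and writes `p_stop` for the share of queries whose cheap answer clears the bar. Expected cost is weak + scorer + (1 − p_stop) × strong, so the cascade beats always-strong only when p_stop exceeds (weak + scorer) / strong.

```python
def cascade_expected_cost(p_stop: float, weak_cost: float,
                          scorer_cost: float, strong_cost: float) -> float:
    """Expected per-query cost of a two-tier cascade (weak first, then strong)."""
    if not 0.0 <= p_stop <= 1.0:
        raise ValueError("p_stop must be in [0, 1]")
    # Every query pays weak + scorer; only the escalated share pays strong.
    return weak_cost + scorer_cost + (1.0 - p_stop) * strong_cost


def break_even_stop_rate(weak_cost: float, scorer_cost: float,
                         strong_cost: float) -> float:
    """Stop rate above which the cascade is cheaper than always calling strong."""
    return (weak_cost + scorer_cost) / strong_cost


# Hypothetical costs in arbitrary units.
WEAK_COST, SCORER_COST, STRONG_COST = 1.0, 0.1, 20.0
print(cascade_expected_cost(0.0, WEAK_COST, SCORER_COST, STRONG_COST))  # all-hard traffic
print(cascade_expected_cost(0.9, WEAK_COST, SCORER_COST, STRONG_COST))  # mostly easy traffic
print(break_even_stop_rate(WEAK_COST, SCORER_COST, STRONG_COST))
```

The model ignores latency: an escalated query also waits for two generations in sequence, which the cost figure does not capture.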

The hard, unsolved part is the deferral estimate — knowing the cheap answer is wrong without an oracle. Trained hidden-state probes give the most reliable confidence; verbalized "I'm 90% sure" is poorly calibrated. A cascade is only as good as its ability to notice its own mistakes.

05 / 07

Speculative decoding · routing at the token level

Draft fast, verify in one pass.

The one technique with zero quality cost. A small draft model guesses several tokens; the big model checks them all in a single parallel forward pass. Accepted tokens come free. Press Decode → and watch the draft-verify-accept loop run, token by token.

FIG.05 — DRAFT (dashed) → VERIFY → ACCEPT (solid) / REJECT (struck) (interactive; target model output, built one verify-round at a time; the slider runs from α low · tiny draft to α high · aligned draft; readouts: acceptance rate α, tokens produced, target forward passes, effective speedup)

draft q guesses γ tokens (fast, serial)
target p scores all γ in one pass
accept token x with prob min(1, p(x)/q(x))
on first reject → resample, discard the rest
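The accept/reject rule for a single token position can be sketched with plain lists as distributions. This is an illustration, not a full decoder: it omits the γ-token loop and the KV cache, the toy vocabulary has three tokens, and the resampling step uses the standard residual distribution max(0, p − q), renormalized. `output_distribution` computes the result in closed form so you can check that it equals the target p.

```python
import random


def residual(p, q):
    """Normalized max(0, p - q): the distribution to resample from after a rejection."""
    raw = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(raw)
    # total is 0 only when p == q, in which case a draft token is never rejected.
    return [r / total for r in raw] if total > 0 else list(p)


def speculative_token(p, q, rng=random):
    """Propose a token from draft q, accept with prob min(1, p/q), else resample."""
    x = rng.choices(range(len(q)), weights=q)[0]
    if rng.random() < min(1.0, p[x] / q[x]):
        return x, True
    y = rng.choices(range(len(p)), weights=residual(p, q))[0]
    return y, False


def output_distribution(p, q):
    """Exact distribution of the token returned by speculative_token."""
    accept = [qi * min(1.0, pi / qi) if qi > 0 else 0.0 for pi, qi in zip(p, q)]
    reject_mass = 1.0 - sum(accept)
    return [a + reject_mass * r for a, r in zip(accept, residual(p, q))]


# Toy target (p) and draft (q) distributions over a 3-token vocabulary.
p = [0.6, 0.3, 0.1]
q = [0.3, 0.3, 0.4]
print(output_distribution(p, q))  # matches p up to float rounding
print(speculative_token(p, q, random.Random(0)))
```

The draft only changes how often a proposal is accepted, never what distribution the final token follows.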

The trick exploits the fact that autoregressive decoding is typically memory-bandwidth-bound: verifying γ tokens in one forward pass costs barely more than generating one. The modified rejection-sampling scheme makes the output distribution provably identical to sampling from the target model alone, at the same temperature and sampling settings. You reproduce the big model exactly, just faster.

Speedup is governed by the acceptance rate α (a different α from the routing threshold in section 03): more agreement → more tokens per target pass → fewer expensive passes. Reported 2–3× wall-clock with identical outputs. The sweet spot is a same-family draft ~10–20× smaller. The catch: you need your own weights and the memory for two models, so it does nothing for a closed API.
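How the acceptance rate turns into tokens per pass has a closed form under one simplifying assumption: each draft token is accepted independently with the same probability α. With γ drafted tokens, the expected number of tokens produced per target pass is then (1 − α^(γ+1)) / (1 − α), counting the one token the target always contributes. A minimal sketch:

```python
def expected_tokens_per_pass(alpha: float, gamma: int) -> float:
    """Expected tokens per target forward pass, assuming i.i.d. acceptance with rate alpha."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if alpha == 1.0:
        return float(gamma + 1)  # every draft accepted, plus one token from the target
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


for a in (0.0, 0.5, 0.8, 0.95):
    print(f"alpha={a:.2f}, gamma=4: {expected_tokens_per_pass(a, 4):.2f} tokens per pass")
```

This counts tokens per target pass, not wall-clock speedup: the draft model's own serial passes eat into the gain, which is one reason a much smaller draft is preferred.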

06 / 07

OpenAlice's cost ladder · the same idea, generalized

Push every check as far down the ladder as it goes.

OpenAlice practices this philosophy under a different name. The rule — spend the cheapest resource that still answers well — generalizes from model choice to the whole improvement loop. A change enters at the top and falls as far down as it can before any token is spent.

FIG.06 — THE COST LADDER · cheaper rungs first · click a rung

Click a rung to see how it maps to the routing literature.

PER-TURN SELECTION

= binary routing

A RouteLLM-style router could send easy companion-chat turns to a cheaper model, reserving the frontier model for hard reasoning and coding turns.

HEURISTIC DISTILL

= FrugalGPT approximation

Distilling a hot-path LLM call (e.g. mood detection) into a cheap classifier is textbook "LLM approximation" — but soul-adjacent and gated on behavioral-envelope tests.

PROBE GATE

= a cascade

"Try the token-free path; if it can't decide, escalate to a live probe" is a cascade with a deferral judge — OpenAlice's α is the change scope, not a sigmoid.

07 / 07

Honest limits · where the headline numbers break

The router is itself a model with error.

Every saving here is a bet, and the bets can lose. Before you ship a router, internalize the failure modes — they're the difference between cutting cost and silently shipping bad answers.

HEADLINES ARE BEST-CASE

"98% cheaper" / "3.66×" come from favourable distributions (chat, MT-Bench). On uniformly-hard or out-of-distribution traffic the gains shrink toward 1× — and a mis-calibrated router can be worse than always-strong, shipping wrong answers at the weak price.

Confidence estimation is unsolved. Cascades live or die on knowing when the cheap answer is wrong. Verbalized confidence is poorly calibrated; probes need training data and degrade out-of-domain. There is no cheap, general "is this answer good?" oracle yet — that's open research, not a config flag.

ROUTERS DRIFT

A router is trained against specific models. Swap the weak model, change a system prompt, or let the API update silently, and its win-probability estimates rot. Treat the router as a model with a maintenance burden, not a static rule.

"Same quality" hides distribution shift: aggregate quality can match while the failure modes change — a router that offloads to a weak model may be fine on average but systematically worse on a sub-population (e.g. a non-English cohort) the aggregate metric hides. And non-determinism means you need fixed-seed golden traces to measure routers honestly at all.

Technique	Free lunch?	Needs own weights?	Biggest risk
Routing	no — trades quality for cost	no	blind misjudgement → bad answer at weak price
Cascading	no — pays twice on the tail	no	mis-calibrated confidence + P99 latency
Speculative	yes — provably exact	yes (+ memory for 2 models)	speedup (not correctness) collapses when the draft diverges from the target
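The sub-population caveat above is cheap to check once evaluation records carry a cohort label. The sketch below uses fabricated toy data (cohort names and counts are invented) in which the routed system matches the aggregate accuracy of an assumed 0.90 always-strong baseline while one cohort does far worse.

```python
from collections import defaultdict


def accuracy_by_cohort(records):
    """records: iterable of (cohort, is_correct). Returns (overall, per-cohort) accuracy."""
    totals, hits = defaultdict(int), defaultdict(int)
    for cohort, ok in records:
        totals[cohort] += 1
        hits[cohort] += int(ok)
    if not totals:
        raise ValueError("no records")
    overall = sum(hits.values()) / sum(totals.values())
    return overall, {c: hits[c] / totals[c] for c in totals}


# Fabricated results for a routed system: 80 English queries, 20 from another cohort.
routed = (
    [("en", True)] * 76 + [("en", False)] * 4
    + [("other", True)] * 14 + [("other", False)] * 6
)
overall, per_cohort = accuracy_by_cohort(routed)
print(overall)     # aggregate looks healthy
print(per_cohort)  # the smaller cohort carries most of the errors
```

Reporting the per-cohort table next to the aggregate is the simplest guard against a router that is fine on average and wrong for a specific group.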

03 · 04 — you made it

You can route a fleet of models.

The skew that makes it pay. Routing before, cascading after, speculation per token. One dial sliding cost against quality, and a router that is itself a model you must maintain. The cheapest resource that still answers well — that's the whole craft.
