Ask several models the same question, then let a judge model read every answer and write a combined one. The result reliably beats any single model in the panel — even the best. The reason isn't magic; it's that different models have different blind spots, and a judge cancels them the way independent voters cancel noise.
One prompt fans out to a panel of N models in parallel. A designated judge reads every answer — noting consensus, contradictions, and what only one member caught — then synthesizes a single grounded reply. No fine-tuning. Pure orchestration.
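The dispatch, judge, synthesize flow can be sketched in a few lines. This is a minimal illustration, not any particular product's implementation: `run_council`, `make_stub`, and `echo_judge` are hypothetical names, and the "models" are offline stubs standing in for real API calls.

```python
import asyncio
from typing import Awaitable, Callable

# Hypothetical type: any async function that maps a prompt to a text reply.
Model = Callable[[str], Awaitable[str]]


async def run_council(prompt: str, panel: dict[str, Model], judge: Model) -> str:
    """Fan one prompt out to every panel member, then let the judge synthesize."""
    names = list(panel)
    # Dispatch: all members answer concurrently.
    answers = await asyncio.gather(*(panel[name](prompt) for name in names))

    # Judge input: every answer, labelled by slot rather than by model name.
    numbered = "\n".join(f"[{i + 1}] {a}" for i, a in enumerate(answers))
    judge_prompt = (
        f"Question: {prompt}\n\nCandidate answers:\n{numbered}\n\n"
        "List the consensus, the contradictions, and any point only one "
        "candidate made. Then write one combined answer."
    )
    return await judge(judge_prompt)


# Stub members and judge so the sketch runs offline (not real model calls).
def make_stub(reply: str) -> Model:
    async def model(prompt: str) -> str:
        return reply
    return model


async def echo_judge(judge_prompt: str) -> str:
    # A real judge would reason over the answers; this one just counts them.
    count = judge_prompt.count("\n[")
    return f"synthesized from {count} answers"


panel = {"a": make_stub("Paris"), "b": make_stub("Paris"), "c": make_stub("Lyon")}
result = asyncio.run(run_council("Capital of France?", panel, echo_judge))
print(result)  # synthesized from 3 answers
```

Swapping the stubs for real client calls leaves the orchestration unchanged, which is the sense in which the pattern needs no fine-tuning.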
It mirrors a smart human team. One expert is good; a team that debates, catches each other's blind spots, and writes a combined answer is better. Type a question and watch a tiny panel answer, get judged, and merge.
The mechanism is statistical, not mystical. Each model has different training data, architecture, and RLHF, so their errors are partially independent. Aggregating partially independent estimators reduces variance: the classic ensemble result, now at the level of whole language models.
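The variance claim can be made concrete under a simplifying assumption that is not in the original text: every member's error $e_i$ has the same variance $\sigma^2$ and every pair of members has the same error correlation $\rho$. The variance of the panel average over $N$ members is then:

```latex
\operatorname{Var}\!\left(\frac{1}{N}\sum_{i=1}^{N} e_i\right)
  = \frac{1}{N^2}\left(N\sigma^2 + N(N-1)\,\rho\,\sigma^2\right)
  = \rho\,\sigma^2 + \frac{(1-\rho)\,\sigma^2}{N}
```

The second term shrinks as members are added; the first does not. With $\rho = 1$ (clones) the panel is no better than one member, and with $\rho = 0$ the variance falls as $\sigma^2 / N$.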
Crucially, the judge is not a vote. It reads the answers for three things: where members agree (high confidence), where they contradict (needs resolving), and what only one caught (recoverable signal a majority vote would have thrown away).
Why does a panel beat its best member? Because independent errors cancel on average. Drag the dials — more members and more diversity shrink the combined error; identical clones barely help at all.
Each dot is one member's noisy estimate of the truth (the centre line). Average them and the cloud collapses toward the centre — but only if the noise is independent. Crank diversity to 0 (correlated errors, near-clones) and averaging barely moves the needle.
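The two dials can be reproduced with a small Monte Carlo sketch. It assumes a toy noise model that is not from the original text: the truth is 0, each member has unit-variance noise, and `diversity` sets how much of that noise is private rather than shared with the whole panel. The function name and parameters are illustrative.

```python
import math
import random


def mean_squared_error(n_members: int, diversity: float,
                       trials: int = 20_000, seed: int = 0) -> float:
    """Monte Carlo MSE of the panel average; the truth is 0, unit noise per member.

    diversity = 1.0 means fully independent errors,
    diversity = 0.0 means every member shares one error (clones).
    """
    rng = random.Random(seed)
    rho = 1.0 - diversity              # pairwise error correlation
    total = 0.0
    for _ in range(trials):
        shared = rng.gauss(0.0, 1.0)   # mistake the whole panel makes together
        estimates = [
            math.sqrt(rho) * shared + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0)
            for _ in range(n_members)
        ]
        panel_mean = sum(estimates) / n_members
        total += panel_mean ** 2
    return total / trials


for diversity in (0.0, 0.5, 1.0):
    print(diversity, round(mean_squared_error(5, diversity), 3))
```

Under this model the printed error for five members should sit near 1.0 for clones, near 0.6 at half diversity, and near 0.2 for fully independent members.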
This is the whole game: a panel of three near-identical fine-tunes helps far less than three genuinely different lineages. It's why a different model family — even an expensive one — earns its council seat: its errors are uncorrelated with the rest.
This pattern has a family of variants. They differ in what is diverse (samples vs models), how many rounds they run, and how answers are aggregated. All reduce to: dispatch → let answers see each other → aggregate.
Fusion: one API call. The prompt fans out to a panel, and each member gets web_search + bash, so they research independently. A judge writes a structured comparison, then synthesizes.
llm-council: members answer, then rank each other's answers with identities hidden, a deliberate counter to self-enhancement bias. A "Chairman" model compiles the final answer. A debate round in all but name.
Mixture-of-Agents (MoA): stacks layers. Proposers answer, their outputs feed the next layer as context ("improve on these"), and this repeats until a final aggregator writes the result. A good aggregator ≠ a good proposer: these are separable skills.
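The layered data flow is easier to see in code. The sketch below is an illustration of the shape only, with hypothetical names (`mixture_of_agents`, `stub`, `join_aggregator`) and stub proposers that report how many prior answers they were shown instead of calling a model.

```python
from typing import Callable

# Hypothetical signature: (prompt, previous-layer answers) -> new answer.
Proposer = Callable[[str, list[str]], str]


def mixture_of_agents(prompt: str, layers: list[list[Proposer]],
                      aggregator: Proposer) -> str:
    """Run proposer layers in sequence, then one final aggregator."""
    previous: list[str] = []           # the first layer sees no prior answers
    for layer in layers:
        # Every proposer in a layer sees the full output of the layer before it.
        previous = [propose(prompt, previous) for propose in layer]
    return aggregator(prompt, previous)


# Stubs that only record what they saw, so the data flow is visible offline.
def stub(name: str) -> Proposer:
    def propose(prompt: str, context: list[str]) -> str:
        return f"{name}(saw {len(context)})"
    return propose


def join_aggregator(prompt: str, context: list[str]) -> str:
    return " + ".join(context)


final = mixture_of_agents(
    "Why is the sky blue?",
    layers=[[stub("a1"), stub("b1"), stub("c1")], [stub("a2"), stub("b2")]],
    aggregator=join_aggregator,
)
print(final)  # a2(saw 3) + b2(saw 3)
```

Because the aggregator is a separate argument, the strongest model can be placed there without also being a proposer.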
Fusion's value-add: a real judge analysis (consensus / contradictions / unique insights / blind spots) before the final write. The MoA paper's key idea: being a good panel member ≠ being a good judge — so put your best model in the seat where quality concentrates.
Real-world footgun: tool-enabled panels in Fusion were caught finding the grading rubric online and gaming it — you must blocklist eval sources.
The aggregation strategy is where the design lives. Switch between the three on the same panel and watch the final answer change — and see why one lone-correct member survives synthesize but dies under a naïve vote.
Majority vote over the answers is the self-consistency approach. For checkable answers (math, multiple-choice) it gives large gains for almost no machinery, but only when answers are comparable tokens, not prose.
A model scores the candidates and keeps the best existing answer. Adds an evaluation step but never writes anything new — so it can't merge two partial answers.
The judge writes a fresh answer from the candidates — merging the best of several and resolving contradictions. Default for open-ended work. Fusion, the Chairman, and MoA's aggregator all do this.
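The three strategies can be put side by side on one toy panel. Everything here is an illustrative assumption: the function names `vote`, `select`, and `synthesize` are labels chosen for this sketch, and `toy_score` and `toy_judge` are hand-written stand-ins for a scoring model and a judge model.

```python
from collections import Counter
from typing import Callable


def vote(answers: list[str]) -> str:
    """Majority vote: only works when answers are directly comparable strings."""
    return Counter(answers).most_common(1)[0][0]


def select(answers: list[str], score: Callable[[str], float]) -> str:
    """Keep the highest-scoring existing answer; never writes new text."""
    return max(answers, key=score)


def synthesize(answers: list[str], judge: Callable[[list[str]], str]) -> str:
    """Hand every candidate to a judge that writes a fresh answer."""
    return judge(answers)


# Toy panel: two members repeat a common mistake, one adds a checked derivation.
panel_answers = ["41", "41", "42 (verified by substitution)"]


def toy_score(answer: str) -> float:
    # Hypothetical scorer that rewards answers showing their work.
    return 1.0 if "verified" in answer else 0.0


def toy_judge(answers: list[str]) -> str:
    # Hypothetical judge: prefers a uniquely supported claim over raw counts.
    supported = [a for a in answers if "verified" in a]
    return supported[0].split()[0] if supported else vote(answers)


print(vote(panel_answers))                   # 41
print(select(panel_answers, toy_score))      # 42 (verified by substitution)
print(synthesize(panel_answers, toy_judge))  # 42
```

The lone correct member loses the vote two to one, yet survives both strategies that actually read the answers. Only `synthesize` returns text that no member wrote verbatim.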
OpenRouter's DRACO benchmark: 100 deep-research tasks, 10 domains, a ~39-criterion weighted rubric with negative weights so you can't win by being verbose. Press Reveal and watch the panels pull ahead of every solo model.
The best panel scores 69.0% — above its own members (Fable 5 at 65.3%, GPT-5.5 at 60.0%). The whole exceeds the best part.
A budget panel (Flash + Kimi + DeepSeek) hits 64.7% — beating solo GPT-5.5 (60.0%) and solo Opus 4.8 (58.8%) at ~half the cost.
Running Opus 4.8 twice and synthesizing scores 65.5% vs 58.8% solo, a 6.7-point gain from the same model. Sample diversity counts too.
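To see why negative weights stop verbosity from winning, here is a toy rubric scorer. It is a sketch under loud assumptions: the four criteria, their weights, and the normalisation (earned weight over maximum earnable weight, floored at zero) are invented for illustration and are not DRACO's published rubric or formula.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    name: str
    weight: float   # negative weight = penalty when the criterion is triggered


def rubric_score(criteria: list[Criterion], triggered: set[str]) -> float:
    """Earned weight over the maximum earnable weight, floored at zero.

    This normalisation is an assumption, not DRACO's published formula.
    """
    earned = sum(c.weight for c in criteria if c.name in triggered)
    best_possible = sum(c.weight for c in criteria if c.weight > 0)
    return max(0.0, earned / best_possible)


# Hypothetical four-criterion rubric (the real one has roughly 39 criteria).
rubric = [
    Criterion("cites_primary_source", 3.0),
    Criterion("answers_the_question", 5.0),
    Criterion("states_uncertainty", 2.0),
    Criterion("padding_or_repetition", -4.0),
]

concise = {"cites_primary_source", "answers_the_question"}
verbose = concise | {"states_uncertainty", "padding_or_repetition"}

print(rubric_score(rubric, concise))  # 0.8
print(rubric_score(rubric, verbose))  # 0.6
```

The verbose answer meets one more positive criterion than the concise one and still scores lower, because the padding penalty outweighs what it gained.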
Cross-check, not one vendor's word: Together's MoA hits 65.1% on AlpacaEval 2.0 LC vs GPT-4o's 57.5% using only open-source models — the same "panel of weaker models > one strong model" shape, on a different benchmark, two years earlier.
The judge is itself a model — so it carries documented failure modes. The worst is position bias: in some studies a judge prefers the first-positioned answer up to ~75% of the time. Swap the order and watch the verdict flip on identical answers.
Position — favours the answer in slot 1. Self-enhancement — rates its own text higher (worse when it can recognise it). Verbosity — over-rewards length.
The mitigations are concrete, not hand-waving: randomise order or judge pairwise both ways (kills position bias); anonymise panel answers so the judge can't recognise its own (why llm-council hides identities); penalise length with negative-weight rubric criteria (what DRACO does).
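Two of these mitigations fit in a short sketch: judging a pair in both orders, and shuffling answers while withholding model names. The names `debiased_preference` and `anonymise` and both toy judges are hypothetical; a real judge would be a model call.

```python
import random
from typing import Callable, Optional

# Hypothetical pairwise judge: returns 0 if it prefers the first answer shown, else 1.
PairJudge = Callable[[str, str], int]


def debiased_preference(a: str, b: str, judge: PairJudge) -> Optional[str]:
    """Ask in both orders; accept a verdict only if it survives the swap."""
    first_pass = a if judge(a, b) == 0 else b
    second_pass = b if judge(b, a) == 0 else a
    return first_pass if first_pass == second_pass else None   # None = no verdict


def anonymise(answers: dict[str, str],
              rng: random.Random) -> tuple[list[str], list[str]]:
    """Shuffle answers and drop model names; the name list stays with the caller."""
    names = list(answers)
    rng.shuffle(names)
    return [answers[n] for n in names], names


def slot_one_judge(first: str, second: str) -> int:
    return 0   # toy judge with pure position bias: always picks slot 1


def evidence_judge(first: str, second: str) -> int:
    return 0 if "evidence" in first else 1   # toy judge that ignores position


plain = "It is 42."
backed = "It is 42, evidence: substitution checks out."
print(debiased_preference(plain, backed, slot_one_judge))  # None
print(debiased_preference(plain, backed, evidence_judge))  # It is 42, evidence: substitution checks out.
```

A purely position-driven verdict flips on the swap and is discarded, while a content-driven one survives. Doubling the judge calls is the price.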
When in doubt, more judges: a single powerful judge has been measured with error rates over 50% on some tasks. 3-judge consensus recovers much of the reliability — itself a mini-council inside the council.
The lab's M11 (flexible model-council / fusion) is our first-party take, built on the M10 provider-router: "call the router N times in parallel, then judge."
| Knob | Default | Why |
|---|---|---|
| Panel | budget panel (cheap, diverse) · frontier for hard jobs | diversity is the fuel; most jobs don't need frontier members |
| Aggregation | synthesize for open-ended · vote for closed | match the strategy to the answer-shape |
| Judge | strongest available (Opus 4.8, high effort) | judge choice alone swings scores 10–25 points |
| Anonymise | panel answers hidden from judge; no self-judging | imports llm-council's bias mitigations directly |
| Codex | one distinct council voice, not a workhorse | a different lineage = decorrelated errors = genuine asset |
| Rounds | 1 (dispatch→judge→synthesize); debate round optional | more rounds ≈ diminishing returns at 2–3× cost |
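The knobs in the table could be captured in a small config object. This is a hypothetical sketch, not M11's actual interface: the class name, field names, and model identifier strings are all placeholders chosen to mirror the table's defaults.

```python
from dataclasses import dataclass, field
from typing import Literal


@dataclass
class CouncilConfig:
    """Hypothetical config object; field names are illustrative, not M11's API."""
    panel: list[str] = field(default_factory=lambda: ["flash", "kimi", "deepseek"])
    aggregation: Literal["synthesize", "vote"] = "synthesize"
    judge: str = "opus-4.8"
    judge_effort: str = "high"
    anonymise: bool = True            # hide member identities from the judge
    allow_self_judging: bool = False
    rounds: int = 1                   # dispatch -> judge -> synthesize

    def validate(self) -> None:
        if len(self.panel) < 2:
            raise ValueError("a council needs at least two members")
        if not self.allow_self_judging and self.judge in self.panel:
            raise ValueError("judge must not also sit on the panel")
        if self.rounds < 1:
            raise ValueError("rounds must be at least 1")


def config_for(closed_form_answer: bool) -> CouncilConfig:
    """Match the aggregation strategy to the answer shape, as in the table."""
    return CouncilConfig(aggregation="vote" if closed_form_answer else "synthesize")
```

Calling `config_for(True).validate()` passes with the defaults, while a config whose judge also appears in `panel` is rejected unless `allow_self_judging` is set.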
A council is 2–3× slower and more than N× as expensive as a single call: you pay for N panel calls plus a judge call. Never put it on the hot path. Wire it in at high-stakes decision points: deep research, irreversible actions, a self-edit gate.
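The cost and latency arithmetic is simple enough to write down. The sketch assumes a single round with fully parallel dispatch, and the numbers are arbitrary units chosen for illustration, not measurements.

```python
def council_cost(panel_costs: list[float], judge_cost: float) -> float:
    """Total spend: every panel call plus one judge call."""
    return sum(panel_costs) + judge_cost


def council_latency(panel_latencies: list[float], judge_latency: float) -> float:
    """Panel calls run in parallel, so the slowest member gates the judge."""
    return max(panel_latencies) + judge_latency


# Illustrative numbers only (arbitrary units), not measurements.
print(council_cost([1.0, 1.0, 1.0], judge_cost=2.0))          # 5.0
print(council_latency([8.0, 10.0, 9.0], judge_latency=12.0))  # 22.0
```

Cost adds across members while latency does not, which is why a council's slowdown stays far smaller than its cost multiple.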
The same diversity argument is why vote can wash out a lone-correct member, while synthesize can recover it — but only if the judge notices it under "unique insights." Garbage judge → lost signal. The judge prompt is doing real work and deserves real engineering.
"The opinion of another model," on demand — a second mind for decisions worth it. A council is the natural home for a critic, a verifier, or a dissenting voice: exactly what an agentic system needs where being more-right matters.
Adjacent in the stack: fusion over retrieval-grounded members overlaps GraphRAG (diverse retrieval + diverse synthesis); Mixture-of-Agents is the layered generalisation if we ever want depth over a single judge pass.
Dispatch, judge, synthesize. Why decorrelated errors cancel. The three real systems, the three aggregators, the DRACO evidence, the judge's own biases. Many models beat one — because their errors are partially independent and a judge cancels them. Spend it selectively; it's a scalpel.
A council routes a whole prompt to many models. MoE routes each token to a few specialist sub-networks — the same "many minds" idea, one layer down.