OpenAlice Academy — 03 · 02 / LLM Councils & Fusion

01 / 07

The interface · one prompt, many voices

A council is dispatch, judge, synthesize.

A council mirrors a strong human team. One expert is good; a team whose members debate, catch each other's blind spots, and write a combined answer is better. Type a question and watch a small panel answer, get judged, and merge.

FIG.01 — LIVE COUNCIL · DISPATCH → JUDGE → SYNTHESIZE

PANEL · each member answers independently

dispatch(prompt) → [a₁ … aₙ]     // N answers, in parallel
judge([a₁ … aₙ]) → analysis      // consensus · conflicts · gaps
synthesize(analysis) → final     // one grounded answer

The mechanism is statistical, not mystical. Each model has different training data, architecture, and RLHF, so their errors are only partially correlated. Aggregating partially independent estimators reduces variance. That is the classic ensemble result, now applied at the level of whole language models.
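The variance claim can be made concrete under a simplifying assumption that is not in the original: treat each of the N members as an unbiased estimator a_i with the same variance σ² and the same pairwise error correlation ρ. Under that assumption, the variance of the averaged answer is:

```latex
\operatorname{Var}\!\left(\frac{1}{N}\sum_{i=1}^{N} a_i\right)
  = \frac{1}{N^2}\left(N\sigma^2 + N(N-1)\,\rho\,\sigma^2\right)
  = \sigma^2\left(\rho + \frac{1-\rho}{N}\right)
```

With ρ = 0 the variance falls as σ²/N; with ρ = 1 it stays at σ² however many members are added. Real panels sit somewhere in between, which is why partial independence is enough to help and why clones are not.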

Crucially, the judge is not a vote. It does reading-comprehension over answers: where members agree (high confidence), where they contradict (needs resolving), and what only one caught (recoverable signal a majority vote would have thrown away).
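The three stages can be sketched in a few lines. This is a toy illustration, not any real council's implementation: the panel members are plain functions standing in for model calls, and each answer is a hypothetical dictionary of claims so that agreement and contradiction can be checked mechanically.

```python
from concurrent.futures import ThreadPoolExecutor

def dispatch(panel, prompt):
    """Call every panel member in parallel; answers come back in panel order."""
    with ThreadPoolExecutor(max_workers=len(panel)) as pool:
        return list(pool.map(lambda member: member(prompt), panel))

def judge(answers):
    """Sort claims into consensus, conflicts, and points only one member raised."""
    by_key = {}
    for answer in answers:
        for key, value in answer.items():
            by_key.setdefault(key, []).append(value)
    analysis = {"consensus": {}, "conflicts": {}, "unique": {}}
    for key, values in by_key.items():
        if len(values) == 1:
            analysis["unique"][key] = values[0]          # only one member caught it
        elif len(set(values)) == 1:
            analysis["consensus"][key] = values[0]       # everyone who spoke agrees
        else:
            analysis["conflicts"][key] = sorted(set(values))  # needs resolving
    return analysis

def synthesize(analysis):
    """Toy synthesis: keep agreed and unique claims, list conflicts as unresolved."""
    final = dict(analysis["consensus"])
    final.update(analysis["unique"])
    final["unresolved"] = sorted(analysis["conflicts"])
    return final

# Three stand-in members (assumed, not real models) answering the same prompt.
panel = [
    lambda prompt: {"is_prime": True, "parity": "odd"},
    lambda prompt: {"is_prime": True, "parity": "even"},
    lambda prompt: {"is_prime": True, "next_prime": 19},
]
answers = dispatch(panel, "Is 17 prime?")
analysis = judge(answers)
final = synthesize(analysis)
print(final)
```

Note that `next_prime` survives even though only one member mentioned it. A majority vote over whole answers would have had no way to keep it.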

02 / 07

The fuel · decorrelated errors

Diversity is the fuel.

Why can a panel beat its best member? Because independent errors partly cancel when averaged. Drag the dials: more members and more diversity shrink the combined error; identical clones barely help at all.

FIG.02 — VARIANCE CANCELLATION · live simulation (1000 trials)

Readouts: best single member (avg error) · council, averaged (avg error) · error reduction. Controls: panel size (default 4) · diversity (default 0.70).

Each dot is one member's noisy estimate of the truth (the centre line). Average them and the cloud collapses toward the centre — but only if the noise is independent. Crank diversity to 0 (correlated errors, near-clones) and averaging barely moves the needle.

This is the whole game: a panel of three near-identical fine-tunes helps far less than three genuinely different lineages. It's why a different model family, even an expensive one, earns its council seat: its errors are largely uncorrelated with those of the rest of the panel.
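A minimal version of the simulation can be written with the standard library alone. The noise model here is an assumption made for illustration, not the page's actual code: each member's error mixes one shared component with one private component, so a diversity of d gives a pairwise error correlation of 1 − d. All members are statistically identical in this sketch, so "a single member" stands in for the best one.

```python
import math
import random

def simulate(panel_size, diversity, trials=1000, seed=0):
    """Return (avg single-member error, avg council error) over many trials."""
    rng = random.Random(seed)
    single_total = 0.0
    council_total = 0.0
    for _ in range(trials):
        shared = rng.gauss(0.0, 1.0)  # error every member has in common
        errors = [
            math.sqrt(1.0 - diversity) * shared
            + math.sqrt(diversity) * rng.gauss(0.0, 1.0)  # member-specific error
            for _ in range(panel_size)
        ]
        single_total += abs(errors[0])                   # one member on its own
        council_total += abs(sum(errors) / panel_size)   # the averaged council
    return single_total / trials, council_total / trials

for d in (0.0, 0.7, 1.0):
    single, council = simulate(panel_size=4, diversity=d)
    print(f"diversity={d:.1f}  single={single:.3f}  council={council:.3f}")
```

At diversity 0 every member carries the same error, so the average is no better than any one of them. As diversity rises, the private components start to cancel and the council's error drops below the single member's.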

03 / 07

The family · same shape, different knobs

Three systems, one shape.

There is a family here. They differ in what is diverse (samples vs models), how many rounds, and how answers are aggregated. All reduce to: dispatch → let answers see each other → aggregate.

OPENROUTER · FUSION

the production reference

One API call. The prompt fans out to a panel — each member gets web_search + bash, so they research independently. A judge writes a structured comparison, then synthesizes.

KARPATHY · LLM-COUNCIL

anonymized peer review

Members answer, then rank each other's answers with identities hidden — a deliberate counter to self-enhancement bias. A "Chairman" model compiles the final. A debate round in all but name.

TOGETHER · MIXTURE-OF-AGENTS

the layered ancestor

Stacks layers: proposers answer, their outputs feed the next layer as context ("improve on these"), repeat, then a final aggregator. A good aggregator ≠ a good proposer — separable skills.

FIG.03 — THE COMMON SHAPE · dispatch → (review / layer) → aggregate

// every council reduces to this
answers = parallel_map(panel, prompt)
// optional: let answers see each other
answers = review_round(answers)     // council / MoA layer
final   = aggregate(answers)        // vote | judge | synthesize

Fusion's value-add: a real judge analysis (consensus / contradictions / unique insights / blind spots) before the final write. The MoA paper's key idea: being a good panel member ≠ being a good judge — so put your best model in the seat where quality concentrates.

Real-world footgun: tool-enabled panels in Fusion were caught finding the grading rubric online and gaming it — you must blocklist eval sources.
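The common shape from FIG.03 can be made runnable with toy members. Everything here is a stand-in: `make_member` and `run_council` are hypothetical helpers, each member is a function rather than a model call, and an answer is just a set of facts, so that "seeing each other's answers" has a visible effect.

```python
from concurrent.futures import ThreadPoolExecutor

def make_member(known_facts):
    """Toy panel member: answers with what it knows plus anything shown to it."""
    def member(prompt, context):
        return set(known_facts).union(*context)
    return member

def parallel_map(panel, prompt, context=()):
    """Call every member in parallel with the same prompt and shared context."""
    with ThreadPoolExecutor(max_workers=len(panel)) as pool:
        return list(pool.map(lambda m: m(prompt, context), panel))

def review_round(panel, prompt, answers):
    """One layer: each member answers again with all previous answers as context."""
    return parallel_map(panel, prompt, context=answers)

def aggregate(answers):
    """Toy aggregator: merge everything the final layer produced."""
    return sorted(set().union(*answers))

def run_council(panel, prompt, review_rounds=1):
    answers = parallel_map(panel, prompt)
    for _ in range(review_rounds):
        answers = review_round(panel, prompt, answers)
    return answers, aggregate(answers)

panel = [make_member({"fact A"}), make_member({"fact B"}), make_member({"fact C"})]
first_pass = parallel_map(panel, "toy prompt")
reviewed, final = run_council(panel, "toy prompt", review_rounds=1)
print(first_pass)
print(reviewed)
print(final)
```

Setting `review_rounds=0` gives the single-pass shape; raising it gives the layered Mixture-of-Agents shape. In a real system the aggregator would be a model call, which is where the choice of vote, judge, or synthesize comes in.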

04 / 07

The real design axis · how you combine

Vote, judge, or synthesize.

The aggregation strategy is where the design lives. Switch between the three on the same panel and watch the final answer change — and see why one lone-correct member survives synthesize but dies under a naïve vote.

FIG.04 — SAME PANEL · THREE AGGREGATORS · OUTCOME FLIPS

PANEL ANSWERS · ✓ = correct · ✗ = wrong

VOTE · cheapest

majority of K samples

This is self-consistency. For checkable answers (math, multiple-choice) it gives large gains for almost no machinery — but only when answers are comparable tokens, not prose.

JUDGE · evaluative

score & pick one

A model scores the candidates and keeps the best existing answer. Adds an evaluation step but never writes anything new — so it can't merge two partial answers.

SYNTHESIZE · strongest

write a new combined answer

The judge writes a fresh answer from the candidates — merging the best of several and resolving contradictions. Default for open-ended work. Fusion, the Chairman, and MoA's aggregator all do this.
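The difference between the three aggregators shows up on a two-part toy question. This is a sketch under loud assumptions: the `TRUTH` table and `check` function are hypothetical stand-ins for the judge model's own reasoning, which in a real council has no answer key to consult.

```python
from collections import Counter

# Toy ground truth. It stands in for the judge's ability to verify a claim.
TRUTH = {"7*8": 56, "is 91 prime": False}

def check(key, value):
    return TRUTH[key] == value

def score(answer):
    """Number of parts of an answer that pass the check."""
    return sum(check(k, v) for k, v in answer.items())

def vote(answers):
    """Majority over whole answers, compared as tuples of their items."""
    counts = Counter(tuple(sorted(a.items())) for a in answers)
    return dict(counts.most_common(1)[0][0])

def judge(answers):
    """Score each candidate and keep the best existing one, unchanged."""
    return max(answers, key=score)

def synthesize(answers):
    """Write a new answer, taking each part from a candidate that passes the check."""
    final = {}
    for key in answers[0]:
        candidates = [a[key] for a in answers]
        good = [v for v in candidates if check(key, v)]
        final[key] = good[0] if good else Counter(candidates).most_common(1)[0][0]
    return final

panel_answers = [
    {"7*8": 56, "is 91 prime": True},    # right on arithmetic, wrong on primality
    {"7*8": 56, "is 91 prime": True},    # same answer again
    {"7*8": 54, "is 91 prime": False},   # lone member who knows 91 = 7 * 13
]
for name, aggregator in (("vote", vote), ("judge", judge), ("synthesize", synthesize)):
    result = aggregator(panel_answers)
    print(name, result, "correct parts:", score(result))
```

Vote returns the majority answer and loses the lone correct primality claim. Judge can only pick one existing candidate, and every candidate here is half right. Synthesize is the only strategy that can combine the correct arithmetic from one member with the correct primality claim from another.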

05 / 07

Does it really beat the best single model?

The whole beats its best part.

OpenRouter's DRACO benchmark: 100 deep-research tasks, 10 domains, a ~39-criterion weighted rubric with negative weights so you can't win by being verbose. Press Reveal and watch the panels pull ahead of every solo model.
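To see why negative weights blunt verbosity, consider a minimal weighted-rubric scorer. The criteria, weights, and normalisation below are invented for illustration; DRACO's actual rubric and scoring formula are not given in this lesson, so nothing here should be read as its method.

```python
def rubric_score(criteria, met):
    """Score = weights of triggered criteria / total positive weight, floored at 0.

    criteria: {criterion name: weight}, where a negative weight is a penalty.
    met: set of criterion names the answer triggered.
    """
    earned = sum(w for name, w in criteria.items() if name in met)
    best = sum(w for w in criteria.values() if w > 0)
    return max(0.0, earned / best)

# Hypothetical criteria, not taken from any real benchmark.
criteria = {
    "cites primary sources": 3.0,
    "answers the question asked": 5.0,
    "states uncertainty": 2.0,
    "pads with irrelevant material": -3.0,  # triggering this costs points
}

concise = {"cites primary sources", "answers the question asked"}
verbose = concise | {"states uncertainty", "pads with irrelevant material"}

print("concise:", rubric_score(criteria, concise))
print("verbose:", rubric_score(criteria, verbose))
```

The verbose answer hits one more positive criterion than the concise one, yet scores lower because the padding penalty outweighs it. Without the negative weight, saying more could only ever raise the score.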

FIG.05 — DRACO SCORES · panels (rose) vs solo models (ink)

RESULT 1

panel > every member

The best panel scores 69.0% — above its own members (Fable 5 at 65.3%, GPT-5.5 at 60.0%). The whole exceeds the best part.

RESULT 2

cheap + diverse wins

A budget panel (Flash + Kimi + DeepSeek) hits 64.7% — beating solo GPT-5.5 (60.0%) and solo Opus 4.8 (58.8%) at ~half the cost.

RESULT 3

self-fusion is real

Running Opus 4.8 twice and synthesizing scores 65.5% vs 58.8% solo, a gain of 6.7 points from the same model. Sample diversity counts too.

Cross-check, not one vendor's word: Together's MoA hits 65.1% on AlpacaEval 2.0 LC vs GPT-4o's 57.5% using only open-source models — the same "panel of weaker models > one strong model" shape, on a different benchmark, two years earlier.

06 / 07

Honest caveats · the judge is an LLM too

The judge inherits its own biases.

The judge is itself a model — so it carries documented failure modes. The worst is position bias: in some studies a judge prefers the first-positioned answer up to ~75% of the time. Swap the order and watch the verdict flip on identical answers.

FIG.06 — POSITION BIAS · same two answers, swapped order

THREE DOCUMENTED BIASES

Position: favours the answer in slot 1.
Self-enhancement: rates its own text higher (worse when it can recognise it).
Verbosity: over-rewards length.

The mitigations are concrete: randomise order or judge pairwise both ways (counters position bias); anonymise panel answers so the judge can't recognise its own (why llm-council hides identities); penalise length with negative-weight rubric criteria (what DRACO does).
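Two of these mitigations are easy to sketch. The helper names and the judge interface (a function returning "first" or "second") are assumptions made for this example, and the two judges are toy functions rather than models.

```python
import random

def pairwise_both_ways(judge, answer_1, answer_2):
    """Ask the judge twice with the order swapped; accept only a consistent verdict."""
    verdict_a = judge(answer_1, answer_2)   # answer_1 sits in slot 1
    verdict_b = judge(answer_2, answer_1)   # answer_2 sits in slot 1
    if verdict_a == "first" and verdict_b == "second":
        return answer_1
    if verdict_a == "second" and verdict_b == "first":
        return answer_2
    return None  # the verdict followed the position, not the content: call it a tie

def anonymise(answers_by_model, rng):
    """Strip model names and shuffle, keeping a private key to map labels back."""
    items = list(answers_by_model.items())
    rng.shuffle(items)
    labelled = {f"Response {i + 1}": text for i, (_, text) in enumerate(items)}
    key = {f"Response {i + 1}": model for i, (model, _) in enumerate(items)}
    return labelled, key

def biased_judge(first, second):
    return "first"                            # always prefers slot 1

def content_judge(first, second):
    return "first" if first == "4" else "second"   # prefers the answer "4"

print(pairwise_both_ways(biased_judge, "5", "4"))
print(pairwise_both_ways(content_judge, "5", "4"))

labelled, key = anonymise({"model-x": "4", "model-y": "5"}, random.Random(0))
print(labelled)
```

The swap does not make a biased judge unbiased. It makes the bias detectable, so a position-driven verdict is discarded instead of trusted.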

When in doubt, add judges: a single powerful judge has been measured with error rates over 50% on some tasks. A three-judge consensus recovers much of the reliability, and is itself a mini-council inside the council.
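A judge consensus can be sketched as a strict-majority rule, with a back-of-envelope error estimate alongside it. The estimate assumes a binary verdict and judges whose errors are independent with the same rate p; both are simplifications introduced here, not claims from the studies mentioned above.

```python
from collections import Counter

def consensus(verdicts):
    """Return the verdict backed by a strict majority of judges, else None."""
    winner, count = Counter(verdicts).most_common(1)[0]
    return winner if count > len(verdicts) / 2 else None

def majority_of_three_error(p):
    """Chance that at least two of three independent judges are wrong."""
    return 3 * p**2 * (1 - p) + p**3

print(consensus(["A", "A", "B"]))
print(consensus(["A", "B", "C"]))
print(round(majority_of_three_error(0.2), 3))
```

In this toy model a judge that errs 20% of the time becomes a three-judge panel that errs about 10% of the time. The gain depends on the judges erring independently, which is the same diversity argument that applies to the panel itself.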

07 / 07

How it connects · OpenAlice M11

A second mind, on demand.

The lab's M11 (flexible model-council / fusion) is our first-party take, built on the M10 provider-router: "call the router N times in parallel, then judge."

Knob	Default	Why
Panel	budget panel (cheap, diverse) · frontier for hard jobs	diversity is the fuel; most jobs don't need frontier members
Aggregation	synthesize for open-ended · vote for closed	match the strategy to the answer-shape
Judge	strongest available (Opus 4.8, high effort)	judge choice alone swings scores 10–25 points
Anonymise	panel answers hidden from judge; no self-judging	imports llm-council's bias mitigations directly
Codex	one distinct council voice, not a workhorse	a different lineage = decorrelated errors = genuine asset
Rounds	1 (dispatch→judge→synthesize); debate round optional	more rounds ≈ diminishing returns at 2–3× cost
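The knobs in the table could be captured in a small configuration object. This is a hypothetical sketch: `CouncilConfig`, `config_for`, the field names, and the placeholder model names are all invented for illustration and are not M11's actual interface.

```python
from dataclasses import dataclass

AGGREGATORS = ("vote", "judge", "synthesize")

@dataclass(frozen=True)
class CouncilConfig:
    # Placeholder names; a real panel would list diverse model lineages.
    panel: tuple = ("budget-model-1", "budget-model-2", "budget-model-3")
    aggregation: str = "synthesize"      # default for open-ended work
    judge: str = "strongest-available"
    anonymise: bool = True               # hide panel identities from the judge
    rounds: int = 1                      # dispatch, judge, synthesize

    def __post_init__(self):
        if self.aggregation not in AGGREGATORS:
            raise ValueError(f"unknown aggregation: {self.aggregation}")
        if self.rounds < 1:
            raise ValueError("rounds must be at least 1")
        if self.judge in self.panel:
            raise ValueError("no self-judging: the judge must not sit on the panel")

def config_for(answer_shape):
    """Match the aggregation strategy to the answer shape, as in the table."""
    aggregation = "vote" if answer_shape == "closed" else "synthesize"
    return CouncilConfig(aggregation=aggregation)

print(config_for("closed"))
print(config_for("open-ended"))
```

Putting the no-self-judging rule in the constructor means a misconfigured council fails at setup time, before any model call is paid for.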

IT'S A SCALPEL, NOT A HAMMER

A council is 2–3× slower and N×+ more expensive — you pay for N panel calls plus a judge call. Never the hot path. Wire it in at high-stakes decision points: deep research, irreversible actions, a self-edit gate.
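The cost and latency shape follows directly from the structure: panel calls run in parallel, then the judge runs once. A rough model, with made-up numbers in arbitrary units chosen only to illustrate the arithmetic:

```python
def council_cost(panel_costs, judge_cost):
    """Total spend: every panel call plus one judge call."""
    return sum(panel_costs) + judge_cost

def council_latency(panel_latencies, judge_latency):
    """Panel calls run in parallel, so wait for the slowest, then the judge."""
    return max(panel_latencies) + judge_latency

# Illustrative figures only (assumed, not measured).
panel_costs = [1, 1, 1]
panel_latencies = [10, 12, 9]
judge_cost, judge_latency = 2, 15

print("cost:", council_cost(panel_costs, judge_cost), "vs solo", panel_costs[0])
print("latency:", council_latency(panel_latencies, judge_latency), "vs solo", panel_latencies[1])
```

Cost grows with the sum of the panel, while latency grows only with the slowest member plus the judge. That is why a council can be several times more expensive while staying within a small multiple of single-model latency.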

The same diversity argument is why vote can wash out a lone-correct member, while synthesize can recover it — but only if the judge notices it under "unique insights." Garbage judge → lost signal. The judge prompt is doing real work and deserves real engineering.

WHAT IT BUYS NAO

"The opinion of another model," on demand — a second mind for decisions worth it. A council is the natural home for a critic, a verifier, or a dissenting voice: exactly what an agentic system needs where being more-right matters.

Adjacent in the stack: fusion over retrieval-grounded members overlaps GraphRAG (diverse retrieval + diverse synthesis); Mixture-of-Agents is the layered generalisation if we ever want depth over a single judge pass.

03 · 02 — you made it

You ran a council.

Dispatch, judge, synthesize. Why decorrelated errors cancel. The three real systems, the three aggregators, the DRACO evidence, the judge's own biases. Many models beat one — because their errors are partially independent and a judge cancels them. Spend it selectively; it's a scalpel.

Related · 01 · 05

Mixture-of-Experts

A council routes a whole prompt to many models. MoE routes each token to a few specialist sub-networks — the same "many minds" idea, one layer down.
