openalicelabs / academy
COURSE SYS-03 LESSON 03 · 02 TOPIC MIXTURE-OF-AGENTS EST. READ ~12 MIN
OPENALICE LABORATORIES · EDUCATION PATH · SYSTEMS 03 · 02

Many models,
stacked so they
sharpen each other.

One model answers alone, and you get one model's blind spots. Mixture-of-Agents runs a panel of LLMs, feeds all their answers into another panel, feeds those into a final aggregator — a few layers deep. The strange part: a stack of mid-tier open-source models out-reasons a single frontier model.

FIG.00 — QUERY → LAYERS → ANSWER
loading…
FIG.0A — THE ARCHITECTURE · proposers → aggregator → proposers → … → final answer

A query flows in. Each layer's proposers answer in parallel; their answers are glued together as references and handed to the next layer, which answers again — better, because it can see everyone's draft. A final aggregator writes the response.

INPUTone user query
STRUCTUREl layers × n agents per layer
COMBINEsynthesize references — NOT vote / pick
HEADLINEopen-source MoA 65.1% > GPT-4o 57.5% (AlpacaEval 2.0)
WHENinference-time only — no weights change
TAXserial layers → slow first token
01 / 07
The intuition · circulate the draft

Not a committee once — a relay of committees.

Ask one committee a hard question and you get a noisy first draft. Now circulate that draft through another committee — and another — each one reading every prior answer and improving on it. That relay is Mixture-of-Agents.

ONE MODEL

one set of blind spots

A single LLM has fixed weaknesses baked in from training. Whatever it gets wrong, it gets wrong with confidence — there's nobody to catch it.

ONE COUNCIL ROUND

errors partly cancel

Independently trained models fail in partly independent ways. Pool several answers and a synthesizer can cancel the noise — this is the fusion idea.

MoA ★

fusion, stacked

Take the synthesized answer and feed it back as input to another panel. Each layer improves on the last. Iterating gives monotonic gains.

THE UNCOMFORTABLE FINDING — "COLLABORATIVENESS"

An LLM generates a better answer when shown other models' answers — even when those other answers came from weaker models. The aggregator isn't picking a winner; it's doing genuine synthesis, pulling good fragments from many drafts.

That single observation is what makes MoA more than a curiosity. If a model only ever copied the best answer it saw, a panel would be capped at its strongest member. Instead each layer can produce something none of its inputs contained — which is why a stack of mid-tier open models can out-reason one frontier model.

The whole rest of this lesson is mechanism: two roles, the layered formula, the one magic prompt, the numbers, and the one way it breaks.

02 / 07
Proposers & aggregators · the only two jobs

A weak-but-different voice is still worth hearing.

MoA splits every model into one of two roles. The trick is that they want opposite things: proposers are valued for diversity, the aggregator for capability.

FIG.02 — PROPOSER vs AGGREGATOR · click to compare
PROPOSER AGGREGATOR

proposer → many diverse drafts // value = perspective aggregator → one synthesized answer // value = capability

A proposer's job is not to be right. It's to offer a "nuanced and different perspective" — a reference the aggregator can mine. So a weaker model from a different lineage is genuinely useful: different training data = different, partly-independent errors.

The aggregator is the capability-sensitive seat — put your strongest model there. And the same model can play both roles in different layers; a good aggregator is usually a good proposer too.

03 / 07
The layered structure · the formula made live

Build the stack. Watch the cost.

The architecture is l layers × n agents. Drag the dials below and watch the stack redraw — more depth gives better answers but serializes more rounds, and the LLM-call count climbs. Press Run query to send a token down the layers.

FIG.03 — LIVE MoA STACK · drag layers & proposers
flows top → bottom, layer by layer
// output of layer i y_i = [ A_i,j(x_i) for j=1..n ] + x_1 x_i+1 = y_i // becomes next input = apply the Aggregate-and-Synthesize prompt + = re-attach the original query x_1

Two pieces matter. is not a sum — it means "glue the agents' outputs in as numbered references and tell the next model to synthesize them." And every layer re-attaches x₁, the real question, so the chain can never drift off-topic.


LAYERS 3
PROPOSERS 4
LLM CALLS PER QUERY
SERIAL ROUNDS (LATENCY)
EST. QUALITY (illustrative)

The published sweet spot is 3 layers × 6 proposers ("Together MoA"); the 2-layer "MoA-Lite" already matches GPT-4o. Quality here is an illustrative curve — real gains are monotonic-but-diminishing in depth.

04 / 07
Synthesis > selection · the core claim

The aggregator writes — it doesn't pick.

A baseline that just selects the best proposed answer loses to one that writes a combined one. Click fragments from the proposals below and watch the aggregator weave a single answer better than any input — including from answers that were worse overall.

QUERY · "Name three benefits of cycling to work."

Notice the dashed proposal is weaker overall — yet it holds the best line about cost. Synthesis lets the aggregator take that one good fragment.

FIG.04 — AGGREGATOR OUTPUT · synthesized live

FRAGMENTS PULLED0
SOURCED FROM— proposers
MODEsynthesis (write)
SELECTION

capped at the best input

An LLM-ranker that picks one whole answer can never beat its strongest proposal. Good fragments inside weaker answers are thrown away.

DIVERSITY WINS

6 models > 6 samples

Six different proposers beat six samples of one model. Different lineages = more independent errors to cancel.

HONEST COST

longer answers

Synthesis tends to inflate length — MoA slightly loses on conciseness even as it wins on correctness and factuality.

05 / 07
The one prompt that does all the work

The whole magic is one system prompt.

There is no special architecture, no fine-tuning, no new loss. The "fusion" happens entirely because of how the aggregator is asked. This text is identical in the paper, the reference repo, and OpenAlice's own hermes-agent implementation.

FIG.05 — THE AGGREGATE-AND-SYNTHESIZE PROMPT (verbatim)
You have been provided with a set of responses from various open-source models to the latest user query. Your task is to synthesize these responses into a single, high-quality response. It is crucial to critically evaluate the information provided in these responses, recognizing that some of it may be biased or incorrect. Your response should not simply replicate the given answers but should offer a refined, accurate, and comprehensive reply to the instruction. Ensure your response is well-structured, coherent, and adheres to the highest standards of accuracy and reliability. Responses from models: 1. <response one> 2. <response two> ...

References are appended as a numbered list (each prepended with \n{i+1}. ) — and injected only when the list is non-empty, so layer 1 with no references is just a plain answer. The clauses "critically evaluate" and "do not replicate" are exactly what turn concatenation into synthesis.

Keep that "some may be incorrect" clause in mind — in §07 we'll see it is not a reliable defense against a deliberately deceptive reference.

TEMPERATURE SPLIT

diverse in, focused out

Proposers run hot (e.g. 0.6–0.7) for variety; the aggregator runs cooler (~0.4) for focused synthesis.

GRACEFUL DEGRADATION

a dead proposer ≠ a dead call

A minimum-successful-references threshold means one failed proposer doesn't kill the query — the aggregator works with whoever answered.

RE-ATTACH x₁

the question always returns

Every layer re-sees the original user query alongside the references, so a long chain can't drift away from what was actually asked.

06 / 07
The benchmark · open-source beats the frontier

65.1% > 57.5%. No proprietary model.

AlpacaEval 2.0, length-controlled win rate. An MoA built entirely from open-source models beat GPT-4 Omni by +7.6 points — and the single best proposer alone managed only 43.9%. The whole is far more than its parts. Hover the bars.

FIG.06 — ALPACAEVAL 2.0 · LENGTH-CONTROLLED WIN RATE

Bars animate in on scroll. These are LLM-judged win rates — length-controlled to remove the "longer = better" bias, but still "preferred by a judge," not ground-truth correctness. The +7.6 headline is real and replicated; read it as a preference signal.

MT-BENCH

9.25 vs 9.19

Near-saturated, so gains are small but consistent. With GPT-4o as aggregator, MoA reaches 9.40.

FLASK

wins 6 of 7 axes

Beats the best single proposer on robustness, correctness, factuality, completeness, insightfulness, metacognition — loses only on conciseness.

COST

can be cheaper

Open-weight token prices make MoA-Lite match GPT-4o's cost while beating its quality; some configs hit GPT-4-Turbo quality at ~2× lower cost.

ConfigurationProposersLayersAggregatorLC win rate
MoA w/ GPT-4o6 open-source3GPT-4o65.7% ±0.7
MoA (all open)6 open-source3Qwen1.5-110B-Chat65.1% ±0.6
MoA-Litesame 62Qwen1.5-72B-Chat59.3% ±0.2
GPT-4 Omni (May 2024)single model57.5%
Qwen1.5-110B-Chatbest single proposer43.9%
07 / 07
The failure mode · "This Is Your Doge"

One deceptive proposer erases nearly all the gain.

The same property that makes MoA strong — every layer re-reads every prior answer — is its weakness. A 2025 follow-up injected one carefully-instructed deceptive agent into a 6-agent / 3-layer stack. Flip the switch and watch the poison propagate up the layers.

FIG.07 — DECEPTION SIMULATOR · 6 agents × 3 layers
all 6 trusted

STACK PERFORMANCE49.2%
DECEPTIVE PROPOSERS0 of 6
GAIN REMAINING100%
WHY THE "MAY BE INCORRECT" CLAUSE FAILS

The aggregator's instruction to "critically evaluate… some may be incorrect" does not reliably make it reject a confidently-wrong reference. The deceptive answer keeps getting re-presented at every layer, so the corruption propagates upward instead of being filtered out.

The measured drop: 49.2% → 37.9% — essentially all the MoA advantage, gone, from a single poisoned voice. Defenses (a larger/stronger aggregator, limiting how much each agent sees) help only partially.

THE PRODUCTION RULE

Only ever stack models you trust. A MoA panel of untrusted, third-party agents isn't just a quality boost — it's an attack surface. This is the multi-agent twin of prompt-injection discipline.

03 · 02 — you made it

You stacked
a council.

Proposers and aggregators. The layered formula. The one synthesis prompt. The benchmark where open beats frontier, and the single deceptive voice that erases it all. MoA is just fusion, stacked — and a real lever, not an academic flex. You now know how to build a council that sharpens itself.

03·01 Fusion & LLM councils · why a panel of models beats one ✓ done
03·02 Mixture-of-Agents · councils, stacked into layers ✓ complete
03·03 Model routing · picking which model for which task next
03·04 RLHF & alignment · making any single voice worth trusting locked
Next · 03 · 04

RLHF & Alignment →

A council is only as trustworthy as its members. Now see how a single model is tuned to be helpful, honest, and worth seating on the panel.

openalicelabs