openalicelabs / academy
COURSE ARCH-02 LESSON 02 · 06 TOPIC MIXTURE-OF-EXPERTS EST. READ ~13 MIN
OPENALICE LABORATORIES · EDUCATION PATH · ARCHITECTURE 02 · 06

Many small experts, one router.

A dense network makes every token pay the full cost of all its knowledge. A Mixture-of-Experts layer splits that one big feed-forward network into many smaller experts, and a tiny learned router sends each token to just a couple of them. Huge total capacity, tiny active compute — the trick behind Mixtral and DeepSeek-V3.

FIG.00 — TOKEN → TOP-k EXPERTS
FIG.0A — THE WHOLE TRICK · router picks k of N · only those experts run · gated sum out

One token vector flows in. The router scores all N experts, keeps the top-k, and zeroes the rest. Only the chosen experts compute. Their outputs are combined as a gate-weighted sum — the dormant experts cost nothing this turn.

WHAT IT REPLACES: the single feed-forward (FFN) sub-block · attention untouched
EXPERTS · N: 8 (Mixtral) → 257 (DeepSeek-V3: 256 routed + 1 shared) parallel FFNs
ACTIVE · k: 1 (Switch) · 2 (Mixtral) · 8 (DeepSeek-V3, plus the shared expert)
DECOUPLES: total parameters from per-token FLOPs
THE HARD PART: load balancing, keeping the router smart AND fair
01 / 08
Intuition first · the hospital

One genius doctor, or a roster of specialists?

A dense network is one doctor who memorised all of medicine and sees every patient. A Mixture-of-Experts hospital puts a triage nurse at the door and a roster of specialists behind it — and every patient only pays for the two doctors they actually see.

THE NURSE

the router / gate

A tiny learned layer glances at each token and decides which specialists it needs. Cheap to run, but its judgment is now critical — a bad nurse starves the hospital.

THE SPECIALISTS ★

the experts

Many smaller feed-forward networks. The hospital collectively knows the sum of all of them, but any one visit touches only the top-k.

THE WIN

knowledge ≠ cost

Add specialists to grow expertise without making any single visit slower. Total parameters and per-token FLOPs are now decoupled — conditional, sparse computation.

DON'T CONFUSE IT WITH MIXTURE-OF-AGENTS

MoE routes tokens to weights — sub-networks inside one model, in a single forward pass. Mixture-of-Agents routes prompts to whole models — several complete LLMs answer and get aggregated. Both have "a router" and "experts," but they live at completely different abstraction levels.

The nurse analogy also exposes the catch that drives half of MoE research: if the nurse sends everyone to the cardiologist, the other specialists never learn and the cardiologist is overwhelmed. That rich-get-richer collapse is the failure mode every MoE has to fight. You'll watch it happen, and then watch it get fixed, in §04.

Concretely: an MoE layer is a drop-in swap for the standard feed-forward block of a transformer. Attention stays exactly as it is — only the FFN becomes N experts + a router.

02 / 08
The mechanism · a gated sum

Swap one FFN for N FFNs plus a gate.

Mathematically an MoE layer is just a weighted sum over experts. The whole magic is that the gate G(x) is sparse — almost every entry is exactly zero, so almost every expert is skipped.

$$y = \sum_{i=1}^{N} G(x)_i \cdot \mathrm{FFN}_i(x)$$
// G(x) is SPARSE: most entries are 0
// if G(x)ᵢ = 0 → FFNᵢ never runs

Compare to a dense layer, which is just y = FFN(x) — one network, run in full, every time. The MoE form looks heavier, but because almost all G(x)ᵢ are zero, you only ever evaluate the k experts with nonzero gate. That is where the FLOPs savings come from.

And there's always a residual connection wrapping the layer, so even a token that an over-capacity expert drops (more on that in §04) still passes through unchanged. The model never loses the token entirely.
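
The gated sum and its residual wrapper can be sketched in a few lines of plain Python. This is a toy under stated assumptions, not a real transformer layer: the "experts" are hypothetical scaling functions, the gate vector is supplied by hand, and the names `moe_layer`, `make_expert`, and `calls` are invented for the illustration. The point is only that an expert with a zero gate is never called, and that an all-zero gate returns the input unchanged.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Expert = Callable[[Vector], Vector]

def moe_layer(x: Vector, gate: Sequence[float], experts: Sequence[Expert]) -> Vector:
    """Residual plus gate-weighted sum over the experts whose gate is nonzero."""
    out = list(x)  # residual path: start from the input itself
    for g, expert in zip(gate, experts):
        if g == 0.0:
            continue  # sparse gate: this expert is never evaluated
        y = expert(x)
        out = [o + g * yi for o, yi in zip(out, y)]
    return out

calls: List[int] = []  # records which toy experts actually ran

def make_expert(idx: int) -> Expert:
    """Hypothetical stand-in for an FFN: multiplies the input by idx + 1."""
    def expert(v: Vector) -> Vector:
        calls.append(idx)
        return [(idx + 1) * vi for vi in v]
    return expert

experts = [make_expert(i) for i in range(4)]
x = [1.0, 2.0]

# Two of four experts have a nonzero gate, so only those two run.
y = moe_layer(x, gate=[0.0, 0.75, 0.0, 0.25], experts=experts)

# A "dropped" token: every gate is zero, so only the residual survives.
dropped = moe_layer(x, gate=[0.0, 0.0, 0.0, 0.0], experts=experts)
```

Here `y` is `[3.5, 7.0]` (the input plus 0.75 times expert 1 plus 0.25 times expert 3), `calls` ends up as `[1, 3]`, and `dropped` equals the input.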

FIG.02 — DENSE FFN vs MoE LAYER · same slot in the transformer block

Left: one fat FFN, always fully evaluated. Right: a router taps just 2 of the experts — the greyed, dashed experts are resident in memory but idle this token.

03 / 08
The core algorithm · noisy top-k gating

Drive the router yourself.

The router is a tiny linear layer + softmax. To make it sparse you keep only the top-k logits and send the rest to −∞. Add a touch of noise for exploration and balancing. Pick a token, set k, toggle the noise — watch the gate select experts live.

FIG.03 — LIVE NOISY TOP-k GATE · 8 EXPERTS

Readout: router probability per expert · solid bar = selected
Controls: TOKEN (" the", " 2017", "def ", " 日本", " 🍣"), whose affinity to each expert changes the routing · TOP-k (k=1 to k=4), how many experts run per token
// noisy top-k gating (Shazeer 2017)
1. $H(x) = x \cdot W_g + \mathcal{N}(0, 1) \cdot \mathrm{softplus}(x \cdot W_n)$
2. $\mathrm{KeepTopK}(H, k)$ → keep top-k, rest = $-\infty$
3. $G(x) = \mathrm{softmax}(\mathrm{KeepTopK}(H, k))$
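
The three gating steps above can be sketched in standard-library Python. Assumptions to note: `clean_logits` and `noise_logits` are hypothetical stand-ins for the already-computed products x·Wg and x·Wn (no weight matrices are modelled), and passing `rng=None` plays the role of the noise toggle. Entries outside the top-k are simply given weight 0, which is what softmax produces for a −∞ logit.

```python
import math
import random
from typing import List, Optional, Sequence

def softplus(z: float) -> float:
    # numerically stable log(1 + exp(z))
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def noisy_top_k_gate(
    clean_logits: Sequence[float],
    noise_logits: Sequence[float],
    k: int,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Return G(x): softmax over the top-k noisy logits, 0 everywhere else."""
    # Step 1: H = clean logit + N(0, 1) * softplus(noise logit)
    h = []
    for c, n in zip(clean_logits, noise_logits):
        noise = rng.gauss(0.0, 1.0) * softplus(n) if rng is not None else 0.0
        h.append(c + noise)
    # Step 2: KeepTopK, i.e. remember the indices of the k largest entries
    top = sorted(range(len(h)), key=lambda i: h[i], reverse=True)[:k]
    # Step 3: softmax restricted to the kept entries (max-subtracted for stability)
    peak = max(h[i] for i in top)
    exps = {i: math.exp(h[i] - peak) for i in top}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0 for i in range(len(h))]

clean = [2.0, 0.5, -1.0, 1.5, 0.0, -0.5, 0.3, 1.0]  # 8 experts, made-up affinities
noise = [0.0] * 8

quiet = noisy_top_k_gate(clean, noise, k=2)                        # noise off
noisy = noisy_top_k_gate(clean, noise, k=2, rng=random.Random(0))  # noise on
```

With noise off, experts 0 and 3 are selected because they hold the two largest logits, and expert 0 gets the larger gate weight. With noise on, the selection can change from call to call, but exactly k entries stay nonzero and they still sum to 1.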
04 / 08
The central problem · rich-get-richer

Watch the router collapse — then fix it.

Left alone, a few experts hog all the traffic: they train faster, become more attractive, attract even more tokens, and the rest atrophy. You pay for N experts but use 2. The cure is an auxiliary load-balancing loss. Run the simulation with the aux-loss off, then on.

FIG.04 — ROUTING SIMULATION · 8 EXPERTS · tokens routed per step
AUX-LOSS OFF (collapse) · initial readout
STEP: 0
EXPERTS DOING REAL WORK: 8 / 8
TOKENS DROPPED (over capacity): 0
LOAD IMBALANCE (max ÷ mean): 1.0×
// auxiliary load-balancing loss
$$L_{\text{aux}} = \alpha \cdot N \cdot \sum_{i=1}^{N} f_i \cdot P_i$$
$f_i$ = fraction of tokens routed to expert i (hard)
$P_i$ = mean router probability for expert i (soft)

Because fᵢ follows Pᵢ, the sum Σᵢ fᵢ·Pᵢ is smallest when load is uniform (fᵢ ≈ Pᵢ ≈ 1/N), where the loss bottoms out at α. Minimizing it therefore nudges the router toward an even split without dictating which token goes where. Pᵢ is the differentiable handle gradients actually push on; fᵢ is the discrete count it shadows.
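
A small worked example makes the loss concrete. The sketch below assumes top-1 routing and hand-written router probabilities; the function name `load_balance_loss` and the two toy batches are invented for illustration. It evaluates α · N · Σᵢ fᵢ · Pᵢ once for a perfectly balanced batch and once for a collapsed one.

```python
from typing import List, Sequence

def load_balance_loss(
    router_probs: Sequence[Sequence[float]],  # per token: softmax over experts
    assignments: Sequence[int],               # per token: the expert it was sent to (top-1)
    num_experts: int,
    alpha: float = 0.01,
) -> float:
    """alpha * N * sum_i f_i * P_i for one batch of tokens."""
    num_tokens = len(assignments)
    # f_i: hard fraction of tokens routed to expert i
    f: List[float] = [0.0] * num_experts
    for expert in assignments:
        f[expert] += 1.0 / num_tokens
    # P_i: mean router probability assigned to expert i
    p = [sum(tok[i] for tok in router_probs) / num_tokens for i in range(num_experts)]
    return alpha * num_experts * sum(fi * pi for fi, pi in zip(f, p))

# Balanced: 4 tokens, 4 experts, uniform probabilities, one token per expert.
balanced = load_balance_loss([[0.25] * 4] * 4, [0, 1, 2, 3], num_experts=4)

# Collapsed: every token is nearly certain about expert 0 and is routed there.
collapsed = load_balance_loss([[0.97, 0.01, 0.01, 0.01]] * 4, [0, 0, 0, 0], num_experts=4)
```

The balanced batch gives 0.01, which is exactly α, while the collapsed batch gives about 0.0388, roughly four times larger. That gap is the pressure that pushes the router back toward an even split.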

α stays tiny (≈0.01): too large, and the balancing term overrides the real task loss and hurts quality. A second auxiliary term, the router z-loss, penalizes oversized router logits to keep the gate's exp() from overflowing. It exists purely for training stability.
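
The text does not spell out the z-loss, so the sketch below assumes the commonly cited form from the ST-MoE work: the mean over tokens of the squared log-sum-exp of the router logits. The function name `router_z_loss` is invented here, and in training the result would be scaled by its own small coefficient, whose value is not given above.

```python
import math
from typing import Sequence

def router_z_loss(logits_per_token: Sequence[Sequence[float]]) -> float:
    """Mean over tokens of (log sum_j exp(logit_j)) squared (assumed form)."""
    total = 0.0
    for logits in logits_per_token:
        peak = max(logits)
        # stable log-sum-exp: subtract the max before exponentiating
        lse = peak + math.log(sum(math.exp(z - peak) for z in logits))
        total += lse ** 2
    return total / len(logits_per_token)

small = router_z_loss([[0.0, 0.0, 0.0, 0.0]])      # modest logits
big = router_z_loss([[10.0, 10.0, 10.0, 10.0]])    # same routing, larger magnitude
huge = router_z_loss([[1000.0, 0.0]])              # would overflow a naive exp()
```

`small` and `big` describe the same uniform routing decision, yet `big` is penalized far more. That is the intent: the term punishes logit magnitude, not the routing choice itself.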

When an expert overflows its fixed capacity buffer, the extra tokens are dropped — they skip the MoE and ride the residual through unchanged. A quiet quality leak, and a strong reason balance matters.
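
Token dropping is easy to see with a toy dispatcher. The sketch assumes the usual capacity rule, capacity = ceil(tokens ÷ experts × capacity factor), with top-1 routing and first-come-first-served buffers; the function name and the example assignments are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

def dispatch_with_capacity(
    assignments: Sequence[int],   # per token: the expert the router chose (top-1)
    num_experts: int,
    capacity_factor: float = 1.25,
) -> Tuple[int, List[int], List[int]]:
    """Return (capacity, kept token ids, dropped token ids)."""
    capacity = math.ceil(len(assignments) / num_experts * capacity_factor)
    load = [0] * num_experts
    kept: List[int] = []
    dropped: List[int] = []
    for token, expert in enumerate(assignments):
        if load[expert] < capacity:
            load[expert] += 1
            kept.append(token)
        else:
            dropped.append(token)  # over capacity: token rides the residual only
    return capacity, kept, dropped

# 8 tokens, 4 experts, capacity factor 1.0 -> each expert buffers 2 tokens.
skewed = dispatch_with_capacity([0, 0, 0, 0, 0, 1, 2, 3], num_experts=4, capacity_factor=1.0)
even = dispatch_with_capacity([0, 1, 2, 3, 0, 1, 2, 3], num_experts=4, capacity_factor=1.0)
```

In the skewed batch, expert 0 accepts tokens 0 and 1 and drops tokens 2, 3, and 4. In the even batch nothing is dropped. Raising the capacity factor shrinks the drop count at the cost of larger, partly empty buffers.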

05 / 08
State of the art · DeepSeekMoE & V3

Fine-grained, shared, and loss-free.

DeepSeek refined the recipe with three ideas the frontier adopted: split experts smaller, keep a few always on, and balance load without an auxiliary loss at all.

FINE-GRAINED

more, smaller experts

Split each expert into m smaller ones (each with 1/m of the hidden dim) and activate m times as many. Active FLOPs are unchanged, but the ways to combine experts explode combinatorially, giving far richer routing.

SHARED EXPERTS ★

always-on common ground

Reserve a few experts that run for every token, absorbing common knowledge (grammar, general patterns) so the routed experts don't each redundantly relearn it.

LOSS-FREE BALANCE

a bias control loop

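V3 drops the aux loss as its main balancing tool (it keeps only a very small sequence-wise balance term as a safeguard). Each expert gets a bias bᵢ added only to the selection score. The bias is not trained by gradient descent: a controller nudges it up for underloaded experts and down for overloaded ones to even out load, so no balancing gradient fights the task.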
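
To put a number on the fine-grained card's "explode combinatorially" claim, here is a quick count using Python's `math.comb`. The values N = 16, k = 2, and m = 4 are illustrative choices, not figures from the text above, and the hidden width of 4096 is likewise made up.

```python
from math import comb

n_experts, top_k, m = 16, 2, 4   # illustrative values
hidden = 4096                    # hypothetical FFN hidden width per coarse expert

# Coarse: choose 2 of 16 experts.
coarse_combos = comb(n_experts, top_k)

# Fine-grained: each expert split into m pieces, m times as many activated.
fine_combos = comb(n_experts * m, top_k * m)

# Active hidden units per token are the same in both layouts.
coarse_active = top_k * hidden
fine_active = (top_k * m) * (hidden // m)
```

`coarse_combos` is 120 and `fine_combos` is 4,426,165,368, while `coarse_active` and `fine_active` are both 8192. The compute is identical, and the number of possible expert combinations is tens of millions of times larger.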

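// DeepSeekMoE layer: shared + routed, with residual
$$h_t = u_t + \sum_{i=1}^{K_s} \mathrm{FFN}_i(u_t) + \sum_{i=K_s+1}^{mN} g_{i,t} \cdot \mathrm{FFN}_i(u_t)$$
// first sum: the K_s shared experts, always on
// second sum: routed experts, with g_{i,t} nonzero only for the top-(mK − K_s) chosen for token t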

DeepSeek-V3 ships 256 routed + 1 shared expert, top-8 routing, and the loss-free balancer — yielding 671B total / 37B active parameters. The MoE is only half the story; it's wrapped with Multi-head Latent Attention (KV-cache compression) and multi-token prediction.

WHY LOSS-FREE WINS

An auxiliary loss is a regularizer fighting the task loss, so balance and quality trade off. V3's bias bᵢ only steers who gets picked, never the gate weight that scales the output, so the controller can equalize load without adding a competing gradient to training.
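
The control loop can be sketched as follows. This is a toy under loud assumptions: the affinity scores are made-up positive numbers, the fixed `step` size and the "compare to mean load" rule are a simplified reading of the idea, and the names `route_with_bias`, `update_bias`, and `batch_load` are hypothetical. What it does preserve is the key separation: the bias affects selection only, while gate weights are computed from the raw scores.

```python
import random
from typing import Dict, List, Sequence

def route_with_bias(scores: Sequence[float], bias: Sequence[float], k: int) -> Dict[int, float]:
    """Select top-k by score + bias; weight the chosen experts by raw score only."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i] + bias[i], reverse=True)
    chosen = ranked[:k]
    total = sum(scores[i] for i in chosen)
    return {i: scores[i] / total for i in chosen}

def update_bias(bias: Sequence[float], load: Sequence[int], step: float = 0.005) -> List[float]:
    """Raise the bias of underloaded experts, lower it for overloaded ones."""
    mean = sum(load) / len(load)
    return [b + step if l < mean else b - step if l > mean else b
            for b, l in zip(bias, load)]

rng = random.Random(0)
base = [0.6, 0.5, 0.4, 0.3]  # toy affinities: expert 0 is the natural favourite

def batch_load(bias: Sequence[float], n_tokens: int = 512) -> List[int]:
    load = [0] * len(base)
    for _ in range(n_tokens):
        scores = [b + rng.uniform(-0.15, 0.15) for b in base]
        for i in route_with_bias(scores, bias, k=1):
            load[i] += 1
    return load

def imbalance(load: Sequence[int]) -> float:
    return max(load) / (sum(load) / len(load))  # max ÷ mean

bias = [0.0] * 4
before = imbalance(batch_load(bias))
for _ in range(300):
    bias = update_bias(bias, batch_load(bias))
after = imbalance(batch_load(bias))

# Bias changes who is picked, not how the picked experts are weighted.
picked = route_with_bias([0.9, 0.8, 0.2, 0.1], [-1.0, 0.0, 0.0, 1.0], k=2)
```

In this toy, `after` lands well below `before`, since the controller has learned to handicap the favourite. `picked` selects experts 3 and 1, yet their weights (0.1/0.9 and 0.8/0.9) come from the raw scores alone.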

Reported results: DeepSeekMoE 16B performs comparably to Llama-2-7B at ~40% of the compute; the 2B variant matched a GShard 2.9B model, which has 1.5× the expert parameters and compute. Sparsity, spent well.

06 / 08
Build the budget · active vs total

Total params are big. Active params are small.

This is the number that makes MoE matter. Dial N and k and watch total capacity grow while the per-token compute barely moves — then load a real frontier preset and see the gap.

FIG.06 — PARAMETER BUDGET · total (held in VRAM) vs active (FLOPs paid per token)

Presets: Dense 7B · Mixtral 8×7B · DeepSeek-V3 · Switch-C
Readout for EXPERTS N = 8, ACTIVE k = 2 (the Mixtral 8×7B figures):
TOTAL PARAMS: 47B
ACTIVE / TOKEN: 13B
SPARSITY (active ÷ total): 28%
VRAM TO SERVE: ~94 GB

The catch, baked in: VRAM tracks TOTAL, FLOPs track ACTIVE. MoE buys capacity with memory, not compute — every expert must stay resident even though only k run per token.
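
The budget arithmetic can be reproduced with a back-of-envelope model: total = always-on parameters + N × per-expert parameters, and active = always-on parameters + k × per-expert parameters. Everything in the sketch is an assumption layered on the rounded 47B / 13B figures above. The split into `always_on_b` and `expert_b` is derived from those two numbers rather than taken from a model card, and the VRAM figure assumes 2 bytes per parameter (16-bit weights) and counts weights only.

```python
from typing import Dict

def moe_budget(always_on_b: float, expert_b: float, n_experts: int, k: int,
               bytes_per_param: int = 2) -> Dict[str, float]:
    """Parameter counts in billions; VRAM in GB for the weights alone."""
    total = always_on_b + n_experts * expert_b
    active = always_on_b + k * expert_b
    return {
        "total_b": total,
        "active_b": active,
        "active_fraction": active / total,
        "vram_gb": total * bytes_per_param,  # 1e9 params * bytes = GB
    }

# Solve the two-equation system implied by total = 47, active = 13, N = 8, k = 2.
expert_b = (47 - 13) / (8 - 2)       # about 5.67B per expert, summed over layers
always_on_b = 13 - 2 * expert_b      # about 1.67B of attention, embeddings, etc.

mixtral = moe_budget(always_on_b, expert_b, n_experts=8, k=2)

# Doubling N grows memory but leaves per-token compute untouched.
wider = moe_budget(always_on_b, expert_b, n_experts=16, k=2)
```

`mixtral` recovers 47B total, 13B active, an active fraction of about 28%, and roughly 94 GB of weights. `wider` keeps the same 13B active while the total, and with it the memory bill, nearly doubles.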

07 / 08
The levers · and the honest catches

What sparsity costs you.

Every MoE design knob is a trade. The mental model: MoE buys capacity with memory, not compute. Whether that's a win depends on whether you're FLOP-bound or memory-bound.

Lever | What it buys | What it costs
More experts (N) | more total capacity / knowledge | more VRAM, harder balancing, more all-to-all comms
Higher top-k | richer per-token mixing, better quality | more active FLOPs + routing overhead
Fine-grained experts | combinatorial routing flexibility | more routing decisions, scheduling complexity
Shared experts | less redundant relearning of common knowledge | a floor of always-paid compute
Higher capacity factor | fewer dropped tokens | wasted compute / memory on padding
Strong aux-loss α | even expert utilization | drags on task quality (→ V3's loss-free fix)

THE MEMORY CATCH IS BRUTAL

Only k experts run per token — but all of them must sit in VRAM, because across a batch every expert gets used. Mixtral activates 13B but you must hold the full 47B. MoE saves compute, not memory.

And the routers are finicky: collapse, oscillation, and exp() overflow are all real. Top-k selection is a discrete decision and is not differentiable, so noisy gating, aux losses, z-loss, and loss-free bias control are all ways of training around that.

"EXPERT" IS ASPIRATIONAL

Studies often find routing correlates with surface features (token IDs, syntax) more than clean semantic domains. The hospital-specialist intuition is a teaching aid, not a guarantee of interpretable specialization.

Reasoning vs. knowledge: at matched active-params, MoEs tend to shine on knowledge-heavy tasks and lag dense models on some reasoning tasks — specialists store facts; reasoning seems to want depth and shared computation. Plausible, not fully settled.

02 · 06 — you made it

You understand the sparse model.

The hospital, the gated sum, noisy top-k routing, load-balancing collapse and its cures, DeepSeek's fine-grained / shared / loss-free tricks, and the active-vs-total budget that makes it all worth it. You now know why a 671B model can cost about as much per-token compute as a 37B dense model, while still needing memory for all 671B. That's the lever frontier scaling rides on.
