OpenAlice Academy — 02 · 06 / Mixture-of-Experts

01 / 08

Intuition first · the hospital

One genius doctor, or a roster of specialists?

A dense network is one doctor who memorised all of medicine and sees every patient. A Mixture-of-Experts hospital puts a triage nurse at the door and a roster of specialists behind it — and every patient only pays for the two doctors they actually see.

THE NURSE

the router / gate

A tiny learned layer glances at each token and decides which specialists it needs. Cheap to run, but its judgment is now critical — a bad nurse starves the hospital.

THE SPECIALISTS ★

the experts

Many smaller feed-forward networks. The hospital collectively knows the sum of all of them, but any one visit touches only the top-k.

THE WIN

knowledge ≠ cost

Add specialists to grow expertise without making any single visit slower. Total parameters and per-token FLOPs are now decoupled — conditional, sparse computation.

DON'T CONFUSE IT WITH MIXTURE-OF-AGENTS

MoE routes tokens to weights — sub-networks inside one model, in a single forward pass. Mixture-of-Agents routes prompts to whole models — several complete LLMs answer and get aggregated. Both have "a router" and "experts," but they live at completely different abstraction levels.

The nurse analogy also exposes the catch that drives half of MoE research: if the nurse sends everyone to the cardiologist, the other specialists never learn and the cardiologist is overwhelmed. That rich-get-richer collapse is the failure mode every MoE has to fight — you'll watch it happen, and get fixed, in §04.

Concretely: an MoE layer is a drop-in swap for the standard feed-forward block of a transformer. Attention stays exactly as it is — only the FFN becomes N experts + a router.

02 / 08

The mechanism · a gated sum

Swap one FFN for N FFNs plus a gate.

Mathematically an MoE layer is just a weighted sum over experts. The whole magic is that the gate G(x) is sparse — almost every entry is exactly zero, so almost every expert is skipped.

y = Σ_{i=1..N} G(x)ᵢ · FFNᵢ(x)
// G(x) is SPARSE: most entries are 0
// if G(x)ᵢ = 0 → FFNᵢ never runs

Compare to a dense layer, which is just y = FFN(x) — one network, run in full, every time. The MoE form looks heavier, but because almost all G(x)ᵢ are zero, you only ever evaluate the k experts with nonzero gate. That is where the FLOPs savings come from.

A residual connection always wraps the layer, so even a token that its chosen experts drop (more on that in §04) still passes through unchanged. The model never loses the token entirely.
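The gated sum and its residual path can be sketched in a few lines of plain Python. This is a toy illustration, not a real layer: the "experts" are hypothetical scalar functions standing in for FFNs, and the names `moe_layer`, `experts`, and `gate` are made up for this example.

```python
def moe_layer(x, experts, gate):
    """Residual + sparse gated sum: y = x + sum_i gate[i] * experts[i](x).

    Experts whose gate weight is exactly zero are never called.
    """
    calls = []          # indices of the experts that actually ran
    y = x               # residual path: the token always survives
    for i, (g, ffn) in enumerate(zip(gate, experts)):
        if g == 0.0:
            continue    # a skipped expert costs no compute
        calls.append(i)
        y += g * ffn(x)
    return y, calls


# Toy "experts": expert i multiplies its input by (i + 1). Purely illustrative.
experts = [lambda v, s=i + 1: s * v for i in range(8)]
gate = [0.0, 0.0, 0.75, 0.0, 0.0, 0.25, 0.0, 0.0]   # sparse, sums to 1

y, calls = moe_layer(2.0, experts, gate)
print(y, calls)   # 9.5 [2, 5]

# A token with an all-zero gate (every expert dropped it) rides the residual.
print(moe_layer(2.0, experts, [0.0] * 8))   # (2.0, [])
```

Only two of the eight experts are called, which is the whole source of the compute saving; the other six exist but do no work for this token.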

FIG.02 — DENSE FFN vs MoE LAYER · same slot in the transformer block

Left: one fat FFN, always fully evaluated. Right: a router taps just 2 of the experts — the greyed, dashed experts are resident in memory but idle this token.

03 / 08

The core algorithm · noisy top-k gating

Drive the router yourself.

The router is a tiny linear layer + softmax. To make it sparse you keep only the top-k logits and send the rest to −∞. Add a touch of noise for exploration and balancing. Pick a token, set k, toggle the noise — watch the gate select experts live.

FIG.03 — LIVE NOISY TOP-k GATE · 8 EXPERTS

router probability per expert · solid bar = selected

TOKEN — its affinity to each expert changes the routing

" the" " 2017" "def " " 日本" " 🍣"

TOP-k · how many experts run per token

k=1 k=4

// noisy top-k gating (Shazeer 2017)
1. H = x·W_g + 𝒩(0,1)·softplus(x·W_n)
2. KeepTopK(H, k) → keep top-k, rest = −∞
3. G(x) = softmax( KeepTopK(H, k) )
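The three steps above can be sketched for a single token in stdlib Python. This is a minimal sketch under assumptions: the inputs are the already-computed products x·W_g and x·W_n (passed as plain lists), and the function name `noisy_top_k_gate` and the example logits are hypothetical.

```python
import math
import random


def softplus(z):
    # Numerically stable log(1 + exp(z)).
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def noisy_top_k_gate(clean_logits, noise_logits, k, rng=None):
    """Return the sparse gate G(x) for one token.

    clean_logits: x @ W_g, one value per expert.
    noise_logits: x @ W_n, one value per expert (sets the noise scale).
    rng: pass a random.Random to enable noise; None means noise off.
    """
    h = []
    for c, n in zip(clean_logits, noise_logits):
        eps = rng.gauss(0.0, 1.0) if rng is not None else 0.0
        h.append(c + eps * softplus(n))                       # step 1
    # Step 2, KeepTopK: indices of the k largest (noisy) logits.
    top = sorted(range(len(h)), key=lambda i: h[i], reverse=True)[:k]
    # Step 3: softmax over the survivors; the rest are exp(-inf) = 0.
    m = max(h[i] for i in top)
    exps = {i: math.exp(h[i] - m) for i in top}
    z = sum(exps.values())
    return [exps[i] / z if i in exps else 0.0 for i in range(len(h))]


logits = [0.1, 2.0, -1.0, 0.3, 1.0, -0.5, 0.0, 0.2]   # hypothetical x @ W_g
gate = noisy_top_k_gate(logits, [0.0] * 8, k=2)       # noise off
print([round(g, 3) for g in gate])
# [0.0, 0.731, 0.0, 0.0, 0.269, 0.0, 0.0, 0.0]
```

With noise on (pass `rng=random.Random(0)`, for example), near-tied experts occasionally swap places, which is the exploration the text describes.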

04 / 08

The central problem · rich-get-richer

Watch the router collapse — then fix it.

Left alone, a few experts hog all the traffic: they train faster, become more attractive, attract even more tokens, and the rest atrophy. You pay for N experts but use 2. The cure is an auxiliary load-balancing loss. Run the simulation with the aux-loss off, then on.

FIG.04 — ROUTING SIMULATION · 8 EXPERTS · tokens routed per step

AUX-LOSS OFF — collapse

STEP: 0

EXPERTS DOING REAL WORK: 8 / 8

TOKENS DROPPED (over capacity): 0

LOAD IMBALANCE (max ÷ mean): 1.0×

// auxiliary load-balancing loss
L_aux = α · N · Σ_i f_i · P_i
f_i = fraction of tokens routed to expert i (hard)
P_i = mean router probability for expert i (soft)

The sum Σᵢ fᵢ·Pᵢ is smallest when load is uniform (fᵢ ≈ Pᵢ ≈ 1/N, which gives L_aux ≈ α), so minimizing it nudges the router toward an even split without dictating which token goes where. Pᵢ is the differentiable handle that gradients actually push on; fᵢ is the discrete count it shadows.

α stays tiny (≈0.01). Too large and the balancing term overrides the real task loss and hurts quality. A second term, the router z-loss, penalizes oversized router logits to keep the gate's exp() from overflowing; it is purely a training-stability measure.
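Both losses can be computed by hand on a tiny batch. The sketch below makes two assumptions that go beyond the text: fᵢ is normalized by k so that it sums to 1 (conventions differ between papers), and the z-loss takes the mean squared log-sum-exp of the router logits, which is the form we recall from the ST-MoE paper. The function names are hypothetical.

```python
import math


def load_balancing_loss(router_probs, k, alpha=0.01):
    """L_aux = alpha * N * sum_i f_i * P_i over a batch of tokens.

    router_probs: one probability vector per token (each sums to 1).
    A token counts toward f_i if expert i is in that token's top-k.
    """
    n_tokens, n_experts = len(router_probs), len(router_probs[0])
    f = [0.0] * n_experts
    p = [0.0] * n_experts
    for probs in router_probs:
        top = sorted(range(n_experts), key=lambda i: probs[i], reverse=True)[:k]
        for i in top:
            f[i] += 1.0 / (n_tokens * k)   # hard share of routing slots
        for i in range(n_experts):
            p[i] += probs[i] / n_tokens    # soft mean probability
    return alpha * n_experts * sum(fi * pi for fi, pi in zip(f, p))


def router_z_loss(router_logits):
    """Mean squared log-sum-exp of the logits (assumed ST-MoE-style form)."""
    total = 0.0
    for logits in router_logits:
        m = max(logits)
        lse = m + math.log(sum(math.exp(v - m) for v in logits))
        total += lse ** 2
    return total / len(router_logits)


# Four tokens, four experts, top-1 routing.
balanced = [[0.7, 0.1, 0.1, 0.1], [0.1, 0.7, 0.1, 0.1],
            [0.1, 0.1, 0.7, 0.1], [0.1, 0.1, 0.1, 0.7]]
collapsed = [[0.7, 0.1, 0.1, 0.1]] * 4     # everyone picks expert 0

loss_balanced = load_balancing_loss(balanced, k=1)     # about 0.010 (= alpha)
loss_collapsed = load_balancing_loss(collapsed, k=1)   # about 0.028
```

The collapsed batch costs 2.8 times the balanced one, and because Pᵢ carries the gradient, descending on this loss lowers the router's probability for the overloaded expert.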

When an expert overflows its fixed capacity buffer, the extra tokens are dropped — they skip the MoE and ride the residual through unchanged. A quiet quality leak, and a strong reason balance matters.
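Token dropping is easy to reproduce with a toy dispatcher. The capacity formula used here (capacity factor × tokens ÷ experts, rounded up) is a common convention and an assumption on our part; the text above does not state one. Routing is top-1 for simplicity, and `dispatch_with_capacity` is a hypothetical name.

```python
import math


def dispatch_with_capacity(assignments, n_experts, capacity_factor=1.0):
    """Fill per-expert buffers in token order; overflow tokens are dropped.

    assignments: the chosen expert index for each token (top-1 routing).
    Returns (per-expert token lists, dropped token indices, capacity).
    """
    capacity = math.ceil(capacity_factor * len(assignments) / n_experts)
    buffers = [[] for _ in range(n_experts)]
    dropped = []
    for token, expert in enumerate(assignments):
        if len(buffers[expert]) < capacity:
            buffers[expert].append(token)
        else:
            dropped.append(token)   # skips the MoE, rides the residual
    return buffers, dropped, capacity


assignments = [0, 0, 0, 0, 0, 1, 2, 3]   # hypothetical skewed routing
buffers, dropped, capacity = dispatch_with_capacity(assignments, 4, 1.0)
print(capacity, dropped)   # 2 [2, 3, 4]

# A higher capacity factor drops fewer tokens but pads the other buffers.
_, dropped_125, capacity_125 = dispatch_with_capacity(assignments, 4, 1.25)
print(capacity_125, dropped_125)   # 3 [3, 4]
```

With perfectly balanced routing nothing would be dropped at capacity factor 1.0, which is why balance and token dropping are two views of the same problem.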

05 / 08

State of the art · DeepSeekMoE & V3

Fine-grained, shared, and loss-free.

DeepSeek refined the recipe with three ideas the frontier adopted: split experts smaller, keep a few always on, and balance load without an auxiliary loss at all.

FINE-GRAINED

more, smaller experts

Split each expert into m smaller ones (shrink the hidden dim to 1/m of its size) and activate m times as many. Active FLOPs are unchanged, but the number of possible expert combinations explodes combinatorially, which allows far richer routing.

SHARED EXPERTS ★

always-on common ground

Reserve a few experts that run for every token, absorbing common knowledge (grammar, general patterns) so the routed experts don't each redundantly relearn it.

LOSS-FREE BALANCE

a bias control loop

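V3 drops the aux loss as its main balancing tool (it keeps only a very small sequence-wise balance loss as a safeguard). Each expert gets a bias bᵢ added only to the selection score. The bias is not trained by gradients: a controller nudges it down when the expert is overloaded and up when it is underloaded, so no balancing gradient fights the task.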
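The "explode combinatorially" claim in the fine-grained card is quick to check. A small sketch, assuming a baseline of 16 experts with top-2 routing and a split factor m = 4; the hidden width of 4096 is a hypothetical number chosen only to show that active width is unchanged.

```python
from math import comb


def routing_combinations(n_experts, k):
    """Number of distinct expert sets a single token can be routed to."""
    return comb(n_experts, k)


m = 4                                           # split factor
coarse = routing_combinations(16, 2)            # 16 experts, top-2
fine = routing_combinations(16 * m, 2 * m)      # 64 experts, top-8
print(coarse, fine)   # 120 4426165368

# Active width per token is the same in both setups.
hidden = 4096                                   # hypothetical expert hidden dim
active_coarse = 2 * hidden
active_fine = (2 * m) * (hidden // m)
print(active_coarse == active_fine)   # True
```

Same compute per token, roughly 37 million times as many ways to combine experts.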

// DeepSeekMoE layer: shared + routed, with residual
h_t = u_t
    + Σ_{i=1..Ks} FFN_i(u_t)               // shared, always on
    + Σ_{i=Ks+1..mN} g_{i,t} · FFN_i(u_t)  // routed top-(mK−Ks)
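Read as code, the formula is a residual plus two sums. The sketch below uses hypothetical scalar functions in place of real FFNs and takes the sparse gate values as given; `deepseek_moe_layer` is an illustrative name, not DeepSeek's implementation.

```python
def deepseek_moe_layer(u, shared, routed, gates):
    """h = u + sum(shared FFNs) + sum(g_i * routed FFN_i), with sparse gates."""
    h = u                                     # residual
    h += sum(ffn(u) for ffn in shared)        # shared experts: always on
    h += sum(g * ffn(u)                       # routed experts: only nonzero gates run
             for g, ffn in zip(gates, routed) if g != 0.0)
    return h


shared = [lambda v: 0.5 * v]                                   # Ks = 1 toy shared expert
routed = [lambda v, s=s: s * v for s in (1.0, 2.0, 3.0, 4.0)]  # 4 toy routed experts
gates = [0.0, 0.75, 0.0, 0.25]                                 # top-2 of the routed experts

h = deepseek_moe_layer(2.0, shared, routed, gates)
print(h)   # 8.0
```

The shared expert has no gate at all: it contributes to every token, which is the "floor of always-paid compute" the trade-off table refers to.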

DeepSeek-V3 ships 256 routed + 1 shared expert, top-8 routing, and the loss-free balancer — yielding 671B total / 37B active parameters. The MoE is only half the story; it's wrapped with Multi-head Latent Attention (KV-cache compression) and multi-token prediction.

WHY LOSS-FREE WINS

An auxiliary loss is a regularizer that fights the task loss, so balance and quality trade off. V3's bias bᵢ only steers which experts get picked, never the gate weight that scales the output, so the controller can equalize load with little or no quality penalty.
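A toy control loop shows both halves of that argument: the bias changes who is selected, while the gate weight comes from the raw score alone. Everything here is a sketch under assumptions: top-1 routing, a fixed batch of synthetic scores in which expert 0 starts out over-attractive, a hypothetical step size of 0.005, and gate weights normalized over the selected raw scores.

```python
import random


def select_top_k(scores, bias, k):
    """Pick experts by score + bias, but weight them by the raw score only."""
    top = sorted(range(len(scores)),
                 key=lambda i: scores[i] + bias[i], reverse=True)[:k]
    total = sum(scores[i] for i in top)
    return {i: scores[i] / total for i in top}


def update_bias(bias, load, step):
    """Nudge overloaded experts down and underloaded experts up."""
    mean = sum(load) / len(load)
    return [b - step if n > mean else b + step if n < mean else b
            for b, n in zip(bias, load)]


def imbalance(load):
    return max(load) / (sum(load) / len(load))   # max ÷ mean


def simulate(n_tokens=1000, steps=100, step=0.005, seed=0):
    rng = random.Random(seed)
    base = [0.5, 0.3, 0.3, 0.3]                  # expert 0 is the "cardiologist"
    tokens = [[b + rng.uniform(0.0, 0.4) for b in base] for _ in range(n_tokens)]
    bias = [0.0] * len(base)
    history = []
    for _ in range(steps):
        load = [0] * len(base)
        for scores in tokens:
            for i in select_top_k(scores, bias, k=1):
                load[i] += 1
        history.append(imbalance(load))
        bias = update_bias(bias, load, step)     # no gradients involved
    return history, bias


history, bias = simulate()
print(round(history[0], 2), round(history[-1], 2))   # imbalance before vs after
```

No loss term is added anywhere in the loop; the controller only reads load counts, which is why it cannot pull the task gradient in a competing direction.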

Reported results: DeepSeekMoE 16B ≈ Llama-2-7B at ~40% of the compute; the 2B variant matched GShard 2.9B, which has 1.5× the expert parameters and computation. Sparsity, spent well.

06 / 08

Build the budget · active vs total

Total params are big. Active params are small.

This is the number that makes MoE matter. Dial N and k and watch total capacity grow while the per-token compute barely moves — then load a real frontier preset and see the gap.

FIG.06 — PARAMETER BUDGET · total (held in VRAM) vs active (FLOPs paid per token)

PRESETS: Dense 7B · Mixtral 8×7B · DeepSeek-V3 · Switch-C

EXPERTS · N: 8

ACTIVE · k: 2

TOTAL PARAMS: 47B

ACTIVE / TOKEN: 13B

SPARSITY (active ÷ total): 28%

VRAM TO SERVE: ~94 GB

The catch, baked in: VRAM tracks TOTAL, FLOPs track ACTIVE. MoE buys capacity with memory, not compute — every expert must stay resident even though only k run per token.
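The readout above can be reproduced with a two-term budget model: parameters every token uses (attention, embeddings) plus N experts, of which k are active. The model is a simplifying assumption, and the split between shared and per-expert parameters is backed out from the 47B / 13B figures on this page rather than taken from a published breakdown. VRAM assumes 2 bytes per parameter (16-bit weights) and counts weights only.

```python
def moe_budget(shared_b, expert_b, n_experts, k, bytes_per_param=2):
    """Total vs active parameters (in billions) and weight memory in GB."""
    total = shared_b + n_experts * expert_b
    active = shared_b + k * expert_b
    return {
        "total_b": total,
        "active_b": active,
        "active_fraction": active / total,
        "vram_gb": total * bytes_per_param,   # 1e9 params × bytes = GB
    }


# Solve the two equations total = s + 8e and active = s + 2e.
expert_b = (47 - 13) / (8 - 2)     # about 5.67B per expert, all layers combined
shared_b = 13 - 2 * expert_b       # about 1.67B used by every token
budget = moe_budget(shared_b, expert_b, n_experts=8, k=2)
print({key: round(val, 2) for key, val in budget.items()})

# Doubling N grows total capacity and VRAM; active compute does not move.
bigger = moe_budget(shared_b, expert_b, n_experts=16, k=2)
```

In this model, going from 8 to 16 experts raises the total to roughly 92B and weight memory to about 185 GB, while the active count stays at 13B.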

07 / 08

The levers · and the honest catches

What sparsity costs you.

Every MoE design knob is a trade. The mental model: MoE buys capacity with memory, not compute. Whether that's a win depends on whether you're FLOP-bound or memory-bound.

Lever	What it buys	What it costs
More experts (N)	more total capacity / knowledge	more VRAM, harder balancing, more all-to-all comms
Higher top-k	richer per-token mixing, better quality	more active FLOPs + routing overhead
Fine-grained experts	combinatorial routing flexibility	more routing decisions, scheduling complexity
Shared experts	less redundant relearning of common knowledge	a floor of always-paid compute
Higher capacity factor	fewer dropped tokens	wasted compute / memory on padding
Strong aux-loss α	even expert utilization	drags on task quality (→ V3's loss-free fix)

THE MEMORY CATCH IS BRUTAL

Only k experts run per token — but all of them must sit in VRAM, because across a batch every expert gets used. Mixtral activates 13B but you must hold the full 47B. MoE saves compute, not memory.

And the routers are finicky: collapse, oscillation, and exp() overflow are all real. Noisy gating, aux losses, z-loss, and loss-free bias control are all patches on a fundamentally discrete decision problem: top-k selection isn't differentiable, so we train around it.

"EXPERT" IS ASPIRATIONAL

Studies often find routing correlates with surface features (token IDs, syntax) more than clean semantic domains. The hospital-specialist intuition is a teaching aid, not a guarantee of interpretable specialization.

Reasoning vs. knowledge: at matched active-params, MoEs tend to shine on knowledge-heavy tasks and lag dense models on some reasoning tasks — specialists store facts; reasoning seems to want depth and shared computation. Plausible, not fully settled.

02 · 06 — you made it

You understand
the sparse model.

The hospital, the gated sum, noisy top-k routing, load-balancing collapse and its cures, DeepSeek's fine-grained / shared / loss-free tricks, and the active-vs-total budget that makes it all worth it. You now know why a 671B model can cost about as much per token as a 37B model to run. That's the lever frontier scaling rides on.

