A dense network makes every token pay the full cost of all its knowledge. A Mixture-of-Experts layer splits that one big feed-forward network into many smaller experts, and a tiny learned router sends each token to just a couple of them. Huge total capacity, tiny active compute — the trick behind Mixtral and DeepSeek-V3.
One token vector flows in. The router scores all N experts, keeps the top-k, and zeroes the rest. Only the chosen experts compute. Their outputs are combined as a gate-weighted sum — the dormant experts cost nothing this turn.
A dense network is one doctor who memorised all of medicine and sees every patient. A Mixture-of-Experts hospital puts a triage nurse at the door and a roster of specialists behind it — and every patient only pays for the two doctors they actually see.
A tiny learned layer glances at each token and decides which specialists it needs. Cheap to run, but its judgment is now critical — a bad nurse starves the hospital.
Many smaller feed-forward networks. The hospital collectively knows the sum of all of them, but any one visit touches only the top-k.
Add specialists to grow expertise without making any single visit slower. Total parameters and per-token FLOPs are now decoupled — conditional, sparse computation.
MoE routes tokens to weights — sub-networks inside one model, in a single forward pass. Mixture-of-Agents routes prompts to whole models — several complete LLMs answer and get aggregated. Both have "a router" and "experts," but they live at completely different abstraction levels.
The nurse analogy also exposes the catch that drives half of MoE research: if the nurse sends everyone to the cardiologist, the other specialists never learn and the cardiologist is overwhelmed. That rich-get-richer collapse is the failure mode every MoE has to fight; you'll watch it happen, and then watch it get fixed, in §04.
Concretely: an MoE layer is a drop-in swap for the standard feed-forward block of a transformer. Attention stays exactly as it is — only the FFN becomes N experts + a router.
Mathematically an MoE layer is just a weighted sum over experts. The whole magic is that the gate G(x) is sparse — almost every entry is exactly zero, so almost every expert is skipped.
Compare to a dense layer, which is just y = FFN(x) — one network, run in full, every time. The MoE form looks heavier, but because almost all G(x)ᵢ are zero, you only ever evaluate the k experts with nonzero gate. That is where the FLOPs savings come from.
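The gated sum is short enough to sketch in plain Python. This is a minimal illustration, not any particular model's implementation: the experts are toy stand-in functions, and `moe_forward`, `make_scaling_expert`, and the evaluation counter are hypothetical names introduced here to make the skipped work visible.

```python
def moe_forward(x, experts, gates):
    """Compute y = sum_i gates[i] * experts[i](x), skipping experts whose gate is zero.

    x       : list of floats (one token vector)
    experts : list of callables mapping a vector to a vector
    gates   : list of floats, mostly exact zeros (the sparse gate G(x))
    Returns (y, number_of_experts_evaluated).
    """
    y = [0.0] * len(x)
    evaluated = 0
    for gate, expert in zip(gates, experts):
        if gate == 0.0:
            continue  # dormant expert: never called, so it costs no FLOPs
        out = expert(x)
        evaluated += 1
        y = [acc + gate * o for acc, o in zip(y, out)]
    return y, evaluated


def make_scaling_expert(scale):
    # Toy stand-in for a feed-forward expert: multiplies the vector by a constant.
    return lambda v: [scale * a for a in v]


experts = [make_scaling_expert(s) for s in (1.0, 2.0, 3.0, 4.0)]
gates = [0.0, 0.75, 0.0, 0.25]  # top-2 of 4 experts, weights sum to 1
y, evaluated = moe_forward([1.0, -2.0], experts, gates)
print(y, evaluated)  # [2.5, -5.0] 2
```

Only two of the four experts are ever called, which is the whole source of the compute saving. A dense layer corresponds to the degenerate case of one expert with gate 1.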
And there's always a residual connection wrapping the layer — so even a token whose experts get dropped (more on that in §04) still passes through unchanged. The model never loses the token entirely.
Left: one fat FFN, always fully evaluated. Right: a router taps just 2 of the experts — the greyed, dashed experts are resident in memory but idle this token.
The router is a tiny linear layer + softmax. To make it sparse you keep only the top-k logits and send the rest to −∞. Add a touch of noise for exploration and balancing. Pick a token, set k, toggle the noise — watch the gate select experts live.
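The routing recipe above can be sketched as follows. Assumptions to flag: the router logits are passed in directly rather than computed by a linear layer, the noise is plain Gaussian with a fixed standard deviation (published noisy-gating schemes typically learn the noise scale), and `noisy_top_k_gate` is a hypothetical name.

```python
import math
import random


def noisy_top_k_gate(logits, k, noise_std=0.0, rng=None):
    """Return a sparse gate: softmax over the top-k logits, exact zeros elsewhere."""
    rng = rng or random.Random(0)
    # Optional Gaussian noise on the logits (exploration / balancing).
    noisy = [z + rng.gauss(0.0, noise_std) if noise_std > 0 else z for z in logits]
    # Indices of the k largest (possibly noisy) logits.
    top = set(sorted(range(len(noisy)), key=lambda i: noisy[i], reverse=True)[:k])
    # Sending the rest to -inf makes their softmax weight exactly zero.
    masked = [noisy[i] if i in top else -math.inf for i in range(len(noisy))]
    m = max(masked)
    exps = [math.exp(z - m) for z in masked]  # math.exp(-inf) == 0.0
    total = sum(exps)
    return [e / total for e in exps]


gate = noisy_top_k_gate([2.0, 1.0, 0.5, -1.0], k=2)
print(gate)  # two nonzero weights that sum to 1, two exact zeros
```

Masking before the softmax, rather than zeroing after it, means the surviving weights are already normalized over the chosen experts.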
Left alone, a few experts hog all the traffic: they train faster, become more attractive, attract even more tokens, and the rest atrophy. You pay for N experts but use 2. The cure is an auxiliary load-balancing loss. Run the simulation with the aux-loss off, then on.
Here fᵢ is the fraction of tokens actually routed to expert i and Pᵢ is the mean router probability assigned to it. The sum Σᵢ fᵢ·Pᵢ is smallest when load is uniform (fᵢ ≈ Pᵢ ≈ 1/N), so minimizing it nudges the router toward an even split without dictating which token goes where. Pᵢ is the differentiable handle gradients actually push on; fᵢ is the discrete count it shadows.
α stays tiny (≈0.01). Too large and the balancing term overrides the real task loss and hurts quality. A second term, the router z-loss, penalizes oversized router logits to keep the gate's exp() from overflowing; it is there purely for training stability.
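Both penalties fit in a few lines. The sketch below assumes the Switch-Transformer-style form α · N · Σᵢ fᵢ·Pᵢ for the balancing term and the mean squared log-sum-exp for the z-loss, with top-1 assignments for simplicity. The exact scaling on the original page is not shown here, so treat the constants as assumptions; the function names are hypothetical.

```python
import math


def load_balancing_loss(router_probs, assignments, num_experts, alpha=0.01):
    """alpha * N * sum_i f_i * P_i (assumed Switch-style scaling).

    router_probs : per-token softmax distributions over experts
    assignments  : per-token chosen expert index (top-1 for simplicity)
    """
    tokens = len(router_probs)
    # f_i: discrete fraction of tokens sent to expert i (no gradient flows here).
    f = [assignments.count(i) / tokens for i in range(num_experts)]
    # P_i: mean router probability for expert i (the differentiable part).
    p = [sum(row[i] for row in router_probs) / tokens for i in range(num_experts)]
    return alpha * num_experts * sum(fi * pi for fi, pi in zip(f, p))


def router_z_loss(router_logits):
    """Mean squared log-sum-exp of the logits; large logits are penalized."""
    total = 0.0
    for row in router_logits:
        m = max(row)
        lse = m + math.log(sum(math.exp(z - m) for z in row))
        total += lse ** 2
    return total / len(router_logits)


balanced = load_balancing_loss([[0.25] * 4] * 4, [0, 1, 2, 3], num_experts=4)
collapsed = load_balancing_loss([[0.97, 0.01, 0.01, 0.01]] * 4, [0, 0, 0, 0], num_experts=4)
print(balanced, collapsed)  # the collapsed router pays a larger penalty
```

With the N factor included, a perfectly uniform router scores exactly α, which makes the value easy to read during training.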
When an expert overflows its fixed capacity buffer, the extra tokens are dropped — they skip the MoE and ride the residual through unchanged. A quiet quality leak, and a strong reason balance matters.
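Token dropping is easy to reproduce. The sketch assumes the common convention capacity = ceil(capacity_factor · tokens / N) with top-1 routing and first-come-first-served dispatch; real systems differ in the details, and `dispatch_with_capacity` is a hypothetical helper.

```python
import math


def dispatch_with_capacity(assignments, num_experts, capacity_factor=1.0):
    """Assign tokens to experts in order; tokens beyond an expert's capacity are dropped.

    assignments : per-token chosen expert index (top-1 for simplicity)
    Returns (kept, dropped) lists of token indices.
    """
    capacity = math.ceil(capacity_factor * len(assignments) / num_experts)
    load = [0] * num_experts
    kept, dropped = [], []
    for token, expert in enumerate(assignments):
        if load[expert] < capacity:
            load[expert] += 1
            kept.append(token)
        else:
            dropped.append(token)  # skips the MoE layer; the residual carries it through
    return kept, dropped


# Eight tokens, four experts, and a router that over-uses expert 0.
skewed = [0, 0, 0, 0, 0, 1, 2, 3]
print(dispatch_with_capacity(skewed, num_experts=4))                       # ([0, 1, 5, 6, 7], [2, 3, 4])
print(dispatch_with_capacity(skewed, num_experts=4, capacity_factor=2.5))  # ([0, 1, 2, 3, 4, 5, 6, 7], [])
```

Raising the capacity factor rescues the dropped tokens, but every expert's buffer grows with it, which is the padding cost listed in the trade-off table.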
DeepSeek refined the recipe with three ideas the frontier adopted: split experts smaller, keep a few always on, and balance load without an auxiliary loss at all.
Split each expert into m smaller ones (shrink each one's hidden dimension to 1/m of the original) and activate m times as many. Active FLOPs are unchanged, but the number of ways to combine experts explodes combinatorially, which allows far richer routing.
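The combinatorial claim can be checked directly. The sketch assumes that splitting each expert into m pieces goes together with activating m times as many, so the active parameter count stays fixed; the 16-expert, top-2 starting point is just an illustrative configuration, and `routing_combinations` is a hypothetical name.

```python
from math import comb


def routing_combinations(num_experts, top_k, split=1):
    """Number of distinct expert subsets a token can be routed to when each
    expert is split into `split` pieces and `split` times as many are activated."""
    return comb(num_experts * split, top_k * split)


coarse = routing_combinations(16, 2)         # choose 2 of 16
fine = routing_combinations(16, 2, split=4)  # choose 8 of 64
print(coarse, fine)  # 120 4426165368
```

Same active compute, but the router goes from 120 possible expert combinations to more than four billion.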
Reserve a few experts that run for every token, absorbing common knowledge (grammar, general patterns) so the routed experts don't each redundantly relearn it.
V3 drops the aux loss. Each expert gets a learnable bias bᵢ added only to the selection score; a controller nudges it up/down to even out load — no gradient fighting the task.
DeepSeek-V3 ships 256 routed + 1 shared expert, top-8 routing, and the loss-free balancer — yielding 671B total / 37B active parameters. The MoE is only half the story; it's wrapped with Multi-head Latent Attention (KV-cache compression) and multi-token prediction.
An auxiliary loss is a regularizer fighting the task loss — balance and quality trade off. V3's bias bᵢ only steers who gets picked, never the gate weight that scales the output, so the controller can equalize load with no quality penalty.
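The separation between "who gets picked" and "how much they count" can be sketched as below. Assumptions: the router scores are positive affinities, the gate is the raw score renormalized over the chosen experts, and the controller moves each bias by a fixed step toward balance. The step size and the function names are hypothetical, not values taken from DeepSeek-V3.

```python
def select_experts(scores, bias, k):
    """Pick top-k by (score + bias), but compute gate weights from raw scores only."""
    chosen = sorted(range(len(scores)), key=lambda i: scores[i] + bias[i], reverse=True)[:k]
    norm = sum(scores[i] for i in chosen)
    # The bias never touches the weights that scale expert outputs.
    return {i: scores[i] / norm for i in chosen}


def update_bias(bias, load, step=0.001):
    """Nudge overloaded experts down and underloaded experts up by a fixed step."""
    mean_load = sum(load) / len(load)
    return [
        b - step if count > mean_load else b + step if count < mean_load else b
        for b, count in zip(bias, load)
    ]


scores = [0.9, 0.8, 0.3, 0.1]
print(select_experts(scores, [0.0, 0.0, 0.0, 0.0], k=2))   # experts 0 and 1
print(select_experts(scores, [-1.0, 0.0, 0.0, 0.0], k=2))  # expert 0 is steered away: 1 and 2
print(update_bias([0.0] * 4, load=[10, 2, 2, 2], step=0.1))
```

Because the update is a rule applied outside the loss, no gradient from the balancing mechanism ever competes with the task gradient.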
Reported results: DeepSeekMoE 16B ≈ Llama-2-7B at ~40% of the compute; the 2B variant matched a GShard 2.9B, which has 1.5× the expert parameters. Sparsity, spent well.
This is the number that makes MoE matter. Dial N and k and watch total capacity grow while the per-token compute barely moves — then load a real frontier preset and see the gap.
The catch, baked in: VRAM tracks TOTAL, FLOPs track ACTIVE. MoE buys capacity with memory, not compute — every expert must stay resident even though only k run per token.
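A back-of-the-envelope version of the total-versus-active split, as a sketch. It lumps everything that runs for every token (attention, embeddings, routers, shared experts) into one number, and the figures in the example are made-up round numbers in billions, not any real model's configuration; `moe_param_budget` is a hypothetical helper.

```python
def moe_param_budget(expert_params, num_experts, top_k, shared_params=0.0):
    """Return (total, active) parameter counts for a simplified MoE model.

    expert_params : parameters in one routed expert, summed over all layers
    shared_params : everything that runs for every token
    """
    total = shared_params + num_experts * expert_params  # must stay resident in memory
    active = shared_params + top_k * expert_params       # actually computed per token
    return total, active


# Hypothetical sizes in billions of parameters.
print(moe_param_budget(1.0, num_experts=8, top_k=2, shared_params=2.0))   # (10.0, 4.0)
print(moe_param_budget(1.0, num_experts=64, top_k=2, shared_params=2.0))  # (66.0, 4.0)
```

Going from 8 to 64 experts multiplies the memory footprint while the per-token compute does not move at all.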
Every MoE design knob is a trade. The mental model: MoE buys capacity with memory, not compute. Whether that's a win depends on whether you're FLOP-bound or memory-bound.
| Lever | What it buys | What it costs |
|---|---|---|
| More experts (N) | more total capacity / knowledge | more VRAM, harder balancing, more all-to-all comms |
| Higher top-k | richer per-token mixing, better quality | more active FLOPs + routing overhead |
| Fine-grained experts | combinatorial routing flexibility | more routing decisions, scheduling complexity |
| Shared experts | less redundant relearning of common knowledge | a floor of always-paid compute |
| Higher capacity factor | fewer dropped tokens | wasted compute / memory on padding |
| Strong aux-loss α | even expert utilization | drags on task quality (→ V3's loss-free fix) |
Only k experts run per token, but all of them must sit in VRAM, because across a batch every expert gets used. Mixtral 8x7B activates about 13B parameters per token, yet you must hold all 47B or so in memory. MoE saves compute, not memory.
And the routers are finicky: collapse, oscillation, and exp() overflow are all real. Noisy gating, aux losses, z-loss, and loss-free bias control are all patches on a fundamentally discrete decision problem: top-k selection isn't differentiable, so we train around it.
Studies often find routing correlates with surface features (token IDs, syntax) more than clean semantic domains. The hospital-specialist intuition is a teaching aid, not a guarantee of interpretable specialization.
Reasoning vs. knowledge: at matched active-params, MoEs tend to shine on knowledge-heavy tasks and lag dense models on some reasoning tasks — specialists store facts; reasoning seems to want depth and shared computation. Plausible, not fully settled.
The hospital, the gated sum, noisy top-k routing, load-balancing collapse and its cures, DeepSeek's fine-grained / shared / loss-free tricks, and the active-vs-total budget that makes it all worth it. You now know why a 671B model can cost about as much compute per token as a 37B model. That's the lever frontier scaling rides on.
MoE only replaces the feed-forward block. Go back to the part it leaves untouched: how attention lets every token look at every other token.