DeepSeek-V3 has 671 billion parameters but activates only 37 billion of them per token. It was trained for about $5.6M of rented compute and matches GPT-4-class quality. DeepSeek-R1 then turned that base into an o1-class reasoner using reinforcement learning with no human reasoning labels. This is not one trick. It is a stack of cost cuts that compound.
Each trick attacks a different cost: MLA shrinks the KV-cache memory, MoE cuts the FLOPs per token, MTP densifies the training signal, FP8 halves the bit-width, and GRPO removes the RL critic and the human labels. Because they're roughly independent, they multiply.
Every modern LLM is a tall stack of two blocks: attention (tokens look at each other) and a feed-forward network (each token thinks alone). Most labs make both bigger. DeepSeek's bet is different: stay enormous in capacity, but tiny in what you actually pay per token.
To generate, a transformer must remember a Key + Value for every past token, head, and layer. At long context this dominates memory and bandwidth. MLA compresses it ~40–60×.
A 671B dense model multiplies all 671B weights for every token. DeepSeekMoE routes each token to a few small experts — so V3 activates only 37B of 671B.
Instead of paying humans to write step-by-step solutions, reward the model only on whether the final answer is correct. R1 grows its own chain-of-thought.
The lineage tells the story. V2 introduced MLA + DeepSeekMoE at 236B total / 21B active parameters. V3 scaled it to 671B/37B and bolted on Multi-Token Prediction and FP8 training, reaching GPT-4-class quality; the whole 14.8T-token run finished with no loss spike or rollback. R1 then took the V3 base and made it reason, purely through RL.
Standard attention caches a full Key + Value per token, per head, per layer. MLA caches one small latent vector instead, and reconstructs K and V on the fly.
V3 has 128 heads × 128 dims × 2 = 32,768 floats cached per token per layer in plain MHA. MLA caches just the 512-float latent c_KV.
The RoPE wrinkle. Rotary positions don't commute with the "absorbed" matmul, so DeepSeek splits each key into a content part (from the latent, no RoPE) plus a tiny decoupled RoPE part of 64 dims that carries position. Real cache = 512 + 64 = 576 floats per layer.
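The cache arithmetic above can be checked in a few lines. This is a back-of-envelope sketch, not a measurement: the head count, head size, latent size and RoPE size come from the text, while the layer count (61) and 2 bytes per cached value (BF16) are assumptions added here to turn the ratio into gigabytes.

```python
# Per-token, per-layer cache sizes taken from the text above.
N_HEADS = 128
HEAD_DIM = 128
LATENT_DIM = 512   # compressed latent c_KV
ROPE_DIM = 64      # decoupled RoPE key

# Assumptions for illustration only (not stated in the text above).
N_LAYERS = 61
BYTES_PER_VALUE = 2  # BF16


def mha_values_per_token_layer() -> int:
    """Plain multi-head attention: one K and one V vector per head."""
    return N_HEADS * HEAD_DIM * 2


def mla_values_per_token_layer() -> int:
    """MLA: one shared latent plus the small decoupled RoPE key."""
    return LATENT_DIM + ROPE_DIM


def cache_gib(values_per_token_layer: int, context_len: int) -> float:
    """Total cache for one sequence, in GiB, under the assumptions above."""
    total_bytes = values_per_token_layer * N_LAYERS * context_len * BYTES_PER_VALUE
    return total_bytes / 2**30


mha = mha_values_per_token_layer()
mla = mla_values_per_token_layer()
print(f"per token per layer: MHA={mha}, MLA={mla}, ratio={mha / mla:.1f}x")
for ctx in (4_096, 32_768, 131_072):
    print(f"context {ctx:>7}: MHA {cache_gib(mha, ctx):8.2f} GiB | MLA {cache_gib(mla, ctx):6.2f} GiB")
```

With these numbers the ratio is about 57×, inside the ~40–60× range quoted earlier. Under the assumed layer count and precision, a single 32K-token sequence needs 122 GiB of cache in plain MHA and about 2.1 GiB with MLA.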
V2 reported a 93.3% KV-cache cut and 5.76× higher max throughput vs the dense DeepSeek-67B. Weight absorption: pre-multiply W_UK into W_Q so you attend directly in latent space. That is what makes MLA faster, not just smaller.
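Weight absorption is just associativity of matrix multiplication, and a toy example makes that concrete. The sketch below uses tiny made-up dimensions and random weights, covers a single head, and ignores the decoupled RoPE part and the softmax scaling. The names `W_Q`, `W_UK` and `c_kv` follow the text; everything else is hypothetical.

```python
import random

rng = random.Random(0)
D_MODEL, D_HEAD, D_LATENT = 8, 4, 3  # toy sizes, not the real model's


def rand_matrix(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


def transpose(m):
    return [list(col) for col in zip(*m)]


def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def matmul(a, b):
    b_t = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in b_t] for row in a]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


W_Q = rand_matrix(D_HEAD, D_MODEL)     # hidden state -> query
W_UK = rand_matrix(D_HEAD, D_LATENT)   # cached latent -> key (up-projection)
h = [rng.uniform(-1, 1) for _ in range(D_MODEL)]      # current token's hidden state
c_kv = [rng.uniform(-1, 1) for _ in range(D_LATENT)]  # a cached latent

# Naive path: rebuild the full key from the latent, then take q . k
q = matvec(W_Q, h)
k = matvec(W_UK, c_kv)
naive_score = dot(q, k)

# Absorbed path: fold W_UK^T into W_Q once, then attend in latent space.
W_Q_absorbed = matmul(transpose(W_UK), W_Q)  # shape D_LATENT x D_MODEL
q_latent = matvec(W_Q_absorbed, h)
absorbed_score = dot(q_latent, c_kv)

print(naive_score, absorbed_score)
```

Both paths give the same attention logit up to floating-point rounding. The absorbed path never materialises the full key, so at decode time the model only touches the small cached latent.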
A Mixture-of-Experts replaces the single big FFN with many small expert FFNs and a router that sends each token to a few. DeepSeekMoE slices experts thin and always runs one shared expert for the common knowledge. In V3 the router picks 8 routed experts per token.
Instead of a few fat experts, use many thin ones — 256 of them. Slicing finer lets the router combine specialists more precisely: many more knowledge combinations for the same active-parameter budget.
Shared-expert isolation. Some knowledge — grammar, common formatting — is needed by every token. Forcing the router to re-learn it in every expert wastes capacity, so DeepSeekMoE carves out 1 shared expert that always runs alongside the routed 8. The shared expert holds the common stuff; the routed ones specialise.
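The routing described above fits in a short sketch. The expert counts (256 routed, top 8, 1 shared) come from the text; the scalar "experts", the random scores, and the choice to normalise gates over the selected experts are illustrative assumptions. The residual connection around the layer is left out.

```python
import random

N_ROUTED = 256
TOP_K = 8


def route(scores, k=TOP_K):
    """Pick the k highest-scoring experts; gates are normalised over the picks."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    total = sum(scores[i] for i in top)
    return {i: scores[i] / total for i in top}


def moe_forward(x, scores, routed_experts, shared_expert):
    """Shared expert always runs; routed experts run only if selected."""
    gates = route(scores)
    routed_out = sum(g * routed_experts[i](x) for i, g in gates.items())
    return shared_expert(x) + routed_out, gates


# Toy stand-ins: each "expert" is a scalar function instead of an FFN.
rng = random.Random(0)
routed_experts = [lambda x, w=rng.uniform(-1, 1): w * x for _ in range(N_ROUTED)]
shared_expert = lambda x: 0.5 * x
scores = [rng.random() for _ in range(N_ROUTED)]  # hypothetical router affinities

y, gates = moe_forward(2.0, scores, routed_experts, shared_expert)
print(f"ran {len(gates)} of {N_ROUTED} routed experts plus 1 shared expert")
```

Only the 8 selected expert functions are ever called, which is where the FLOP saving comes from. The other 248 still have to exist in memory.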
The cost shifts from FLOPs to memory + interconnect: "37B active" is the compute story, but you still need enough GPUs to hold all 671B in memory, which is why expert-parallelism and all-to-all routing comms exist.
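A rough count shows why holding the weights is the binding constraint. The parameter counts come from the text; the 80 GB per-GPU memory figure is an assumption, and the estimate ignores the KV cache, activations and any replication, so it is a floor rather than a deployment plan.

```python
TOTAL_PARAMS = 671 * 10**9
ACTIVE_PARAMS = 37 * 10**9
GPU_MEMORY_BYTES = 80 * 10**9  # assumption: 80 GB of memory per GPU


def min_gpus_for_weights(bytes_per_param: int) -> int:
    """Ceiling of (weight bytes / GPU memory), weights only."""
    weight_bytes = TOTAL_PARAMS * bytes_per_param
    return -(-weight_bytes // GPU_MEMORY_BYTES)  # integer ceiling division


active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS
print(f"active per token: {active_fraction:.1%} of all weights")
print("GPUs needed just for weights at 1 byte/param:", min_gpus_for_weights(1))
print("GPUs needed just for weights at 2 bytes/param:", min_gpus_for_weights(2))
```

Only about 5.5% of the weights do work on any one token, yet under these assumptions the weights alone fill at least 9 GPUs at one byte per parameter, or 17 at two.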
MoEs collapse if the router falls in love with a few favourite experts. The classic fix is an auxiliary loss, but that loss fights the language objective and hurts quality. V3 drops it and uses a tiny control loop on a per-expert bias.
The key subtlety: the bias b_i is added only to the routing score used for selection — never to the gate value that actually weights the expert's output. So overloaded experts get quietly down-ranked in selection while their contribution weights stay clean.
Balance is enforced without a gradient that corrupts the loss. The cost: one control-loop hyperparameter γ, the "bias update speed" — and it's a heuristic, not a guarantee.
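A toy simulation shows the control loop at work. It is a sketch under stated assumptions, not the production scheme: 8 experts with top-2 routing, a synthetic score distribution that favours experts 0 and 1, and a sign-based update that nudges each bias by γ against the batch's mean load. The two points taken from the text are that the bias affects selection only and that γ sets the update speed.

```python
import random

N_EXPERTS, TOP_K, BATCH, GAMMA = 8, 2, 512, 0.01  # toy sizes


def select_experts(scores, bias, k=TOP_K):
    """Rank by score + bias, but compute gates from the raw scores only."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i] + bias[i], reverse=True)
    chosen = ranked[:k]
    total = sum(scores[i] for i in chosen)
    return {i: scores[i] / total for i in chosen}


def update_bias(bias, load, gamma=GAMMA):
    """Push overloaded experts down and underloaded experts up by gamma."""
    mean_load = sum(load) / len(load)
    new_bias = []
    for b, n in zip(bias, load):
        if n > mean_load:
            new_bias.append(b - gamma)
        elif n < mean_load:
            new_bias.append(b + gamma)
        else:
            new_bias.append(b)
    return new_bias


def run(steps, seed=0):
    rng = random.Random(seed)
    popularity = [1.0, 0.9] + [0.5] * (N_EXPERTS - 2)  # two "favourite" experts
    bias = [0.0] * N_EXPERTS
    max_loads = []
    for _ in range(steps):
        load = [0] * N_EXPERTS
        for _ in range(BATCH):
            scores = [p * rng.uniform(0.5, 1.0) for p in popularity]
            for i in select_experts(scores, bias):
                load[i] += 1
        max_loads.append(max(load))
        bias = update_bias(bias, load)
    return max_loads


max_loads = run(steps=200)
ideal = BATCH * TOP_K // N_EXPERTS
print("busiest expert, first step:", max_loads[0])
print("busiest expert, last step: ", max_loads[-1], "(ideal:", ideal, ")")
```

No gradient flows through the bias, so the language-modelling loss is untouched. The load still wobbles around the ideal from step to step, which fits the "heuristic, not a guarantee" caveat.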
V2 still used device-limited routing + small balance losses; V3's auxiliary-loss-free scheme is the cleaner successor.
Two more independent cost-cuts. Multi-Token Prediction squeezes more learning from every token by predicting the next few. FP8 training halves the bit-width of the heavy matmuls.
Standard LLMs predict one next token per position. MTP adds small sequential modules — each with its own output head but sharing the trunk — that predict the next few. Two payoffs: a denser learning signal (more targets per token of the 14.8T corpus), and free speculative decoding (the extra heads propose tokens a verify-pass accepts or rejects).
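The "denser signal" claim is easy to see by counting training targets. The helper below is a hypothetical illustration of how targets line up, not the model code: with `mtp_depth=0` each position has one target (the next token), and each extra depth adds one further-ahead target wherever the sequence is long enough.

```python
def prediction_targets(tokens, mtp_depth):
    """For each position i, list the future tokens it is trained to predict.

    Index 0 of each list is the ordinary next-token target; later entries
    are the extra targets supplied by the MTP modules.
    """
    targets = []
    for i in range(len(tokens) - 1):
        targets.append(tokens[i + 1 : i + 2 + mtp_depth])
    return targets


tokens = ["The", "cat", "sat", "on", "mats"]
plain = prediction_targets(tokens, mtp_depth=0)
with_mtp = prediction_targets(tokens, mtp_depth=1)

for tok, tgt in zip(tokens, with_mtp):
    print(f"{tok!r:>6} predicts {tgt}")
print("targets without MTP:", sum(len(t) for t in plain))
print("targets with one MTP module:", sum(len(t) for t in with_mtp))
```

On this five-token example the target count goes from 4 to 7 without reading any new text.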
FP8 has tiny dynamic range, so naïvely it overflows. DeepSeek's recipe: fine-grained tile/block scaling (1×128 activation tiles, 128×128 weight blocks), BF16/FP32 islands for the sensitive parts (embeddings, output head, norms, softmax, master weights), and high-precision accumulation of the FP8 products.
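A small simulation shows why the scaling granularity matters. This is a sketch with a crude stand-in for FP8 rounding (3 mantissa bits, smallest step 2^-9, saturation at 448, roughly the E4M3 format), not a bit-exact implementation. The 128-wide tile comes from the text; the helper names and the test data are made up.

```python
import math

E4M3_MAX = 448.0  # largest finite magnitude in the E4M3 format
TILE = 128        # 1x128 activation tile from the text


def fake_e4m3(x: float) -> float:
    """Approximate E4M3 rounding: 3 mantissa bits, subnormal step 2**-9."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), E4M3_MAX)
    _, e = math.frexp(mag)            # mag = m * 2**e with 0.5 <= m < 1
    step = 2.0 ** max(e - 4, -9)      # spacing of representable values near mag
    return sign * min(round(mag / step) * step, E4M3_MAX)


def quantize_dequantize(values, tile):
    """Scale each tile so its max maps to E4M3_MAX, round, then undo the scale."""
    out = []
    for start in range(0, len(values), tile):
        chunk = values[start : start + tile]
        amax = max(abs(v) for v in chunk)
        scale = E4M3_MAX / amax if amax > 0 else 1.0
        out.extend(fake_e4m3(v * scale) / scale for v in chunk)
    return out


# One outlier (1000.0) sits in the second tile of an otherwise small-valued row.
row = [0.001] * TILE + [1000.0] + [0.001] * (TILE - 1)

per_tensor = quantize_dequantize(row, tile=len(row))  # one scale for everything
per_tile = quantize_dequantize(row, tile=TILE)        # one scale per 128 values

print("first value, per-tensor scale:", per_tensor[0])
print("first value, per-tile scale:  ", per_tile[0])
```

With a single scale the outlier forces every small value to round to zero. With per-tile scales the damage stays inside the outlier's own tile and the first tile survives intact. The 128×128 weight blocks apply the same idea in two dimensions.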
Combined with the DualPipe schedule — which overlaps computation with the all-to-all MoE-routing communication — this is how the run finished in ~2.79M H800-hours with no loss spikes or rollbacks. MTP modules can be dropped at inference, or reused for speculation.
PPO, the usual RLHF algorithm, needs a separate critic network the same size as the policy. GRPO deletes it: sample a group of G answers to one prompt, score them, and use the group's own mean and spread as the baseline.
The advantage is relative: "was this answer better or worse than its siblings on the same prompt?" — exactly the signal you want, and it needs no learned value model, halving the RL memory.
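The group-relative baseline is a two-line computation. The sketch below shows only the advantage step, with two assumptions made here: the population standard deviation, and a small `eps` to guard against a group where every answer scores the same. The clipped policy update and KL term of the full algorithm are not shown.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Normalise each reward against its own group's mean and spread."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# G = 8 sampled answers to one prompt; 1 = correct, 0 = wrong.
rewards = [1, 0, 0, 1, 0, 0, 0, 1]
advantages = group_advantages(rewards)
for r, a in zip(rewards, advantages):
    print(f"reward {r} -> advantage {a:+.3f}")
```

Here the three correct answers get an advantage of about +1.29 and the five wrong ones about -0.77. If every answer in a group earns the same reward, all advantages are zero and that prompt contributes no learning signal.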
R1 mostly avoids a learned reward model (which can be reward-hacked). For verifiable tasks the reward is mechanical: accuracy (does the boxed answer match? do the unit tests pass in a sandbox?) plus a format reward (did it wrap thinking in <think>…</think>?).
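A mechanical reward of this kind can be a handful of string checks. The functions below are a hypothetical sketch: the `<think>` tags and the boxed-answer convention come from the text, while the regexes, the exact-match comparison and the equal weighting of the two rewards are assumptions. Code tasks would swap the accuracy check for a sandboxed test run.

```python
import re

# Completion must open with a non-empty <think>...</think> block.
THINK_RE = re.compile(r"^\s*<think>.+?</think>", re.DOTALL)
# Captures the contents of \boxed{...} (no nested braces in this sketch).
BOXED_RE = re.compile(r"\\boxed\{([^{}]*)\}")


def format_reward(completion: str) -> float:
    return 1.0 if THINK_RE.search(completion) else 0.0


def accuracy_reward(completion: str, reference: str) -> float:
    """Exact match on the last boxed answer after the thinking block."""
    answer_part = completion.split("</think>")[-1]
    boxed = BOXED_RE.findall(answer_part)
    return 1.0 if boxed and boxed[-1].strip() == reference.strip() else 0.0


def total_reward(completion: str, reference: str) -> float:
    # Equal weighting is an assumption made for this sketch.
    return accuracy_reward(completion, reference) + format_reward(completion)


good = "<think>2 + 2 is 4.</think> The answer is \\boxed{4}."
wrong = "<think>2 + 2 is 5?</think> The answer is \\boxed{5}."
unformatted = "The answer is \\boxed{4}."

for name, text in [("good", good), ("wrong", wrong), ("unformatted", unformatted)]:
    print(name, total_reward(text, "4"))
```

Nothing here is learned, so there is no reward model to exploit. The same rigidity is the limitation: `\boxed{4.0}` would score zero against a reference of `4` unless the comparison were made smarter.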
The "aha moment": GRPO applied to the V3 base with no SFT at all gave R1-Zero, which spontaneously learned to write "wait, let me reconsider," backtrack, and verify itself — AIME climbed ~15.6% → ~71% during RL. The shipped R1 adds a 4-stage pipeline (cold-start SFT → reasoning RL → rejection-sample + SFT → final RL) to fix R1-Zero's language-mixing and readability.
No single equation is "the DeepSeek architecture." Each trick attacks a different cost axis, and because they're roughly independent they stack multiplicatively. Here is the whole ledger — what each buys, and what it costs.
| Idea | Cost axis | Buys you | Costs you |
|---|---|---|---|
| MLA | KV-cache memory | ~40–60× smaller cache, big long-context throughput | extra projections; decoupled-RoPE complexity; latent read twice |
| DeepSeekMoE | FLOPs / token | huge capacity, tiny active compute, precise specialization | all-to-all comms; must hold all experts in memory |
| Aux-loss-free | balancing | balance without a loss that hurts quality | adds the γ heuristic; not a guarantee |
| MTP | training signal | denser signal + free speculative decoding | extra modules during training; modest, not magical |
| FP8 | bit-width | ~½ memory, faster GEMMs, lower $$ | needs fine-grained scaling + BF16 islands; brittle if naïve |
| GRPO | RL infrastructure | no critic network (½ RL memory); clean relative signal | needs G samples / prompt; wants verifiable rewards |
| Rule-based RL | human labels | no expensive reasoning labels; resists reward hacking | only where the answer is mechanically checkable |
The $5.6M figure covers the final training run, not the total cost. It excludes prior R&D, failed runs, data, salaries, and the hardware itself. It's real and impressive, but it is not "anyone can build GPT-4 for $6M."
MLA's win is context-dependent (it shines at long context + high batch; bandwidth savings are ~half the memory savings). Serving a 671B MoE is not cheap — "37B active" is FLOPs, but you still hold all 671B in GPU memory.
R1-Zero's magic needs a checkable answer. For open-ended generation (essays, dialogue, taste) you're back to learned reward models or human preference — which can be reward-hacked. R1's later stages reintroduce SFT and general alignment for exactly this reason.
Distillation works: SFT small dense models (Qwen / Llama, 1.5B→70B) on R1's outputs and reasoning transfers cheaply — distilled-Qwen-32B beats much larger models, and the paper finds distillation-from-a-strong-reasoner beats running RL directly on the small model. Reproducing FP8 + DualPipe is genuinely hard — the RL recipe was reproduced far faster than the pre-training stack.
MLA squeezed the KV cache. DeepSeekMoE cut the FLOPs. Aux-loss-free balancing kept the router honest. MTP and FP8 densified the signal and halved the bits. GRPO and rule-based rewards grew reasoning with no critic and no labels. Five independent cost-cuts, multiplied. That is how a $5.6M run rivalled the frontier.
MLA is an attention variant and decoupled RoPE is a positional-encoding move. Go back to the block they both rewrite — the heart of every modern LLM.