openalicelabs / academy
COURSE SYS-02 LESSON 02 · 06 TOPIC DEEPSEEK ARCH EST. READ ~13 MIN

More model, less bill.

DeepSeek-V3 has 671 billion parameters but spends only 37 billion per token. It was trained for about $5.6M of rented compute and matches GPT-4-class quality — then DeepSeek-R1 turned that base into an o1-class reasoner using reinforcement learning with no human reasoning labels. This is not one trick. It is a stack of cost-cuts that compound.

FIG.00 — THE EFFICIENCY STACK
FIG.0A — FIVE INDEPENDENT COST AXES · MEMORY · FLOPS · SIGNAL · BITS · RL

Each trick attacks a different cost: MLA shrinks the KV-cache memory, MoE cuts the FLOPs per token, MTP densifies the training signal, FP8 halves the bit-width, and GRPO removes the RL critic and the human labels. Because they're roughly independent, they multiply.

LINEAGE: V2 (236B/21B) → V3 (671B/37B) → R1 (reasoning)
V3 TRAIN COST: ~2.79M H800 GPU-hours ≈ $5.6M (final run)
PRE-TRAIN DATA: 14.8 trillion tokens, no loss spikes
ATTENTION: MLA · KV cache cut ~93% vs dense 67B
FFN: 1 shared + 8 of 256 routed experts per token
REASONING: GRPO RL · rule-based rewards · the "aha moment"
01 / 07
The premise · capacity vs price-per-token

Keep the model huge. Pay for it thin.

Every modern LLM is a tall stack of two blocks: attention (tokens look at each other) and a feed-forward network (each token thinks alone). Most labs make both bigger. DeepSeek's bet is different: stay enormous in capacity, but tiny in what you actually pay per token.

OBSERVATION 1

the KV cache is the tax

To generate, a transformer must remember a Key + Value for every past token, head, and layer. At long context this dominates memory and bandwidth. MLA compresses it ~40–60×.

OBSERVATION 2

dense FFNs waste compute

A 671B dense model multiplies all 671B weights for every token. DeepSeekMoE routes each token to a few small experts — so V3 activates only 37B of 671B.

OBSERVATION 3

reasoning can be learned

Instead of paying humans to write step-by-step solutions, reward the model only on whether the final answer is correct. R1 grows its own chain-of-thought.

dense 671B model → 671B params active / token // you pay for all of it
DeepSeek-V3 → 37B params active / token // ~18× fewer FLOPs, same capacity

The lineage tells the story. V2 introduced MLA + DeepSeekMoE at 236B/21B. V3 scaled it to 671B/37B and bolted on Multi-Token Prediction and FP8 training, reaching GPT-4-class quality while the whole 14.8T-token run completed with no irrecoverable loss spike and no rollback. R1 then took the V3 base and made it reason, with RL doing most of the work.

02 / 07
Cheaper attention · compress the memory

Multi-head Latent Attention.

Standard attention caches a full Key + Value per token, per head, per layer. MLA caches one small latent vector instead, and reconstructs K and V on the fly. Drag the context length and watch the cache blow up — then collapse.

FIG.02 — KV-CACHE PER TOKEN · MHA vs MLA · V3 CONFIG
[Interactive figure: context-length slider from short to 128K. MHA caches 32,768 values per token per layer; MLA caches 576. Readouts: MHA cache (total), MLA cache (total), reduction factor.]
// down-project, cache the latent, unfold heads later
c_KV = h_t · W_DKV // 7168 → d_c = 512 ← cached
k_C = c_KV · W_UK // up-project to per-head keys
v_C = c_KV · W_UV // up-project to per-head values

V3 has 128 heads × 128 dims × 2 = 32,768 floats cached per token per layer in plain MHA. MLA caches just the 512-float latent c_KV.

The RoPE wrinkle. Rotary positions don't commute with the "absorbed" matmul, so DeepSeek splits each key into a content part (from the latent, no RoPE) plus a tiny decoupled RoPE part of 64 dims that carries position. Real cache = 512 + 64 = 576 floats per token per layer.
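The per-token counts above only become a memory bill once they are multiplied by layers and context length. The following back-of-envelope sketch does that multiplication. It assumes 61 layers and 2 bytes per cached value (BF16); neither number is stated in this lesson, and the function name `cache_gib` is purely illustrative.

```python
# Back-of-envelope KV-cache size, using the per-token counts from the text.
# ASSUMPTIONS (not from the lesson): 61 layers, 2 bytes per cached value (BF16).
N_LAYERS = 61
BYTES_PER_VALUE = 2

MHA_PER_TOKEN_LAYER = 128 * 128 * 2   # heads x dims x (K and V) = 32,768
MLA_PER_TOKEN_LAYER = 512 + 64        # latent c_KV + decoupled RoPE key = 576


def cache_gib(context_len: int, per_token_layer: int) -> float:
    """Total cache for ONE sequence of `context_len` tokens, in GiB."""
    total_bytes = context_len * N_LAYERS * per_token_layer * BYTES_PER_VALUE
    return total_bytes / 2**30


for ctx in (4_096, 32_768, 131_072):
    mha = cache_gib(ctx, MHA_PER_TOKEN_LAYER)
    mla = cache_gib(ctx, MLA_PER_TOKEN_LAYER)
    print(f"{ctx:>7} tokens: MHA {mha:7.1f} GiB | MLA {mla:5.2f} GiB | {mha / mla:.1f}x")
```

The ratio is the same at every context length (32,768 / 576); what changes is whether the absolute number still fits next to the weights on a GPU.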

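V2 reported a 93.3% KV-cache cut and 5.76× higher max throughput vs the dense DeepSeek-67B. That cut is roughly 15×, smaller than the ~57× per-layer ratio against plain MHA, largely because the 67B baseline already used grouped-query attention. Weight absorption: pre-multiply W_UK into W_Q so you attend directly in latent space. That is what makes MLA faster, not just smaller.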
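Weight absorption is just associativity of matrix multiplication, which a few lines of NumPy can confirm. This is a toy sketch for a single head and the content (non-RoPE) part of the key only. The sizes and variable names are made up for illustration, and the real model adds details such as query compression that are ignored here.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_c, d_head = 64, 16, 8      # toy sizes (the text's V3 numbers: 7168, 512, 128)

W_Q = rng.normal(size=(d_model, d_head))    # query projection for one head
W_UK = rng.normal(size=(d_c, d_head))       # up-projection: latent -> per-head key

h_t = rng.normal(size=(d_model,))           # hidden state of the current token
c_KV = rng.normal(size=(10, d_c))           # cached latents for 10 past tokens

# Naive path: rebuild every key from its latent, then take dot products.
keys = c_KV @ W_UK                          # shape (10, d_head)
scores_naive = keys @ (h_t @ W_Q)           # shape (10,)

# Absorbed path: fold W_UK into the query side once, then score against latents.
W_Q_absorbed = W_Q @ W_UK.T                 # shape (d_model, d_c)
scores_absorbed = c_KV @ (h_t @ W_Q_absorbed)

assert np.allclose(scores_naive, scores_absorbed)
print("max difference:", np.abs(scores_naive - scores_absorbed).max())
```

In the absorbed path the per-head keys are never materialised: the query is mapped into the latent space once and compared against the cached latents directly.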

03 / 07
Cheaper FFN · fine-grained experts

One shared expert, eight of 256.

A Mixture-of-Experts replaces the single big FFN with many small expert FFNs and a router that sends each token to a few. DeepSeekMoE slices experts thin and always runs one shared expert for the common knowledge. Click a token and watch the router light up its 8.

FIG.03 — V3 MoE LAYER · 1 SHARED + TOP-8 of 256 ROUTED
[Interactive figure: click a token to route it. The shared expert S always runs and holds grammar & formatting; the router lights up the top-8 of 256 routed experts for that token.]

EXPERTS PER TOKEN: 1 shared + 8 routed
TOTAL PARAMETERS: 671B
ACTIVE PER TOKEN: 37B
FINE-GRAINED SEGMENTATION

Instead of a few fat experts, use many thin ones — 256 of them. Slicing finer lets the router combine specialists more precisely: many more knowledge combinations for the same active-parameter budget.

Shared-expert isolation. Some knowledge — grammar, common formatting — is needed by every token. Forcing the router to re-learn it in every expert wastes capacity, so DeepSeekMoE carves out 1 shared expert that always runs alongside the routed 8. The shared expert holds the common stuff; the routed ones specialise.

The cost shifts from FLOPs to memory + interconnect: "37B active" is the compute story, but you still need enough GPUs to hold all 671B in memory, which is why expert-parallelism and all-to-all routing comms exist.
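The routing described in this section can be sketched for a single token in NumPy. Everything here is a toy under stated assumptions: each "expert" is one small linear map rather than a real FFN, the router uses sigmoid scores normalised over the chosen experts, and all names (`moe_layer`, `router_centroids`) are hypothetical. Only the counts, 1 shared plus top-8 of 256, come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
D_MODEL, N_ROUTED, TOP_K = 16, 256, 8   # toy hidden size; expert counts from the text

# Hypothetical toy experts: one linear map each instead of a full FFN.
shared_expert = rng.normal(scale=0.1, size=(D_MODEL, D_MODEL))
routed_experts = rng.normal(scale=0.1, size=(N_ROUTED, D_MODEL, D_MODEL))
router_centroids = rng.normal(size=(N_ROUTED, D_MODEL))


def moe_layer(x: np.ndarray):
    """Route one token vector x; return the layer output and the chosen expert ids."""
    affinity = 1.0 / (1.0 + np.exp(-(router_centroids @ x)))   # sigmoid scores (assumed)
    chosen = np.argsort(affinity)[-TOP_K:]                     # top-8 of 256
    gates = affinity[chosen] / affinity[chosen].sum()          # normalise over the chosen 8
    routed_out = sum(g * (x @ routed_experts[i]) for g, i in zip(gates, chosen))
    # Residual + shared expert (always runs) + weighted routed experts.
    return x + x @ shared_expert + routed_out, chosen


token = rng.normal(size=D_MODEL)
out, chosen = moe_layer(token)
print("experts used:", sorted(chosen.tolist()))
print("fraction of routed experts touched:", TOP_K / N_ROUTED)
```

Only 8 of the 256 routed weight matrices are multiplied for this token, but all 256 must be resident somewhere, which is the FLOPs-versus-memory trade the paragraph above describes.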

04 / 07
V3's headline trick · balance without a loss

Auxiliary-loss-free balancing.

MoEs collapse if the router falls in love with a few favourite experts. The classic fix is an auxiliary loss — but that loss fights the language objective and hurts quality. V3 drops it and uses a tiny control loop on a per-expert bias. Press a step and watch the load even out.

FIG.04 — LOAD PER EXPERT · BIAS CONTROL LOOP
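[Interactive figure: tokens per expert across 8 experts, stepped through the bias control loop. Ink bars mark overloaded experts, rose marks the balanced target, and the dashed line is the even split. Readouts: imbalance (max − min), overloaded experts.]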
// after each step, nudge the selection bias b_i
if expert i overloaded: b_i ← b_i − γ
if expert i underloaded: b_i ← b_i + γ
// b_i shifts only the top-K SELECTION,
// NOT the gate value that weights the output

The key subtlety: the bias b_i is added only to the routing score used for selection — never to the gate value that actually weights the expert's output. So overloaded experts get quietly down-ranked in selection while their contribution weights stay clean.

Balance is enforced without a gradient that corrupts the loss. The cost: one control-loop hyperparameter γ, the "bias update speed" — and it's a heuristic, not a guarantee.

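V2 still used device-limited routing + small balance losses; V3's auxiliary-loss-free scheme is the cleaner successor, though V3 still keeps a very small sequence-wise balance loss as a safeguard against extreme imbalance within a single sequence.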
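The control loop is simple enough to simulate. The sketch below is a toy: 8 experts, top-2 routing, random scores with a built-in skew, and a sign-based update with step γ. None of these specific values come from the lesson, and the names are illustrative. It only demonstrates that nudging a selection-only bias pulls the load toward an even split.

```python
import numpy as np

rng = np.random.default_rng(0)
N_EXPERTS, TOP_K, N_TOKENS = 8, 2, 4096   # toy sizes (assumptions)
GAMMA = 0.01                              # bias update speed (toy value)

skew = np.linspace(0.0, 1.0, N_EXPERTS)   # the router "prefers" higher-numbered experts
bias = np.zeros(N_EXPERTS)
target = N_TOKENS * TOP_K / N_EXPERTS     # even split


def route_step(bias: np.ndarray) -> np.ndarray:
    """Route one batch; return tokens-per-expert."""
    scores = rng.normal(size=(N_TOKENS, N_EXPERTS)) + skew
    # The bias enters only the top-K selection. Gate values that weight the
    # expert outputs would still be computed from `scores`, without the bias.
    chosen = np.argsort(scores + bias, axis=1)[:, -TOP_K:]
    return np.bincount(chosen.ravel(), minlength=N_EXPERTS)


history = []
for step in range(300):
    load = route_step(bias)
    history.append(int(load.max() - load.min()))
    bias -= GAMMA * np.sign(load - target)   # overloaded: down, underloaded: up

print("imbalance at step 0:", history[0], "| at step 299:", history[-1])
```

The imbalance never reaches zero: sampling noise and the fixed step size leave a residual wobble, which is the "heuristic, not a guarantee" caveat in practice.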

05 / 07
Denser signal · cheaper arithmetic

Predict several tokens. In 8 bits.

Two more independent cost-cuts. Multi-Token Prediction squeezes more learning from every token by predicting the next few. FP8 training halves the bit-width of the heavy matmuls.

FIG.05 — MTP · 1 vs N PREDICTION TARGETS PER POSITION
[Interactive figure: prediction-depth slider from depth 1 to depth 4. Readouts: prediction depth, targets per position, extra use from same data.]

Standard LLMs predict one next token per position. MTP adds small sequential modules that predict further ahead. In V3 each module has its own Transformer block and projection but shares the embedding and output head with the main model. Two payoffs: a denser learning signal (more targets per token of the 14.8T corpus), and free speculative decoding (the extra modules propose tokens that a verify pass accepts or rejects).
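To make "more targets per position" concrete, here is a minimal sketch of how the training targets change with prediction depth. It works on plain token strings and drops trailing positions that lack a full set of targets. The function name `mtp_targets` is hypothetical, and the real modules operate on hidden states rather than strings.

```python
def mtp_targets(tokens, depth):
    """For each position i, list the next `depth` tokens it is trained to predict."""
    targets = []
    for i in range(len(tokens) - depth):
        targets.append(tokens[i + 1 : i + 1 + depth])
    return targets


tokens = ["The", "cat", "sat", "on", "the", "mat"]

for depth in (1, 2):
    per_position = mtp_targets(tokens, depth)
    total = sum(len(t) for t in per_position)
    print(f"depth {depth}: {total} targets ->", per_position)
```

With depth 1 this is ordinary next-token prediction; raising the depth gives each position additional supervised targets from the same sequence.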

FP8 — 8-BIT TRAINING, DONE CAREFULLY

FP8 has tiny dynamic range, so naïvely it overflows or underflows. DeepSeek's recipe: fine-grained tile/block scaling (1×128 activation tiles, 128×128 weight blocks), BF16/FP32 islands for the sensitive parts (embeddings, output head, norms, softmax, master weights), and high-precision accumulation of the FP8 products.
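Why fine-grained scaling matters can be shown with a simulated FP8 round-trip. NumPy has no FP8 type, so the sketch below fakes E4M3 rounding (4 significant bits, maximum 448, subnormal step 2^-9) and ignores NaN handling and other details. The helper names and the injected outlier are made up for illustration. It compares one scale for the whole tensor against one scale per 128-value tile.

```python
import numpy as np

FP8_MAX = 448.0              # largest E4M3 magnitude
MIN_NORMAL = 2.0 ** -6       # smallest normal E4M3 magnitude
SUBNORMAL_STEP = 2.0 ** -9   # spacing of E4M3 subnormals


def fake_fp8(x: np.ndarray) -> np.ndarray:
    """Crude E4M3 rounding simulation (an assumption, not a real FP8 kernel)."""
    x = np.clip(x, -FP8_MAX, FP8_MAX)
    mant, exp = np.frexp(x)                             # x = mant * 2**exp
    normal = np.ldexp(np.round(mant * 16) / 16, exp)    # keep 4 significant bits
    subnormal = np.round(x / SUBNORMAL_STEP) * SUBNORMAL_STEP
    return np.where(np.abs(x) < MIN_NORMAL, subnormal, normal)


def quant_dequant(x: np.ndarray, tile: int) -> np.ndarray:
    """Scale each tile so its max maps to FP8_MAX, round, then scale back."""
    tiles = x.reshape(-1, tile)
    amax = np.maximum(np.abs(tiles).max(axis=1, keepdims=True), 1e-12)
    scale = FP8_MAX / amax
    return (fake_fp8(tiles * scale) / scale).reshape(x.shape)


rng = np.random.default_rng(0)
acts = rng.normal(size=4096)
acts[7] = 1e5                                   # one large outlier activation

err_tensor = np.abs(quant_dequant(acts, tile=acts.size) - acts).mean()
err_tile = np.abs(quant_dequant(acts, tile=128) - acts).mean()
print(f"per-tensor scale error: {err_tensor:.4f} | 1x128 tile error: {err_tile:.4f}")
```

With a single scale, the outlier forces every ordinary value down into the subnormal range; with per-tile scales, only the one tile containing the outlier pays that price.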

Combined with the DualPipe schedule — which overlaps computation with the all-to-all MoE-routing communication — this is how the run finished in ~2.79M H800-hours with no loss spikes or rollbacks. MTP modules can be dropped at inference, or reused for speculation.

06 / 07
Learning to reason · no critic, no labels

R1 learns to think with GRPO.

PPO — the usual RLHF algorithm — needs a separate critic network the same size as the policy. GRPO deletes it: sample a group of G answers to one prompt, score them, and use the group's own mean and spread as the baseline. Roll a fresh group and watch the advantages fall out.

FIG.06 — GRPO · GROUP-RELATIVE ADVANTAGE · ONE PROMPT
[Interactive figure: G = 6 rollouts for one prompt. Readouts: group mean reward, group std, samples pushed up / down.]
// group-relative advantage, no value network
Â_i = ( r_i − mean(r) ) / ( std(r) + δ ) // δ: small constant for numerical stability
// every token in output i shares Â_i
// clipped PPO surrogate, KL subtracted in the LOSS
J = E[ min( ρ·Â , clip(ρ, 1−ε, 1+ε)·Â ) − β·D_KL ] // ρ: new/old policy probability ratio

The advantage is relative: "was this answer better or worse than its siblings on the same prompt?" — exactly the signal you want, and it needs no learned value model, halving the RL memory.
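The group-relative advantage and the clipped term fit in a few lines. This is a minimal sketch with made-up 0/1 rewards for G = 6 rollouts. The function names, the stabilising constant, and the clip range 0.2 are illustrative assumptions, and the KL penalty is left out.

```python
import numpy as np


def group_relative_advantages(rewards, stab=1e-6):
    """Advantage of each rollout relative to its siblings on the same prompt."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + stab)


def clipped_term(ratio, advantage, clip_eps=0.2):
    """Per-sample clipped surrogate: min(rho * A, clip(rho, 1-eps, 1+eps) * A)."""
    clipped_ratio = float(np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)


rewards = [1, 0, 0, 1, 0, 0]            # G = 6 rollouts; 1 = correct final answer (made up)
adv = group_relative_advantages(rewards)
print("advantages:", np.round(adv, 3))

# If every rollout gets the same reward, the group carries no learning signal.
print("all-correct group:", group_relative_advantages([1, 1, 1, 1, 1, 1]))
```

The second print shows a practical consequence: prompts the model always solves (or never solves) contribute zero advantage, so the useful prompts are the ones with mixed outcomes.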

RULE-BASED REWARDS (RLVR)

R1 mostly avoids a learned reward model (which can be reward-hacked). For verifiable tasks the reward is mechanical: accuracy (does the boxed answer match? do the unit tests pass in a sandbox?) plus a format reward (did it wrap thinking in <think>…</think>?).
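A rule-based reward of this kind can be written with nothing but string checks. The sketch below is hypothetical: the regular expressions, the `\boxed{...}` answer convention, and the 1.0 / 0.1 weights are illustrative choices, not R1's actual reward code, and it covers only exact-match answers (no sandboxed unit tests).

```python
import re

THINK_RE = re.compile(r"^<think>.+?</think>", re.DOTALL)   # reasoning wrapped in tags
BOXED_RE = re.compile(r"\\boxed\{([^{}]*)\}")              # final answer in \boxed{...}


def rule_based_reward(completion: str, reference: str) -> float:
    """Accuracy reward plus a small format reward (weights are made up)."""
    text = completion.strip()
    format_ok = THINK_RE.match(text) is not None
    boxed = BOXED_RE.findall(text)
    accurate = bool(boxed) and boxed[-1].strip() == reference.strip()
    return 1.0 * accurate + 0.1 * format_ok


samples = [
    r"<think>2 + 2 is 4.</think> The answer is \boxed{4}.",   # correct, well formatted
    r"The answer is \boxed{4}.",                              # correct, no think tags
    r"<think>Maybe 5?</think> The answer is \boxed{5}.",      # wrong, well formatted
]
for s in samples:
    print(rule_based_reward(s, "4"), "<-", s)
```

Because the check is mechanical, there is no learned scorer for the policy to exploit, but the same property limits it to tasks where a reference answer exists.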

The "aha moment": GRPO applied to the V3 base with no SFT at all gave R1-Zero, which spontaneously learned to write "wait, let me reconsider," backtrack, and verify itself — AIME climbed ~15.6% → ~71% during RL. The shipped R1 adds a 4-stage pipeline (cold-start SFT → reasoning RL → rejection-sample + SFT → final RL) to fix R1-Zero's language-mixing and readability.

07 / 07
Why it compounds · and the honest caveats

The stack, and what it really costs.

No single equation is "the DeepSeek architecture." Each trick attacks a different cost axis, and because they're roughly independent they stack multiplicatively. Here is the whole ledger — what each buys, and what it costs.

Idea | Cost axis | Buys you | Costs you
MLA | KV-cache memory | ~40–60× smaller cache, big long-context throughput | extra projections; decoupled-RoPE complexity; latent read twice
DeepSeekMoE | FLOPs / token | huge capacity, tiny active compute, precise specialization | all-to-all comms; must hold all experts in memory
Aux-loss-free | balancing | balance without a loss that hurts quality | adds the γ heuristic; not a guarantee
MTP | training signal | denser signal + free speculative decoding | extra modules during training; modest, not magical
FP8 | bit-width | ~½ memory, faster GEMMs, lower $$ | needs fine-grained scaling + BF16 islands; brittle if naïve
GRPO | RL infrastructure | no critic network (½ RL memory); clean relative signal | needs G samples / prompt; wants verifiable rewards
Rule-based RL | human labels | no expensive reasoning labels; resists reward hacking | only where the answer is mechanically checkable
THE HONEST "$5.6M"

That figure is the final training run, not total cost. It excludes prior R&D, failed runs, data, salaries, and the hardware itself. It's real and impressive — but it is not "anyone can build GPT-4 for $6M."
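The headline number is just GPU-hours times a rental rate, which makes clear how little it covers. A tiny sketch follows; the $2 per GPU-hour rate is an assumption not stated in this lesson, and the names are illustrative.

```python
GPU_HOURS = 2.79e6              # H800 GPU-hours for the final run (from the text)
RATE_USD_PER_GPU_HOUR = 2.0     # ASSUMPTION: rental price per GPU-hour


def run_cost_usd(gpu_hours: float, rate: float) -> float:
    """Rented-compute cost of one training run; nothing else is included."""
    return gpu_hours * rate


cost = run_cost_usd(GPU_HOURS, RATE_USD_PER_GPU_HOUR)
print(f"final-run compute: ${cost / 1e6:.2f}M")
```

Every excluded item (R&D, failed runs, data, salaries, owned hardware) would be a separate term that this product simply does not contain.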

MLA's win is context-dependent (it shines at long context + high batch; bandwidth savings are ~half the memory savings). Serving a 671B MoE is not cheap — "37B active" is FLOPs, but you still hold all 671B in GPU memory.

RULE-BASED RL HAS A HARD EDGE

R1-Zero's magic needs a checkable answer. For open-ended generation (essays, dialogue, taste) you're back to learned reward models or human preference — which can be reward-hacked. R1's later stages reintroduce SFT and general alignment for exactly this reason.

Distillation works: SFT small dense models (Qwen / Llama, 1.5B→70B) on R1's outputs and reasoning transfers cheaply — distilled-Qwen-32B beats much larger models, and the paper finds distillation-from-a-strong-reasoner beats running RL directly on the small model. Reproducing FP8 + DualPipe is genuinely hard — the RL recipe was reproduced far faster than the pre-training stack.

02 · 06 — you made it

You understand the whole stack.

MLA squeezed the KV cache. DeepSeekMoE cut the FLOPs. Aux-loss-free balancing kept the router honest. MTP and FP8 densified the signal and halved the bits. GRPO and rule-based rewards grew reasoning with no critic and no labels. Five independent cost-cuts, multiplied. That is how a $5.6M run rivalled the frontier.
