OpenAlice Academy — 02 · 06 / DeepSeek Architecture

01 / 07

The premise · capacity vs price-per-token

Keep the model huge. Pay for it thin.

Every modern LLM is a tall stack of two blocks: attention (tokens look at each other) and a feed-forward network (each token thinks alone). Most labs make both bigger. DeepSeek's bet is different: stay enormous in capacity, but tiny in what you actually pay per token.

OBSERVATION 1

the KV cache is the tax

To generate, a transformer must remember a Key + Value for every past token, head, and layer. At long context this dominates memory and bandwidth. MLA compresses it ~40–60×.

OBSERVATION 2

dense FFNs waste compute

A 671B dense model multiplies all 671B weights for every token. DeepSeekMoE routes each token to a few small experts — so V3 activates only 37B of 671B.

OBSERVATION 3

reasoning can be learned

Instead of paying humans to write step-by-step solutions, reward the model only on whether the final answer is correct. R1 grows its own chain-of-thought.

dense 671B model → 671B active params / token   // you pay for all of it
DeepSeek-V3      → 37B active params / token    // ~18× less compute per token, same total capacity

The lineage tells the story. V2 introduced MLA + DeepSeekMoE at 236B total / 21B active parameters. V3 scaled it to 671B/37B and bolted on Multi-Token Prediction and FP8 training, reaching GPT-4-class quality while the whole 14.8T-token pre-training run finished with no irrecoverable loss spike or rollback. R1 then took the V3 base and made it reason, largely through RL.

02 / 07

Cheaper attention · compress the memory

Multi-head Latent Attention.

Standard attention caches a full Key + Value per token, per head, per layer. MLA caches one small latent vector instead, and reconstructs K and V on the fly. Drag the context length and watch the cache blow up — then collapse.

FIG.02 — KV-CACHE PER TOKEN · MHA vs MLA · V3 CONFIG

[Interactive figure: context-length slider from short to 128K. Legend: MHA caches 32,768 values per token per layer; MLA caches 576. Readouts: total MHA cache, total MLA cache, reduction factor.]

// down-project, cache the latent, unfold heads later
c_KV = h_t · W_DKV     7168 → d_c = 512   ← cached
k_C  = c_KV · W_UK     up-project to per-head keys
v_C  = c_KV · W_UV     up-project to per-head values

V3 has 128 heads × 128 dims × 2 = 32,768 floats cached per token per layer in plain MHA. MLA caches just the 512-float latent c_KV.

The RoPE wrinkle. Rotary position embeddings don't commute with the "absorbed" matmul, so DeepSeek splits each key into a content part (from the latent, no RoPE) plus a tiny decoupled RoPE part of 64 dims that carries position and is shared across heads. Real cache = 512 + 64 = 576 floats per token per layer.
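The cache arithmetic above can be checked with a few lines of Python. This is a minimal sketch: the head count, head size, latent size, and RoPE size come from the text, while the layer count (61) and the 2-byte BF16 cache entries are assumptions that this lesson does not state.

```python
# Per-token KV-cache size for plain MHA vs MLA, using the figures quoted above.
N_HEADS = 128
HEAD_DIM = 128
LATENT_DIM = 512      # d_c, the cached latent
ROPE_DIM = 64         # decoupled RoPE key
N_LAYERS = 61         # assumption: layer count, not stated in this lesson
BYTES_PER_VALUE = 2   # assumption: cache stored in BF16


def mha_values_per_token_layer():
    # One key and one value vector per head.
    return N_HEADS * HEAD_DIM * 2


def mla_values_per_token_layer():
    # One shared latent plus one shared RoPE key.
    return LATENT_DIM + ROPE_DIM


def cache_gib(values_per_token_layer, context_len):
    total_bytes = values_per_token_layer * N_LAYERS * context_len * BYTES_PER_VALUE
    return total_bytes / 2**30


mha = mha_values_per_token_layer()
mla = mla_values_per_token_layer()
reduction = mha / mla

for ctx in (4096, 131072):
    print(ctx, cache_gib(mha, ctx), cache_gib(mla, ctx))
print("reduction factor:", round(reduction, 1))
```

Under these assumptions a single 128K-token sequence needs 488 GiB of cache with plain MHA and about 8.6 GiB with MLA, a reduction of roughly 57×.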

V2 reported a 93.3% KV-cache cut and 5.76× higher maximum generation throughput vs the dense DeepSeek-67B. Weight absorption: pre-multiply W_UK into W_Q (and W_UV into the output projection) so you attend directly in latent space. That is what makes MLA faster, not just smaller.
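Weight absorption is just associativity of matrix multiplication, which a toy example makes concrete. The sketch below covers one head and only the content (non-RoPE) part of the score; the tiny dimensions and helper names are illustrative assumptions, not DeepSeek's code.

```python
import random

D_C, D_H = 8, 4  # toy latent size and head size (the real values are 512 and 128)
rng = random.Random(0)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def vecmat(v, m):
    # Row vector (len rows) times matrix (rows x cols) gives a vector of len cols.
    return [sum(v[i] * m[i][j] for i in range(len(v))) for j in range(len(m[0]))]


def matvec(m, v):
    # Matrix (rows x cols) times column vector (len cols) gives a vector of len rows.
    return [dot(row, v) for row in m]


W_UK = [[rng.gauss(0, 1) for _ in range(D_H)] for _ in range(D_C)]
q = [rng.gauss(0, 1) for _ in range(D_H)]                # one head's query
latents = [[rng.gauss(0, 1) for _ in range(D_C)] for _ in range(5)]  # cached c_KV

# Naive path: rebuild every key from its latent, then take the dot product.
naive = [dot(q, vecmat(c, W_UK)) for c in latents]

# Absorbed path: fold W_UK into the query once, then score against raw latents.
q_absorbed = matvec(W_UK, q)
absorbed = [dot(q_absorbed, c) for c in latents]

print(max(abs(a - b) for a, b in zip(naive, absorbed)))
```

The two paths give the same scores, but the absorbed path never materialises per-head keys, so the work per cached token scales with the latent size rather than with heads × head size.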

03 / 07

Cheaper FFN · fine-grained experts

One shared expert, eight of 256.

A Mixture-of-Experts replaces the single big FFN with many small expert FFNs and a router that sends each token to a few. DeepSeekMoE slices experts thin and always runs one shared expert for the common knowledge. Click a token and watch the router light up its 8.

FIG.03 — V3 MoE LAYER · 1 SHARED + TOP-8 of 256 ROUTED

[Interactive figure: click a token to route it. The shared expert always runs and holds grammar and formatting; of the 256 routed experts, the top 8 are selected for the token. Experts per token: 1 shared + 8 routed. Total parameters: 671B. Active per token: 37B.]

FINE-GRAINED SEGMENTATION

Instead of a few fat experts, use many thin ones — 256 of them. Slicing finer lets the router combine specialists more precisely: many more knowledge combinations for the same active-parameter budget.

Shared-expert isolation. Some knowledge — grammar, common formatting — is needed by every token. Forcing the router to re-learn it in every expert wastes capacity, so DeepSeekMoE carves out 1 shared expert that always runs alongside the routed 8. The shared expert holds the common stuff; the routed ones specialise.

The cost shifts from FLOPs to memory + interconnect: "37B active" is the compute story, but you still need enough GPUs to hold all 671B in memory, which is why expert-parallelism and all-to-all routing comms exist.
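The routing described in this section can be sketched in plain Python. Treat this as a toy under stated assumptions: the sigmoid affinity, the normalisation of gates over the selected experts, the 16-dim hidden size, and the scalar "experts" are illustrative choices, and the residual connection is omitted.

```python
import math
import random

N_ROUTED, TOP_K, DIM = 256, 8, 16  # DIM is a toy hidden size
rng = random.Random(0)

# Hypothetical router weights: one centroid vector per routed expert.
centroids = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(N_ROUTED)]


def make_expert(scale):
    # Stand-in for a small FFN: just scales its input.
    return lambda x: [scale * v for v in x]


shared_expert = make_expert(1.0)
routed_experts = [make_expert(0.01 * i) for i in range(N_ROUTED)]


def route(token):
    """Return [(expert_id, gate)] for the top-k routed experts."""
    scores = [
        1.0 / (1.0 + math.exp(-sum(t * c for t, c in zip(token, cen))))
        for cen in centroids
    ]
    top = sorted(range(N_ROUTED), key=lambda i: scores[i], reverse=True)[:TOP_K]
    total = sum(scores[i] for i in top)
    return [(i, scores[i] / total) for i in top]  # gates sum to 1


def moe_layer(token):
    out = shared_expert(token)           # the shared expert always runs
    for i, gate in route(token):         # only 8 of 256 routed experts run
        y = routed_experts[i](token)
        out = [o + gate * v for o, v in zip(out, y)]
    return out


token = [rng.gauss(0, 1) for _ in range(DIM)]
picks = route(token)
output = moe_layer(token)
print(picks)
```

Only 9 of the 257 experts do any work for this token, yet all 257 must be resident somewhere, which is the compute-versus-memory split the paragraph above describes.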

04 / 07

V3's headline trick · balance without a loss

Auxiliary-loss-free balancing.

MoEs collapse if the router falls in love with a few favourite experts. The classic fix is an auxiliary loss — but that loss fights the language objective and hurts quality. V3 drops it and uses a tiny control loop on a per-expert bias. Press a step and watch the load even out.

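FIG.04 — LOAD PER EXPERT · BIAS CONTROL LOOP

[Interactive figure: tokens per expert for 8 experts, advanced one step at a time from step 0. The dashed line marks the even-split target; ink marks overloaded experts, rose marks the balanced target. Readouts: imbalance (max − min), overloaded experts.]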

// after each step, nudge the selection bias b_i
if expert i overloaded:   b_i ← b_i − γ
if expert i underloaded:  b_i ← b_i + γ
// b_i shifts only the top-K SELECTION,
// NOT the gate value that weights the output

The key subtlety: the bias b_i is added only to the routing score used for selection — never to the gate value that actually weights the expert's output. So overloaded experts get quietly down-ranked in selection while their contribution weights stay clean.

Balance is enforced without a gradient that corrupts the loss. The cost: one control-loop hyperparameter γ, the "bias update speed" — and it's a heuristic, not a guarantee.
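A small simulation shows the control loop at work. This is a sketch under assumptions: 8 experts with top-2 routing, a fixed per-expert preference (`skew`) standing in for a router that has favourites, Gaussian noise standing in for token variety, and γ = 0.01. None of these numbers come from DeepSeek.

```python
import random

N_EXPERTS, TOP_K, GAMMA = 8, 2, 0.01
rng = random.Random(0)

# Hypothetical built-in router preference: expert 0 is the favourite.
skew = [1.5, 1.0, 0.5, 0.0, 0.0, -0.5, -1.0, -1.5]
bias = [0.0] * N_EXPERTS


def step(n_tokens=256):
    """Route one batch, update the biases, return the imbalance (max - min load)."""
    load = [0] * N_EXPERTS
    for _ in range(n_tokens):
        scores = [skew[i] + rng.gauss(0, 1) for i in range(N_EXPERTS)]
        # The bias enters only this ranking; a gate value would use scores[i] alone.
        ranked = sorted(range(N_EXPERTS), key=lambda i: scores[i] + bias[i], reverse=True)
        for i in ranked[:TOP_K]:
            load[i] += 1
    target = n_tokens * TOP_K / N_EXPERTS
    for i in range(N_EXPERTS):
        if load[i] > target:
            bias[i] -= GAMMA   # overloaded: rank it lower next time
        elif load[i] < target:
            bias[i] += GAMMA   # underloaded: rank it higher next time
    return max(load) - min(load)


history = [step() for _ in range(400)]
first, last = history[0], history[-1]
print("imbalance at step 0:", first, "after 400 steps:", last)
```

The imbalance shrinks as the biases drift to offset the skew, but it never reaches exactly zero: the update is a fixed-size nudge, so the loads keep jittering around the target.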

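V2 still used device-limited routing + small balance losses; V3's auxiliary-loss-free scheme is the cleaner successor, although V3 still keeps one very small sequence-wise balance loss as a safeguard against extreme imbalance within a single sequence.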

05 / 07

Denser signal · cheaper arithmetic

Predict several tokens. In 8 bits.

Two more independent cost-cuts. Multi-Token Prediction squeezes more learning from every token by predicting the next few. FP8 training halves the bit-width of the heavy matmuls.

FIG.05 — MTP · 1 vs N PREDICTION TARGETS PER POSITION

[Interactive figure: prediction-depth slider from 1 to 4. Readouts: prediction depth, targets per position, extra use from the same data.]

Standard LLMs predict one next token per position. MTP adds small sequential modules, each with its own Transformer block but sharing the embedding and output head with the main model, that predict tokens further ahead (V3 itself uses a single extra module, so each position predicts two tokens). Two payoffs: a denser learning signal (more targets per token of the 14.8T corpus), and nearly free speculative decoding (the extra modules propose tokens that a verify pass accepts or rejects).
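To see what "more targets per position" means, the sketch below only builds the training targets; it does not model the MTP modules themselves. The function name and the `n_targets` parameter (next token plus extra look-ahead tokens) are hypothetical.

```python
def mtp_targets(tokens, n_targets):
    """For each position i, list the targets tokens[i+1 : i+1+n_targets] that exist."""
    return [tokens[i + 1 : i + 1 + n_targets] for i in range(len(tokens) - 1)]


seq = ["The", "cat", "sat", "on", "the", "mat"]

one = mtp_targets(seq, 1)   # standard next-token prediction
two = mtp_targets(seq, 2)   # next token plus one extra look-ahead token

count_one = sum(len(t) for t in one)
count_two = sum(len(t) for t in two)
print(one)
print(two)
print(count_one, count_two)
```

On this six-token sequence the number of training targets rises from 5 to 9 with the same data, which is the denser signal the paragraph above refers to.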

FP8 — 8-BIT TRAINING, DONE CAREFULLY

FP8 has a tiny dynamic range, so naïvely it overflows or underflows. DeepSeek's recipe: fine-grained tile/block scaling (1×128 activation tiles, 128×128 weight blocks), BF16/FP32 islands for the sensitive parts (embeddings, output head, norms, softmax, master weights), and high-precision accumulation of the FP8 products.
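Why fine-grained scaling matters can be shown with a crude simulation. Everything here is an assumption made for illustration: `fake_e4m3` only approximates the E4M3 format (4 significant bits, maximum 448, values below 2^-6 flushed to zero, subnormals ignored), and the activation row with a single outlier is synthetic.

```python
import math
import random

E4M3_MAX = 448.0


def fake_e4m3(x):
    """Crude E4M3 rounding: 4 significant bits, clamp to +-448, flush tiny values."""
    if x == 0.0:
        return 0.0
    x = max(-E4M3_MAX, min(E4M3_MAX, x))
    m, e = math.frexp(x)            # x = m * 2**e with 0.5 <= |m| < 1
    if e < -5:                      # |x| < 2**-6: treat as underflow
        return 0.0
    return math.ldexp(round(m * 16) / 16, e)


def quantize(values, tile):
    """Scale each tile so its largest magnitude maps to 448, round, scale back."""
    out = []
    for start in range(0, len(values), tile):
        chunk = values[start : start + tile]
        amax = max(abs(v) for v in chunk) or 1.0
        scale = amax / E4M3_MAX
        out.extend(fake_e4m3(v / scale) * scale for v in chunk)
    return out


def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


rng = random.Random(0)
row = [rng.gauss(0, 0.01) for _ in range(256)]
row[3] = 100.0                      # one outlier in the first 128-wide tile

err_tensor = mean_abs_error(row, quantize(row, tile=len(row)))  # one scale for all
err_tile = mean_abs_error(row, quantize(row, tile=128))         # one scale per tile
print(err_tensor, err_tile)
```

With a single scale, the outlier pushes many small values below the representable range; with one scale per 128-wide tile, the damage stays confined to the tile that contains the outlier.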

Combined with the DualPipe schedule, which overlaps computation with the all-to-all MoE-routing communication, this is how the run finished in ~2.79M H800 GPU-hours with no irrecoverable loss spikes or rollbacks. MTP modules can be dropped at inference, or reused for speculation.

06 / 07

Learning to reason · no critic, no labels

R1 learns to think with GRPO.

PPO — the usual RLHF algorithm — needs a separate critic network the same size as the policy. GRPO deletes it: sample a group of G answers to one prompt, score them, and use the group's own mean and spread as the baseline. Roll a fresh group and watch the advantages fall out.

FIG.06 — GRPO · GROUP-RELATIVE ADVANTAGE · ONE PROMPT

[Interactive figure: one prompt, G = 6 rollouts. Readouts: group mean reward, group std, samples pushed up / down.]

// group-relative advantage: no value network
// δ = small constant for numerical stability, ε = clip range
Â_i = ( r_i − mean(r) ) / ( std(r) + δ )
// every token in output i shares Â_i
// clipped PPO surrogate, KL subtracted in the LOSS
J = E[ min( ρ·Â , clip(ρ, 1−ε, 1+ε)·Â ) − β·D_KL ]

The advantage is relative: "was this answer better or worse than its siblings on the same prompt?" That is exactly the signal you want. It needs no learned value model, which removes the critic's share of RL memory (roughly half, when the critic is as large as the policy).
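The group-relative advantage and the clipped term are short enough to write out. This is a minimal sketch: the use of the population standard deviation, the stabiliser `eps`, the clip range of 0.2, and the example rewards are assumptions, and the KL penalty is left out.

```python
import statistics


def group_advantages(rewards, eps=1e-6):
    """Normalise each reward against the mean and spread of its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)   # assumption: population std
    return [(r - mean) / (std + eps) for r in rewards]


def clipped_term(ratio, advantage, clip_eps=0.2):
    """PPO-style clipped surrogate for one token (KL term omitted)."""
    clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped * advantage)


# Six rollouts for one prompt: two correct, four wrong.
rewards = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
adv = group_advantages(rewards)
print([round(a, 3) for a in adv])

# If every rollout earns the same reward, nobody is pushed up or down.
flat = group_advantages([1.0, 1.0, 1.0, 1.0])
print(flat)
```

Correct rollouts get a positive advantage and wrong ones a negative advantage, and the values sum to roughly zero. A group where every answer scores the same contributes no gradient, which is one reason the method wants prompts of intermediate difficulty.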

RULE-BASED REWARDS (RLVR)

R1 mostly avoids a learned reward model (which can be reward-hacked). For verifiable tasks the reward is mechanical: accuracy (does the boxed answer match? do the unit tests pass in a sandbox?) plus a format reward (did it wrap thinking in <think>…</think>?).
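A rule-based reward of this kind fits in a few lines. The sketch below is hypothetical: the regular expressions, the exact-string answer match, and the 0.5 weight on the format reward are illustrative choices, not R1's actual reward code.

```python
import re

THINK_RE = re.compile(r"^<think>.*?</think>", re.DOTALL)
BOXED_RE = re.compile(r"\\boxed\{([^{}]*)\}")


def format_reward(completion):
    """1.0 if the completion opens with a <think>...</think> block."""
    return 1.0 if THINK_RE.match(completion.strip()) else 0.0


def accuracy_reward(completion, reference):
    """1.0 if the last \\boxed{...} answer matches the reference exactly."""
    answers = BOXED_RE.findall(completion)
    return 1.0 if answers and answers[-1].strip() == reference.strip() else 0.0


def total_reward(completion, reference, format_weight=0.5):
    # format_weight is an assumed value chosen for illustration.
    return accuracy_reward(completion, reference) + format_weight * format_reward(completion)


good = r"<think>2 + 2 is 4</think> The answer is \boxed{4}."
bad = r"The answer is \boxed{5}."

print(total_reward(good, "4"))
print(total_reward(bad, "4"))
```

Nothing here is learned, so there is no reward model to exploit, but the check is only as good as the answer format: an exact string match would reject "4.0", which is why real graders normalise answers first.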

The "aha moment": GRPO applied to the V3 base with no SFT at all gave R1-Zero, which spontaneously learned to write "wait, let me reconsider," backtrack, and verify itself — AIME climbed ~15.6% → ~71% during RL. The shipped R1 adds a 4-stage pipeline (cold-start SFT → reasoning RL → rejection-sample + SFT → final RL) to fix R1-Zero's language-mixing and readability.

07 / 07

Why it compounds · and the honest caveats

The stack, and what it really costs.

No single equation is "the DeepSeek architecture." Each trick attacks a different cost axis, and because they're roughly independent they stack multiplicatively. Here is the whole ledger — what each buys, and what it costs.

Idea	Cost axis	Buys you	Costs you
MLA	KV-cache memory	~40–60× smaller cache, big long-context throughput	extra projections; decoupled-RoPE complexity; latent read twice
DeepSeekMoE	FLOPs / token	huge capacity, tiny active compute, precise specialization	all-to-all comms; must hold all experts in memory
Aux-loss-free	balancing	balance without a loss that hurts quality	adds the γ heuristic; not a guarantee
MTP	training signal	denser signal + free speculative decoding	extra modules during training; modest, not magical
FP8	bit-width	~½ memory, faster GEMMs, lower $$	needs fine-grained scaling + BF16 islands; brittle if naïve
GRPO	RL infrastructure	no critic network (½ RL memory); clean relative signal	needs G samples / prompt; wants verifiable rewards
Rule-based RL	human labels	no expensive reasoning labels; resists reward hacking	only where the answer is mechanically checkable

THE HONEST "$5.6M"

That figure is the final training run, not total cost. It excludes prior R&D, failed runs, data, salaries, and the hardware itself. It's real and impressive — but it is not "anyone can build GPT-4 for $6M."
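The headline number is simple arithmetic on GPU-hours. The sketch below uses the breakdown and the $2 per GPU-hour rental price given in the V3 technical report; neither appears in this lesson, so treat them as outside inputs.

```python
# H800 GPU-hours by phase, as listed in the V3 technical report.
GPU_HOURS = {
    "pre-training": 2_664_000,
    "context extension": 119_000,
    "post-training": 5_000,
}
USD_PER_GPU_HOUR = 2.0   # the rental price the report assumes

total_hours = sum(GPU_HOURS.values())
cost_usd = total_hours * USD_PER_GPU_HOUR

print(total_hours)
print(cost_usd)
```

The result is 2.788M GPU-hours and about $5.58M, so the figure is a rental-price estimate for one run and is sensitive to the assumed hourly rate.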

MLA's win is context-dependent (it shines at long context + high batch; bandwidth savings are ~half the memory savings). Serving a 671B MoE is not cheap — "37B active" is FLOPs, but you still hold all 671B in GPU memory.

RULE-BASED RL HAS A HARD EDGE

R1-Zero's magic needs a checkable answer. For open-ended generation (essays, dialogue, taste) you're back to learned reward models or human preference — which can be reward-hacked. R1's later stages reintroduce SFT and general alignment for exactly this reason.

Distillation works: SFT small dense models (Qwen / Llama, 1.5B→70B) on R1's outputs and reasoning transfers cheaply — distilled-Qwen-32B beats much larger models, and the paper finds distillation-from-a-strong-reasoner beats running RL directly on the small model. Reproducing FP8 + DualPipe is genuinely hard — the RL recipe was reproduced far faster than the pre-training stack.

02 · 06 — you made it

You understand the whole stack.

MLA squeezed the KV cache. DeepSeekMoE cut the FLOPs. Aux-loss-free balancing kept the router honest. MTP and FP8 densified the signal and halved the bits. GRPO and rule-based rewards grew reasoning with no critic and no labels. Seven ideas on roughly independent cost axes, multiplied. That is how a $5.6M final training run rivalled the frontier.
