microGPT showed the math is small. This shows the engineering is reachable: a ~10.8M-parameter GPT, written component by component in PyTorch, trained on 1 MB of Shakespeare to coherent text in ~45 minutes — every knob (context, heads, warmup, temperature) implemented by your own hand.
The whole model is one function: given a window of past tokens, output a probability over the next one. Train that by showing it text and asking it to guess. Sample from it in a loop and it writes.
A model never sees text — it sees integers. The simplest tokenizer maps every distinct character to an ID. Type below and watch the char-vocab grow, encode, and round-trip exactly.
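The round trip described above can be sketched in a few lines of plain Python. The sample string and the helper names (`stoi`, `itos`, `encode`, `decode`) are illustrative assumptions, not names taken from the workshop code.

```python
# Minimal character-level tokenizer sketch (names are illustrative).
text = "To be, or not to be"

# The vocabulary is every distinct character, sorted so IDs are deterministic.
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}   # character -> integer ID
itos = {i: ch for ch, i in stoi.items()}       # integer ID -> character


def encode(s):
    """Map a string to a list of integer IDs."""
    return [stoi[c] for c in s]


def decode(ids):
    """Map a list of integer IDs back to a string."""
    return "".join(itos[i] for i in ids)


ids = encode(text)
print(len(chars), ids)
print(decode(ids) == text)   # the round trip is exact
```

On the full Shakespeare corpus the same construction yields the 65-character vocabulary used in the rest of this page.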
Why characters and not subword BPE? A 65-character vocabulary has only 65² = 4,225 possible bigrams, so on a ~1 MB corpus the common ones recur many times: dense statistics a tiny model can actually learn. Swap in GPT-2's 50,257-token BPE and most pairs are too rare to estimate.
The reported gap is stark: BPE training loss stalls around ~6.3 on Shakespeare, while char-level reaches ~1.5. The cost is ~3× longer sequences — the right trade only because the data is small.
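A back-of-the-envelope version of that argument, as a rough sketch. It assumes the corpus is exactly 1,000,000 characters and takes the ~3× sequence-length ratio above at face value; the averages are over all possible pairs, so real counts are far more skewed than this suggests.

```python
# Rough density estimate: how many observations per possible bigram?
char_vocab = 65
bpe_vocab = 50_257
corpus_chars = 1_000_000          # assumption: "~1 MB" rounded to 1e6 characters

char_bigrams = char_vocab ** 2    # 4,225 possible character pairs
bpe_bigrams = bpe_vocab ** 2      # about 2.5 billion possible token pairs

# Assumption: BPE sequences are ~3x shorter, per the ratio quoted above.
bpe_tokens = corpus_chars / 3

avg_per_char_bigram = corpus_chars / char_bigrams   # a few hundred
avg_per_bpe_bigram = bpe_tokens / bpe_bigrams       # far below one

print(f"char: {char_bigrams:,} pairs, ~{avg_per_char_bigram:.0f} observations each")
print(f"BPE:  {bpe_bigrams:,} pairs, ~{avg_per_bpe_bigram:.5f} observations each")
```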
Token IDs become vectors, get a position added, then flow through 6 identical blocks of attention + MLP, each wrapped in pre-norm and a residual highway. Hover the stack to trace where the ~10.8M parameters live.
Embeddings. Two tables, token (wte, 65×384) and learned absolute position (wpe, 256×384), are summed. The token table is weight-tied to the output head, which keeps input and output representations consistent (and saves ~25K params).
MLP. Per block: Linear(384→1536) → GELU → Linear(1536→384) — the canonical 4× expansion. GELU is a smooth gate (no hard zero cutoff) that helps gradients flow.
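To see what "no hard zero cutoff" means, here is a small sketch comparing GELU with ReLU on a few inputs. It uses the exact erf form of GELU; whether the workshop uses this form or the tanh approximation is not stated, so treat that choice as an assumption.

```python
import math


def gelu(x):
    """Exact GELU: x times the standard normal CDF of x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def relu(x):
    """ReLU for comparison: a hard cutoff at zero."""
    return max(0.0, x)


for x in (-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0):
    print(f"x={x:+.1f}  relu={relu(x):.4f}  gelu={gelu(x):+.4f}")
```

Negative inputs produce small negative outputs under GELU instead of exactly zero, so those units still pass some gradient.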
Pre-norm (LayerNorm before each sublayer, GPT-2 style) stabilizes deep training; the residual highway lets gradients reach early layers undecayed. After 6 blocks: a final LayerNorm, then the tied head → 65 logits per position.
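The ~10.8M figure can be checked by hand from the shapes above. The sketch below is a tally under stated assumptions: a fused Q/K/V projection, two LayerNorms per block plus a final one, and a `bias` flag (hypothetical, for illustration) that toggles bias terms on every Linear and LayerNorm. The tied head adds no parameters of its own.

```python
def count_params(vocab=65, block_size=256, n_layer=6, d_model=384, bias=True):
    """Tally parameters for the GPT described above (a sketch, not the workshop's code)."""

    def linear(n_in, n_out):
        return n_in * n_out + (n_out if bias else 0)

    def layer_norm():
        return d_model + (d_model if bias else 0)   # gain, plus optional bias

    block = (
        layer_norm()
        + linear(d_model, 3 * d_model)   # fused Q, K, V projection
        + linear(d_model, d_model)       # attention output projection
        + layer_norm()
        + linear(d_model, 4 * d_model)   # MLP expand, 384 -> 1536
        + linear(4 * d_model, d_model)   # MLP contract, 1536 -> 384
    )
    return {
        "wte": vocab * d_model,          # token table, shared with the output head
        "wpe": block_size * d_model,     # learned absolute positions
        "blocks": n_layer * block,
        "ln_f": layer_norm(),
        "lm_head": 0,                    # weight-tied to wte, so nothing extra
    }


with_bias = count_params(bias=True)
without_bias = count_params(bias=False)
total_with_bias = sum(with_bias.values())
total_without_bias = sum(without_bias.values())

print(with_bias)
print(f"total with biases:    {total_with_bias:,}")
print(f"total without biases: {total_without_bias:,}")
```

Either way, almost all of the parameters sit in the six blocks; the embedding tables together account for roughly 1% of the total.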
Attention lets each position read from earlier ones — but only earlier. Position i may attend to 0…i, never the future. Drag the query position and watch the causal mask and softmax weights light up.
One linear projects each vector to Q, K, V, then 384 dims reshape into 6 heads of 64. Each head scores how much query i wants each key j, softmaxes those scores into weights, and mixes the values.
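The 1/√d scale (d = 64, the per-head dimension) keeps dot products from growing with dimension and saturating the softmax. Causality is requested with is_causal=True, which lets PyTorch's scaled_dot_product_attention dispatch to a fused kernel (FlashAttention where the hardware supports it): faster and more memory-frugal than materializing a triangular mask.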
Heads run in parallel so they specialize: one might track which vowels follow consonants, another line-break patterns. Outputs concatenate back to 384 and pass an output projection.
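The attention steps above can be sketched in PyTorch as follows. This is a minimal illustration, assuming PyTorch 2.0 or later (for `F.scaled_dot_product_attention`); the variable names and the shortened sequence length are choices made for the demo, not the workshop's code. It computes attention twice, once with an explicit mask and once through the fused call, and checks that the two agree.

```python
import math

import torch
import torch.nn.functional as F

torch.manual_seed(0)

B, T, C, n_head = 2, 8, 384, 6   # T is shortened from 256 to keep the demo small
head_dim = C // n_head           # 64

qkv_proj = torch.nn.Linear(C, 3 * C)   # one linear produces Q, K and V together
out_proj = torch.nn.Linear(C, C)       # output projection after heads are merged
x = torch.randn(B, T, C)

with torch.no_grad():
    q, k, v = qkv_proj(x).split(C, dim=2)
    # (B, T, C) -> (B, n_head, T, head_dim): each head gets its own 64-dim slice
    q, k, v = (t.view(B, T, n_head, head_dim).transpose(1, 2) for t in (q, k, v))

    # Manual path: scaled scores, causal mask, softmax, weighted sum of values
    scores = (q @ k.transpose(-2, -1)) / math.sqrt(head_dim)
    causal = torch.tril(torch.ones(T, T, dtype=torch.bool))
    scores = scores.masked_fill(~causal, float("-inf"))
    weights = F.softmax(scores, dim=-1)
    manual = weights @ v

    # Fused path: same math without building the (T, T) mask explicitly
    fused = F.scaled_dot_product_attention(q, k, v, is_causal=True)

    # Merge heads back to (B, T, C) and apply the output projection
    y = out_proj(fused.transpose(1, 2).contiguous().view(B, T, C))

max_diff = (manual - fused).abs().max().item()
print(y.shape, max_diff)
```

Row i of `weights` is zero for every column j > i, which is the causal rule stated earlier: position i reads from 0…i and nothing later.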
The objective is next-token cross-entropy; the extras (AdamW, grad-clip, warmup→cosine LR) are what make it train, not just exist. Run training and watch val loss fall, then rise again as the model starts memorizing.
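As a rough sketch of the objective in plain Python: the target for each position is simply the next token, and the loss is the negative log-probability the model assigns to it. The token IDs and the tiny `block_size` below are made up for illustration. A useful sanity check falls out of it: a model that guesses uniformly over 65 characters scores ln(65), about 4.17, so an untrained network should start near that value.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Negative log-probability assigned to the correct next token."""
    return -math.log(softmax(logits)[target])


# Inputs and targets are the same sequence, shifted by one position.
ids = [5, 1, 4, 2, 3, 0]          # hypothetical token IDs
block_size = 4                    # tiny window for the demo (the workshop uses 256)
x = ids[0:block_size]             # what the model sees
y = ids[1:block_size + 1]         # what it must predict at each position

# Baseline: all-equal logits over a 65-character vocabulary.
vocab_size = 65
uniform_loss = cross_entropy([0.0] * vocab_size, target=0)

print(x, y)
print(f"uniform-guess loss: {uniform_loss:.4f}")
```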
AdamW (lr=1e-3, wd=0.01), grad-clip at global norm 1.0 to cap spikes, and warmup→cosine: a short ramp up to the peak rate, then big steps early to explore and small steps late to refine. Each step sees 64×256 = 16,384 supervised next-char targets; that density is why tiny models learn fast.
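One way to write the warmup→cosine schedule is sketched below. The peak rate (1e-3) and the 5,000-step horizon come from this page; the warmup length (100 steps) and the floor (`min_lr=1e-4`) are assumptions chosen for illustration.

```python
import math


def lr_at(step, max_lr=1e-3, min_lr=1e-4, warmup_steps=100, max_steps=5000):
    """Linear warmup to max_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Ramp linearly so the first updates are small.
        return max_lr * (step + 1) / warmup_steps
    if step >= max_steps:
        return min_lr
    progress = (step - warmup_steps) / (max_steps - warmup_steps)   # 0 -> 1
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))             # 1 -> 0
    return min_lr + cosine * (max_lr - min_lr)


targets_per_step = 64 * 256   # batch size x context length

for step in (0, 50, 100, 1000, 2550, 5000):
    print(f"step {step:>4}: lr = {lr_at(step):.6f}")
print(f"targets per step: {targets_per_step:,}")
```

In a PyTorch loop the returned value would typically be written into each optimizer param group before the step, with gradient clipping applied between the backward pass and the optimizer update.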
The honest punchline: best val loss (~1.57–1.64) lands around step 1,500–2,500. After that the 10.8M model overfits 1 MB and memorizes — the right move is early stopping, not all 5,000 steps. The overfit cliff is itself the lesson in data-vs-parameters.
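Early stopping can be as simple as remembering the best validation loss and giving up after a few evaluations without improvement. The sketch below is one possible version; the `patience` parameter and the loss history are illustrative assumptions, with synthetic numbers shaped only to resemble the curve described above, not measured results.

```python
def early_stop(val_history, patience=3):
    """Return (best_step, best_loss, stop_step) for a list of (step, val_loss) pairs.

    stop_step is None if the run never went `patience` evaluations without improving.
    """
    best_step, best_loss = None, float("inf")
    bad_evals = 0
    for step, loss in val_history:
        if loss < best_loss:
            best_step, best_loss = step, loss   # new best: this is the checkpoint to keep
            bad_evals = 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return best_step, best_loss, step
    return best_step, best_loss, None


# Synthetic validation losses, made up for illustration only.
history = [
    (500, 2.10), (1000, 1.80), (1500, 1.65), (2000, 1.60),
    (2500, 1.62), (3000, 1.68), (3500, 1.75), (4000, 1.83),
]
best_step, best_loss, stopped_at = early_stop(history, patience=3)
print(best_step, best_loss, stopped_at)
```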
A next-token model becomes a writer by sampling in a loop. Temperature and top-k act on the logits before softmax — runtime knobs, no retraining. Slide them and watch the next-char distribution morph.
Temperature rescales logits. T→0 → greedy & repetitive; T=1 → the raw distribution; T>1 flattens it, lifting rare tokens (creative but incoherent). The doc's sweet spot is T ≈ 0.7–0.9.
Top-k masks all but the k highest logits to −∞, truncating the unreliable long tail before sampling (k≈40 for the 65-char vocab). And we sample, not argmax — greedy decoding loops; sampling respects confidence while keeping variety.
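Both knobs fit in one small function. The following is a plain-Python sketch of temperature scaling followed by top-k truncation and sampling; the function names, the default values, and the example logits are illustrative assumptions.

```python
import math
import random


def next_token_probs(logits, temperature=0.8, top_k=None):
    """Apply temperature, then top-k masking, then softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; use argmax for greedy decoding")
    if top_k is not None and top_k < 1:
        raise ValueError("top_k must be at least 1")
    scaled = [z / temperature for z in logits]
    if top_k is not None and top_k < len(scaled):
        kth = sorted(scaled, reverse=True)[top_k - 1]
        # Everything below the k-th largest logit is masked to -inf.
        # Ties with the k-th value are kept, so slightly more than k may survive.
        scaled = [z if z >= kth else float("-inf") for z in scaled]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]   # exp(-inf) is exactly 0.0
    total = sum(exps)
    return [e / total for e in exps]


def sample(logits, temperature=0.8, top_k=None, rng=random):
    """Draw one token ID from the adjusted distribution."""
    probs = next_token_probs(logits, temperature, top_k)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.5, -1.0]    # hypothetical scores for a 4-token vocabulary
rng = random.Random(0)

for t in (0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in next_token_probs(logits, temperature=t)])
print([round(p, 3) for p in next_token_probs(logits, temperature=1.0, top_k=2)])
print([sample(logits, temperature=0.8, top_k=2, rng=rng) for _ in range(10)])
```

Lower temperatures pile probability onto the top token, higher ones spread it out, and top-k zeroes the tail entirely; none of this touches the model weights.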
Part 5 wires it together; Part 6 scales model and data on the same laptop so you feel the curve. The three preset configs come straight from nanoGPT's lineage.
~5 min. Learns letter shapes and spacing, babbles non-words. The fastest "is my loop even working?" check.
~20 min. Real words and short phrases emerge; the grammar starts to feel Shakespearean.
~45 min. Coherent (if meaningless) Shakespeare-flavored verse. The default — nanoGPT's char demo config, made literal.
| This workshop builds | …but a 2026 LLM uses | Why the gap matters |
|---|---|---|
| Learned absolute positions | RoPE (rotary) | relative positions generalize to longer context |
| Vanilla LayerNorm | RMSNorm | cheaper, no mean-subtraction needed |
| GELU MLP | SwiGLU / gated MLP | gating buys quality per parameter |
| Dense attention | GQA / MQA, sliding-window | shrinks the KV-cache for long contexts |
| Single-host fp32-ish | mixed precision, FSDP / tensor-parallel | the difference between a laptop and a cluster |
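
To make one row of the table concrete, here is a small plain-Python sketch of the LayerNorm versus RMSNorm difference. The learnable gain and bias are left out for brevity, which is a simplification made for this example.

```python
import math


def layer_norm(x, eps=1e-5):
    """Subtract the mean, then divide by the standard deviation."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def rms_norm(x, eps=1e-5):
    """Divide by the root-mean-square only; no mean subtraction."""
    mean_sq = sum(v * v for v in x) / len(x)
    return [v / math.sqrt(mean_sq + eps) for v in x]


x = [1.0, 2.0, 3.0, 4.0]
ln_out = layer_norm(x)
rms_out = rms_norm(x)
print([round(v, 4) for v in ln_out])
print([round(v, 4) for v in rms_out])

# On input that already has zero mean, the two normalizations coincide.
centered = [-1.5, -0.5, 0.5, 1.5]
print(layer_norm(centered) == rms_norm(centered))
```

RMSNorm skips one reduction over the vector (the mean), which is where the "cheaper" in the table comes from.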
~10.8M params on 1 MB produces grammatical-looking Shakespeare noise — no semantics, no facts, no instruction-following. Nothing here transfers to "deploy a chatbot." The value isn't the babbler; it's that you implemented every knob by hand.
Architecturally this is a 2019/2022-era GPT-2, not a modern stack — but that's exactly why it's a clear teaching object. Char-level caps the ceiling on purpose; don't read its weakness as a Transformer weakness.
The "real-trainer extras" — AdamW, grad-clip, warmup-cosine, pre-norm, residuals, top-k — are precisely what microGPT keeps minimal. They convert a bare algorithm into something numerically stable that samples coherently. That conversion is the whole point.
Sampling is a runtime dial, not retraining: the same weights are "boring" at T=0.1 or "unhinged" at T=1.5. That is the exact lever an orchestrator — or an LLM council member — tunes per call.
Char vocab, the forward pass, causal attention, a stable training loop, sampling, and scale. A working GPT is a laptop-hour away — and you wrote every line. You now understand a model as a system, not a vending machine.
You wired attention into a working model. Now slow down and really understand query, key, value — the operation underneath all of it.