An LLM is not magic. It is a dataset, a tokenizer, an autograd engine, one attention block, a training loop, and a sampling loop — and the whole thing fits in pure Python with zero dependencies. The trillion-dollar part is scale, not mystery. Finish this page and an LLM becomes a thing you can hold in your head.
microGPT is one rung on a ladder. micrograd is the scalar autograd engine alone; microGPT is micrograd + a transformer on top; nanoGPT scales it to real GPT-2 pretraining; nanochat adds the whole ChatGPT pipeline. Each rung trades education for capability.
Dataset, tokenizer, autograd, attention, training, generation — ~300 lines of pure Python, zero dependencies. No PyTorch, no NumPy, no CUDA. Karpathy: "This is the full algorithmic content of what is needed. Everything else is just for efficiency."
Each name is one "document," wrapped in a BOS token that marks both start and end.
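A minimal sketch of the tokenizer this describes, assuming the 27 tokens are the 26 lowercase letters plus one BOS token. The names `encode` and `decode`, and giving BOS the last id, are illustrative choices rather than details taken from the original code.

```python
# Character-level tokenizer sketch: 26 letters + 1 BOS token = 27 tokens.
# Assumption: BOS takes the last id; the original may order ids differently.
chars = [chr(c) for c in range(ord("a"), ord("z") + 1)]
BOS = len(chars)              # id 26
vocab_size = len(chars) + 1   # 27

def encode(name):
    """Wrap one name (one "document") in BOS on both sides."""
    return [BOS] + [chars.index(ch) for ch in name] + [BOS]

def decode(tokens):
    """Drop the BOS markers and map ids back to letters."""
    return "".join(chars[t] for t in tokens if t != BOS)

tokens = encode("emma")
print(tokens)          # [26, 4, 12, 12, 0, 26]
print(decode(tokens))  # emma
```

Using the same token for start and end means the model only has to learn one boundary symbol: predicting BOS after a letter is how it says "the name is over."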
1,000 training steps on a laptop. Loss falls from 3.3 (random) to 2.37.
A loss of ln 27 ≈ 3.3 is what random guessing over 27 tokens scores. Dropping below it means the model learned real letter structure.
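The baseline can be checked in a few lines. This sketch assumes the loss is the usual average cross-entropy in nats, and reads the final loss of 2.37 as an "effective number of choices" by exponentiating it.

```python
import math

vocab_size = 27

# Cross-entropy of a uniform guess over 27 tokens: -ln(1/27) = ln(27).
baseline = -math.log(1.0 / vocab_size)
print(round(baseline, 3))   # 3.296

# exp(loss) is the perplexity: roughly how many equally likely
# tokens the model is still choosing between at each step.
perplexity = math.exp(2.37)
print(round(perplexity, 1))  # 10.7
```

So training narrows the model's uncertainty from 27 equally likely next characters to roughly 11.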
The gap between this toy and ChatGPT is scale + engineering + post-training — data volume, learned tokenizers, distributed GPUs, RLHF — not a fundamentally different idea. Every model OpenAlice orchestrates is this, scaled. Seeing the 300 lines makes the whole stack legible.
microGPT is built by adding exactly one idea at a time: Karpathy's train0…train5 ladder, which grows the build stage by stage from a frequency table into a transformer.
Each number is a Value that records not just its data but how it was computed. Call loss.backward() and the chain rule sweeps the whole graph. This is the same reverse-mode differentiation PyTorch performs on tensors, done here one scalar at a time.
backward() does two things: a topological sort so every node comes after what it depends on, then a reverse sweep pushing gradient into children — child.grad += local · node.grad.
Gradients use +=, never =. When a value feeds several ops, gradients from every path must be summed. That's the multivariable chain rule, handled for free by accumulation.
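The engine described above can be sketched in about thirty lines. This is an illustrative reduction supporting only `+` and `*`, not the original source; the attribute names `_children` and `_local_grads` are assumptions.

```python
class Value:
    """A scalar that remembers how it was computed (illustrative sketch)."""

    def __init__(self, data, children=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._children = children
        self._local_grads = local_grads  # d(self)/d(child), one per child

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other),
                     (other.data, self.data))

    def backward(self):
        # 1) Topological sort: every node comes after what it depends on.
        topo, visited = [], set()

        def build(node):
            if node not in visited:
                visited.add(node)
                for child in node._children:
                    build(child)
                topo.append(node)

        build(self)

        # 2) Reverse sweep: push gradient into children, accumulating with +=.
        self.grad = 1.0
        for node in reversed(topo):
            for child, local in zip(node._children, node._local_grads):
                child.grad += local * node.grad

a = Value(3.0)
b = Value(2.0)
loss = a * b + a      # a feeds two ops, so it receives gradient from two paths
loss.backward()
print(a.grad, b.grad)  # 3.0 3.0
```

Here d(loss)/da = b + 1 = 3. With `=` in place of `+=`, one of the two paths into `a` would overwrite the other and the result would be wrong.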
Each token becomes a Query, a Key, and a Value. A position scores its query against its own key and every earlier key, softmaxes the scores into weights, and pulls a weighted blend of the corresponding values.
The √dₖ scaling (here √4 = 2) keeps the dot products from growing with dimension and saturating the softmax, the trick from Attention Is All You Need. Causality is automatic: a position only ever sees the keys cached so far, its own and the earlier ones. Attention is routing: it decides which tokens seen so far matter for predicting the next one.
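A minimal sketch of one attention head processing positions one at a time, assuming a head dimension of 4. The query, key, and value vectors are hypothetical hand-picked numbers; in the real model they would come from learned projections of each token's embedding. The function name `attend` and the cache variables are illustrative.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(q, k, v, key_cache, value_cache):
    """One head at one position. On entry the caches hold earlier positions."""
    key_cache.append(k)      # the current position may also attend to itself
    value_cache.append(v)
    d_k = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, key)) / math.sqrt(d_k)
              for key in key_cache]
    weights = softmax(scores)
    out = [sum(w * val[i] for w, val in zip(weights, value_cache))
           for i in range(len(v))]
    return out, weights

key_cache, value_cache = [], []
# Hypothetical (q, k, v) triples for three positions, head_dim = 4.
positions = [
    ([1.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0]),
    ([0.0, 1.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]),
    ([2.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 1.0, 0.0]),
]
history = []
for q, k, v in positions:
    out, weights = attend(q, k, v, key_cache, value_cache)
    history.append(weights)
    print([round(w, 3) for w in weights])
```

No mask is needed: at each step the cache simply does not contain future keys yet. The third query lines up with the first key, so most of its weight flows to position one.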
Four parts make the block. RMSNorm keeps magnitudes sane, attention routes across positions, the MLP thinks per position, and residuals carry gradients straight back. microGPT runs exactly one block; frontier models stack a hundred-plus.
scale = (mean(xᵢ²) + ε)^(−1/2); xᵢ ← xᵢ·scale. No parameters; it just keeps activations well-behaved.
Mixes information between positions — the only place tokens talk to each other. Then + residual.
16→64→relu→16, each position independent. The nonlinear per-position compute. Then + residual.
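The RMSNorm and MLP steps can be sketched as follows, assuming a pre-norm arrangement (normalize, transform, then add the input back), bias-free weight matrices, and ε = 1e-5. The weights are random placeholders, and `linear` and `mlp_with_residual` are illustrative names.

```python
import random

def rmsnorm(x, eps=1e-5):
    """Rescale x so its root-mean-square is about 1. No learned parameters."""
    mean_sq = sum(xi * xi for xi in x) / len(x)
    scale = (mean_sq + eps) ** -0.5
    return [xi * scale for xi in x]

def linear(x, w):
    """w is a list of rows; each row produces one output number."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def mlp_with_residual(x, w1, w2):
    h = linear(rmsnorm(x), w1)                # 16 -> 64
    h = [max(0.0, hi) for hi in h]            # relu
    h = linear(h, w2)                         # 64 -> 16
    return [xi + hi for xi, hi in zip(x, h)]  # residual: add the input back

random.seed(0)
n_embd = 16
w1 = [[random.gauss(0, 0.02) for _ in range(n_embd)] for _ in range(4 * n_embd)]
w2 = [[random.gauss(0, 0.02) for _ in range(4 * n_embd)] for _ in range(n_embd)]
x = [random.gauss(0, 1) for _ in range(n_embd)]
y = mlp_with_residual(x, w1, w2)
print(len(y))  # 16
```

Because the block only ever adds to `x`, setting the second weight matrix to zero returns the input unchanged. That identity path is what lets gradients flow straight back through the block.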
β₁=0.85 · β₂=0.99 · lr=0.01 with linear decay to zero. Real Adam (Kingma & Ba), not a toy.
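A sketch of one Adam update with these hyperparameters, run on a toy one-parameter problem rather than the model. The decay formula `lr * (1 - step / num_steps)` and ε = 1e-8 are assumptions consistent with "linear decay to zero"; `adam_step` is an illustrative name.

```python
def adam_step(params, grads, m, v, step, num_steps,
              lr=0.01, beta1=0.85, beta2=0.99, eps=1e-8):
    """Update params in place. step counts from 0."""
    lr_t = lr * (1 - step / num_steps)             # linear decay toward zero
    for i, g in enumerate(grads):
        m[i] = beta1 * m[i] + (1 - beta1) * g      # running mean of gradients
        v[i] = beta2 * v[i] + (1 - beta2) * g * g  # running mean of squared gradients
        m_hat = m[i] / (1 - beta1 ** (step + 1))   # bias correction for early steps
        v_hat = v[i] / (1 - beta2 ** (step + 1))
        params[i] -= lr_t * m_hat / (v_hat ** 0.5 + eps)

# Toy problem: minimize f(p) = p^2, whose gradient is 2p.
params, m, v = [1.0], [0.0], [0.0]
num_steps = 1000
for step in range(num_steps):
    grads = [2 * p for p in params]
    adam_step(params, grads, m, v, step, num_steps)
print(abs(params[0]) < 0.05)
```

Dividing by the root of the squared-gradient average makes the very first step almost exactly `lr` in size regardless of the gradient's magnitude, which is why a single learning rate can serve every parameter.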
Run the forward pass → get 27 logits → divide by temperature → softmax → draw a token weighted by those probabilities → feed it back. Temperature is the knob: T < 1 sharpens the distribution (safe, repetitive), T > 1 flattens it (diverse, more typos).
probs = softmax( logits / T )
token = random.choices(vocab, weights=probs)[0]  # choices() returns a list, so take its single element
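The two lines above expand into a runnable sketch like this one. The three logits are made-up numbers standing in for the model's 27, and `sample` is an illustrative name.

```python
import math
import random

def sample(logits, temperature, rng=random):
    """Return (token_id, probs) after temperature-scaled softmax sampling."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                          # subtract the max for stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    token = rng.choices(range(len(logits)), weights=probs)[0]
    return token, probs

logits = [2.0, 1.0, 0.0]   # hypothetical logits for a 3-token vocabulary
rng = random.Random(0)     # seeded so repeated runs draw the same tokens

for T in (0.5, 1.0, 2.0):
    token, probs = sample(logits, T, rng)
    print(T, [round(p, 3) for p in probs], token)
```

At T = 0.5 the top token takes most of the probability mass; at T = 2.0 the three options move closer together, so rarer tokens are drawn more often.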
Side by side, the difference is stark in size and identical in idea. The algorithm is the same; the gap is data, engineering, and post-training.
| dimension | microGPT | a frontier LLM |
|---|---|---|
| lines of code | ~300, no deps | millions, many frameworks |
| parameters | 4,192 | 10¹¹ – 10¹² |
| vocab | 27 chars | ~100k learned subwords |
| config | n_embd 16 · n_head 4 · n_layer 1 | thousands × tens × 100+ |
| data | 32,033 names | trillions of tokens |
| training | ~1 min · MacBook CPU | months · huge GPU fleets |
| loss | 3.3 → 2.37 | — |
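One way to arrive at the 4,192 figure in the table, as a sketch under stated assumptions: a context length (block_size) of 16, no bias terms, four n_embd × n_embd attention matrices (query, key, value, output), and an output projection separate from the token embedding. These shapes are not stated on this page.

```python
vocab_size, n_embd, n_layer = 27, 16, 1
block_size = 16  # assumption: not given in the table

wte = vocab_size * n_embd            # token embeddings: 432
wpe = block_size * n_embd            # position embeddings: 256
attn = 4 * n_embd * n_embd           # q, k, v, and output projections: 1024
mlp = 2 * n_embd * (4 * n_embd)      # 16 -> 64 and 64 -> 16: 2048
lm_head = vocab_size * n_embd        # projection to 27 logits: 432

total = wte + wpe + n_layer * (attn + mlp) + lm_head
print(total)  # 4192
```

Under these assumptions roughly three quarters of the parameters sit in the single block, with the MLP alone accounting for about half.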
microGPT teaches pretraining only. It has no SFT, no RLHF, and no tool use, the things that turn a base model into an assistant. Generating plausible names is not the same as modeling language. The point is to see the machine, not to be good.
Context window = block_size. KV cache = why long contexts cost memory. Temperature = Alice's per-chat diversity knob. You now reason about the stack because you've seen the toy version.
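The memory cost of the KV cache follows from simple counting. This sketch assumes standard multi-head attention, where each layer stores one key vector and one value vector of width n_embd per position; the microGPT context length of 16 is an assumption, and `kv_cache_floats` is an illustrative name.

```python
def kv_cache_floats(n_layer, n_embd, context_len):
    """Numbers held in the cache: a key and a value per layer per position."""
    return 2 * n_layer * n_embd * context_len

# microGPT-sized: one layer, n_embd = 16, assumed context length of 16.
small = kv_cache_floats(n_layer=1, n_embd=16, context_len=16)
print(small)  # 512

# The cache grows linearly with context: doubling the window doubles it.
print(kv_cache_floats(n_layer=1, n_embd=16, context_len=32))  # 1024
```

The same formula applied to a model with many layers and a wide embedding is why long contexts become a memory problem at scale.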
An autograd engine. Attention. One block. Adam. Temperature sampling. The same loop, at enormous scale, is what drives ChatGPT — over ~100k subwords instead of 27 chars. You now hold the core algorithm.
A 10M-param GPT trained on a laptop, end to end — real pretraining, the rung above this one.