openalicelabs / academy
LESSON 00 · 03 microGPT — LLM FROM SCRATCH EST. READ ~11 MIN LIGHT · v0.1
OPENALICE LABORATORIES · EDUCATION PATH · RUNG 03

A whole LLM in about 300 lines.

An LLM is not magic. It is a dataset, a tokenizer, an autograd engine, one attention block, a training loop, and a sampling loop — and the whole thing fits in pure Python with zero dependencies. The trillion-dollar part is scale, not mystery. Finish this page and an LLM becomes a thing you can hold in your head.

FIG.00 — RECURSIVE MOTIF
FIG.0A — THE KARPATHY LINEAGE · ONE IDEA PER RUNG · EDUCATION ↔ CAPABILITY

microGPT is one rung on a ladder. micrograd is the scalar autograd engine alone; microGPT is micrograd + a transformer on top; nanoGPT scales it to real GPT-2 pretraining; nanochat adds the whole ChatGPT pipeline. Each rung trades education for capability.

FIG.01 — ONE TRANSFORMER BLOCK · ATTENTION ROUTES (SOLID) · MLP THINKS · RESIDUALS CARRY
TASK: generate plausible names · 32,033-name corpus
VOCAB: 27 tokens (26 letters + 1 BOS delimiter)
PARAMETERS: 4,192 (a frontier model has 10¹¹–10¹²)
ENGINE: scalar autograd · attention · Adam · temperature
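The 27-token vocabulary can be sketched in a few lines. This is a minimal illustration, not the lesson's own code: it assumes the letters a to z take ids 0 to 25 and the BOS delimiter takes id 26, and the names `LETTERS`, `encode`, and `decode` are hypothetical.

```python
import string

# Assumption: ids 0..25 are the letters a..z and id 26 is the BOS delimiter.
LETTERS = string.ascii_lowercase
BOS = len(LETTERS)              # 26
VOCAB_SIZE = len(LETTERS) + 1   # 27


def encode(name):
    """Wrap one name in BOS on both sides and map each letter to its id."""
    return [BOS] + [LETTERS.index(ch) for ch in name] + [BOS]


def decode(tokens):
    """Drop the BOS delimiters and map ids back to letters."""
    return "".join(LETTERS[t] for t in tokens if t != BOS)


tokens = encode("emma")
print(tokens)          # [26, 4, 12, 12, 0, 26]
print(decode(tokens))  # emma
```

Using one delimiter for both ends keeps the vocabulary at 27 symbols: the same token tells the model "a name starts here" and "this name is over".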
01 / 08
The one-sentence idea

A complete GPT, in one file.

Dataset, tokenizer, autograd, attention, training, generation — ~300 lines of pure Python, zero dependencies. No PyTorch, no NumPy, no CUDA. Karpathy: "This is the full algorithmic content of what is needed. Everything else is just for efficiency."

TRAINS ON

32,033 names

Each name is one "document," wrapped in a BOS token that marks both start and end.

RUNS IN

~1 minute, CPU

1,000 training steps on a laptop. Loss falls from 3.3 (random) to 2.37.

WHY 3.3 → 2.37

ln(27) ≈ 3.30

This is the cross-entropy of random guessing over 27 tokens. Dropping below it means the model learned real letter structure.
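The two loss numbers are easier to read as probabilities. A small sketch of the arithmetic, assuming the loss is the mean cross-entropy in nats, so that exp(−loss) is the geometric-mean probability the model gives the correct next token:

```python
import math

vocab_size = 27

# Cross-entropy of a uniform guess: -ln(1/27) = ln(27).
uniform_loss = -math.log(1.0 / vocab_size)
print(round(uniform_loss, 2))  # 3.3

# exp(-loss) converts a mean loss back into a (geometric-mean) probability.
p_start = math.exp(-uniform_loss)  # 1/27
p_end = math.exp(-2.37)
print(round(p_start, 3), round(p_end, 3))  # 0.037 0.093
```

Read this way, training moves the model from roughly 1-in-27 to roughly 1-in-11 on the correct next character.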

The gap between this toy and ChatGPT is scale + engineering + post-training — data volume, learned tokenizers, distributed GPUs, RLHF — not a fundamentally different idea. Every model OpenAlice orchestrates is this, scaled. Seeing the 300 lines makes the whole stack legible.

02 / 08
micrograd → microGPT · the assembly line

Walk the pipeline, one piece at a time.

microGPT is built by adding exactly one idea at a time, following Karpathy's train0…train5 ladder. Each stage adds one new concept, and the build grows from a frequency table into a transformer.

FIG.02 — THE BUILD · TRAIN0…TRAIN5 · ONE NEW CONCEPT PER STAGE
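The page says only that the build starts from a frequency table. As a rough idea of what such a starting point could look like, here is a bigram count table that samples names with no learning at all. Everything in it is an assumption for illustration: the five-name corpus, the "." delimiter, and the helper `sample_name` are hypothetical and are not taken from the lesson's code.

```python
import random
from collections import Counter

# Hypothetical stand-in for the names corpus.
names = ["emma", "olivia", "ava", "mia", "emily"]
BOS = "."  # assumption: one delimiter symbol marks both start and end

# Count how often each character follows each other character.
counts = Counter()
for name in names:
    chars = [BOS] + list(name) + [BOS]
    for prev, nxt in zip(chars, chars[1:]):
        counts[(prev, nxt)] += 1


def sample_name(rng, max_len=20):
    """Walk the table: draw each next character in proportion to its count."""
    out, prev = [], BOS
    for _ in range(max_len):
        followers = {nxt: c for (p, nxt), c in counts.items() if p == prev}
        nxt = rng.choices(list(followers), weights=list(followers.values()))[0]
        if nxt == BOS:
            break
        out.append(nxt)
        prev = nxt
    return "".join(out)


print(sample_name(random.Random(0)))
```

A table like this only ever looks one character back. The later stages replace the lookup with learned parameters and, eventually, attention over the whole prefix.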
03 / 08
The clever bit · backprop in 30 lines

The autograd engine: every number remembers.

Each number is a Value that records not just its data but how it was computed. Call loss.backward() and the chain rule sweeps the whole graph. This is the same reverse-mode differentiation PyTorch performs, done here one scalar at a time instead of on whole tensors.

class Value:
    .data          # forward scalar
    .grad          # ∂L/∂this (starts 0)
    ._children     # the Values it came from
    ._local_grads  # d(op)/d(each child)

backward() does two things: a topological sort so every node comes after what it depends on, then a reverse sweep pushing gradient into children — child.grad += local · node.grad.

THE #1 BACKPROP DETAIL

Gradients use +=, never =. When a value feeds several ops, gradients from every path must be summed. That's the multivariable chain rule, handled for free by accumulation.
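The two steps of backward() and the += rule fit in a short sketch. This is a cut-down illustration that supports only + and ×, not the lesson's full engine; the constructor signature and the inner helper `build` are assumptions. It runs a = 2, b = 3 with L = (a·b)², where p = a·b feeds the same multiplication twice, so its gradient only comes out right if the two paths are summed.

```python
class Value:
    """A scalar that remembers how it was computed (sketch: + and * only)."""

    def __init__(self, data, children=(), local_grads=()):
        self.data = data                 # forward scalar
        self.grad = 0.0                  # dL/d(this), starts at 0
        self._children = children        # the Values it came from
        self._local_grads = local_grads  # d(op)/d(each child)

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def backward(self):
        # 1) Topological sort: each node is listed after everything it depends on.
        topo, seen = [], set()

        def build(node):
            if id(node) not in seen:
                seen.add(id(node))
                for child in node._children:
                    build(child)
                topo.append(node)

        build(self)
        # 2) Reverse sweep: push gradient into children, accumulating with +=.
        self.grad = 1.0
        for node in reversed(topo):
            for child, local in zip(node._children, node._local_grads):
                child.grad += local * node.grad


a, b = Value(2.0), Value(3.0)
p = a * b   # 6
L = p * p   # 36; p is both children of this op
L.backward()
print(L.data, p.grad, a.grad, b.grad)  # 36.0 12.0 36.0 24.0
```

With = in place of +=, p.grad would stop at 6 instead of 12, and both a.grad and b.grad would come out at half their true values.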

FIG.03 — A WORKED SCALAR · FORWARD → BACKWARD
// a=2, b=3 · L = (a·b)²
p = a·b = 6
L = p² = 36
// backward: chain rule
∂L/∂p = 2p = 12
∂L/∂a = 12·b = 36
∂L/∂b = 12·a = 24
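A quick way to trust gradients like these is to compare them against finite differences. The sketch below nudges a and b by a small step h and measures how L = (a·b)² changes; the step size and the helper name `loss` are illustrative choices.

```python
def loss(a, b):
    return (a * b) ** 2


a, b, h = 2.0, 3.0, 1e-6

# Central differences approximate each partial derivative numerically.
dL_da = (loss(a + h, b) - loss(a - h, b)) / (2 * h)
dL_db = (loss(a, b + h) - loss(a, b - h)) / (2 * h)

print(round(dL_da, 3), round(dL_db, 3))  # 36.0 24.0
```

The numerical estimates agree with the chain-rule values 36 and 24. The same check is a common sanity test for any hand-written backward pass.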
04 / 08
Where the model routes information

Attention: Q·K → softmax → weighted V.

Each token becomes a Query, a Key, and a Value. A position scores its query against its own key and every earlier key, softmaxes the scores into weights, and pulls a weighted blend of the corresponding values. In the figure, the picked token is the query and the weights show where its attention flows.

FIG.04 — SELF-ATTENTION · CAUSAL · √dₖ-SCALED
QUERY = the picked token · keys ≤ query are visible (causal mask)
ATTENTION WEIGHTS softmax( q·kⱼ / √dₖ )
# score this position's query vs every cached key
logit[t] = Σ q[j]·k_t[j] / √dₖ
# turn scores into a probability over positions
w = softmax(logits)
# weighted blend of the cached values
out = Σ w[t]·V[t]

The √dₖ scaling (here √4 = 2) keeps the dot-products from growing with dimension and saturating the softmax; it is the trick from Attention Is All You Need. Causality is automatic: a position only ever sees the keys cached so far, its own and those of earlier positions. Attention is routing: it decides which earlier tokens matter for predicting the next one.
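The three steps (score, softmax, blend) can be written out for a single head and a single query position. This is a sketch on plain floats with made-up 4-dimensional vectors; the function name `attend` and the example numbers are hypothetical, and the cache is assumed to hold only the positions up to and including the query.

```python
import math


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(q, keys, values):
    """One head, one query. keys/values hold only positions <= the query."""
    d_k = len(q)
    # Score the query against every cached key, scaled by sqrt(d_k).
    logits = [sum(qj * kj for qj, kj in zip(q, k)) / math.sqrt(d_k) for k in keys]
    # Turn scores into a probability over positions.
    w = softmax(logits)
    # Weighted blend of the cached values.
    out = [sum(w[t] * values[t][j] for t in range(len(values)))
           for j in range(len(values[0]))]
    return w, out


# Hypothetical vectors for three cached positions.
keys = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
values = [[1.0, 0.0, 0.0, 0.0], [0.0, 2.0, 0.0, 0.0], [0.0, 0.0, 3.0, 0.0]]
q = [2.0, 2.0, 0.0, 0.0]

w, out = attend(q, keys, values)
print([round(x, 3) for x in w])  # [0.212, 0.212, 0.576]
```

The third key lines up best with the query, so it takes most of the weight. No explicit mask appears anywhere: causality comes entirely from which keys are in the cache.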

n_head = 4 · head_dim = 4 · + residual
05 / 08
Attention mixes · the MLP thinks

The transformer block: a quartet.

Four parts make the block. RMSNorm keeps magnitudes sane, attention routes across positions, the MLP thinks per position, and residuals carry gradients straight back. microGPT runs exactly one block; frontier models stack a hundred-plus.

RMSNORM

magnitudes sane

scale = (mean(xᵢ²) + ε)^(−½); xᵢ ← xᵢ·scale. No parameters; it just keeps activations well-behaved.

ATTENTION

routes across

Mixes information between positions — the only place tokens talk to each other. Then + residual.

MLP

thinks within

16→64→relu→16, each position independent. The nonlinear per-position compute. Then + residual.
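Of the three cards above, RMSNorm is the smallest to write down. A minimal sketch on plain floats, assuming ε = 1e-5 (the page does not give the value) and a hypothetical function name `rmsnorm`:

```python
def rmsnorm(x, eps=1e-5):
    """Scale a vector so its root-mean-square is about 1. No learned parameters."""
    mean_sq = sum(xi * xi for xi in x) / len(x)
    scale = (mean_sq + eps) ** -0.5
    return [xi * scale for xi in x]


x = [3.0, -4.0, 0.0, 0.0]
y = rmsnorm(x)
rms_after = (sum(yi * yi for yi in y) / len(y)) ** 0.5
print(round(rms_after, 3))  # 1.0
```

Direction is preserved and only the overall magnitude changes, which is why the later layers see inputs of a predictable size.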

FIG.05 — ONE BLOCK · x → +attn(rmsnorm x) → +mlp(rmsnorm x) · RESIDUALS DRAWN STRAIGHT
THE MLP · EXPAND 4× THEN CONTRACT
x = linear(x, mlp_fc1)          # 16 → 64
x = [xi.relu() for xi in x]
x = linear(x, mlp_fc2)          # 64 → 16
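The same expand-then-contract shape can be run on plain floats. This sketch makes several assumptions: `linear` is taken to be a matrix-vector product with one weight row per output unit and no bias, the weights are random placeholders, and the normalization step before the MLP is left out to keep the example short. It does include the residual add.

```python
import random


def linear(x, w):
    """Matrix-vector product: w has one row of weights per output unit."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def mlp_with_residual(x, fc1, fc2):
    h = linear(x, fc1)                        # 16 -> 64
    h = [max(0.0, hi) for hi in h]            # relu
    h = linear(h, fc2)                        # 64 -> 16
    return [xi + hi for xi, hi in zip(x, h)]  # residual add


rng = random.Random(0)
n_embd = 16
# Hypothetical small random weights, stored as rows of output units.
fc1 = [[rng.gauss(0, 0.02) for _ in range(n_embd)] for _ in range(4 * n_embd)]
fc2 = [[rng.gauss(0, 0.02) for _ in range(4 * n_embd)] for _ in range(n_embd)]

x = [rng.gauss(0, 1) for _ in range(n_embd)]
y = mlp_with_residual(x, fc1, fc2)
print(len(y))  # 16
```

Because of the residual add, an MLP with all-zero weights returns its input unchanged, which is one way to see how residuals give gradients a straight path back.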
TRAINING · ADAM + CROSS-ENTROPY
m = β₁·m + (1−β₁)·g          # momentum
v = β₂·v + (1−β₂)·g²         # variance
m̂ = m/(1−β₁ᵗ⁺¹)              # bias-correct
v̂ = v/(1−β₂ᵗ⁺¹)              # bias-correct
p −= lr_t · m̂ / (√v̂ + ε)

β₁=0.85 · β₂=0.99 · lr=0.01 with linear decay to zero. Real Adam (Kingma & Ba), not a toy.
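The update rule can be tried on a single scalar parameter. The sketch below uses the stated β₁, β₂, and learning rate with linear decay over 1,000 steps; ε = 1e-8, the function name `adam_minimize`, and the toy objective f(p) = (p − 0.5)² are assumptions made for the example.

```python
import math


def adam_minimize(grad_fn, p, num_steps=1000, lr=0.01,
                  beta1=0.85, beta2=0.99, eps=1e-8):
    """Adam on one scalar parameter, with linear learning-rate decay to zero."""
    m, v = 0.0, 0.0
    for step in range(num_steps):
        g = grad_fn(p)
        m = beta1 * m + (1 - beta1) * g        # momentum
        v = beta2 * v + (1 - beta2) * g * g    # variance
        m_hat = m / (1 - beta1 ** (step + 1))  # bias-correct
        v_hat = v / (1 - beta2 ** (step + 1))  # bias-correct
        lr_t = lr * (1 - step / num_steps)     # linear decay
        p -= lr_t * m_hat / (math.sqrt(v_hat) + eps)
    return p


# Hypothetical objective: f(p) = (p - 0.5)^2, so the gradient is 2 * (p - 0.5).
p_final = adam_minimize(lambda p: 2 * (p - 0.5), p=0.0)
print(abs(p_final - 0.5) < 0.05)
```

In the real training loop the same five lines run once per parameter per step, with g read from each parameter's .grad after backward().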

06 / 08
Inference · the same loop that drives ChatGPT

Sampling & temperature.

Run the forward pass → get 27 logits → divide by temperature → softmax → draw a token weighted by those probabilities → feed it back. A temperature below 1 sharpens the distribution (safe, repetitive); above 1 flattens it (diverse, more typos).

FIG.06 — LIVE TEMPERATURE · softmax(logits / T) OVER A SAMPLE VOCAB

probs = softmax([l / T for l in logits])
token = random.choices(vocab, weights=probs)[0]   # choices returns a list, so take [0]


NEXT-TOKEN DISTRIBUTION · same logits, reshaped by T
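The effect of temperature is easy to see by reshaping one fixed set of logits. A self-contained sketch, using a made-up four-symbol vocabulary and made-up logits; the helper name `sample` is hypothetical:

```python
import math
import random


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sample(logits, vocab, temperature, rng):
    """Divide logits by T, softmax, then draw one token weighted by the result."""
    probs = softmax([l / temperature for l in logits])
    token = rng.choices(vocab, weights=probs)[0]
    return token, probs


# Hypothetical logits over a four-symbol sample vocabulary.
vocab = ["a", "e", "n", "z"]
logits = [2.0, 1.0, 0.5, -1.0]
rng = random.Random(0)

for T in (0.5, 1.0, 2.0):
    token, probs = sample(logits, vocab, T, rng)
    print(T, [round(p, 2) for p in probs], token)
```

At T = 0.5 most of the probability mass sits on the top logit; at T = 2.0 it spreads toward the unlikely symbols. The ranking of the tokens never changes, only how strongly the top ones are preferred.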
07 / 08
Same algorithm · different scale

microGPT vs. a frontier LLM.

Side by side, the difference is stark in size and identical in idea. The algorithm is the same; the gap is data, engineering, and post-training.

FIG.07 — THE NUMBERS THAT MAKE IT CLICK
dimension | microGPT | a frontier LLM
lines of code | ~300, no deps | millions, many frameworks
parameters | 4,192 | 10¹¹ – 10¹²
vocab | 27 chars | ~100k learned subwords
config | n_embd 16 · n_head 4 · n_layer 1 | thousands × tens × 100+
data | 32,033 names | trillions of tokens
training | ~1 min · MacBook CPU | months · huge GPU fleets
loss | 3.3 → 2.37 |
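The 4,192 figure can be reproduced from the config with a little arithmetic. The breakdown below is one accounting that lands on that total, under assumptions the page does not state: a 16-position context (block_size = 16), learned position embeddings, no bias terms, and an output head that is separate from the token embedding.

```python
# Sizes stated on this page.
vocab_size, n_embd, n_layer = 27, 16, 1
# Assumption (not stated on this page): a 16-position context window.
block_size = 16

token_emb = vocab_size * n_embd   # 432
pos_emb = block_size * n_embd     # 256
attn = 4 * n_embd * n_embd        # Q, K, V, output projections: 1,024
mlp = 2 * n_embd * (4 * n_embd)   # 16 -> 64 and 64 -> 16: 2,048
lm_head = vocab_size * n_embd     # 432

total = token_emb + pos_emb + n_layer * (attn + mlp) + lm_head
print(total)  # 4192
```

Under this accounting, almost three quarters of the parameters (3,072 of 4,192) sit in the single transformer block, and two thirds of those are in the MLP.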
HONEST CAVEAT

microGPT teaches pretraining only. No SFT, no RLHF, no tool use — the things that turn a base model into an assistant. Generating names ≠ language. The point is to see the machine, not to be good.

WHY THIS MATTERS FOR OPENALICE

Context window = block_size. KV cache = why long contexts cost memory. Temperature = Alice's per-chat diversity knob. You now reason about the stack because you've seen the toy version.
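The memory cost of the KV cache follows from counting what it stores: one key vector and one value vector per layer for every cached position. The sketch below assumes plain multi-head attention with no cache-shrinking tricks. The microGPT line assumes block_size = 16 and 8-byte floats, and the larger config is entirely hypothetical, chosen only to show the growth with context length.

```python
def kv_cache_bytes(n_layer, n_embd, seq_len, bytes_per_value):
    """Keys and values: 2 vectors of size n_embd per layer per cached position."""
    return 2 * n_layer * n_embd * seq_len * bytes_per_value


# microGPT sizes from this page; block_size 16 and 8-byte floats are assumptions.
print(kv_cache_bytes(n_layer=1, n_embd=16, seq_len=16, bytes_per_value=8))  # 4096

# Hypothetical larger model, only to show linear growth with context length.
for seq_len in (1_000, 100_000):
    size = kv_cache_bytes(n_layer=100, n_embd=8192, seq_len=seq_len, bytes_per_value=2)
    print(seq_len, round(size / 2**30, 1), "GiB")
```

The cache grows linearly with sequence length, so a context 100 times longer needs 100 times the memory for the same model.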

08 / 08 — you made it

You just built an LLM.

An autograd engine. Attention. One block. Adam. Temperature sampling. The same loop, at enormous scale, is what drives ChatGPT — over ~100k subwords instead of 27 chars. You now hold the core algorithm.

01 Neural network from scratch · a neuron, a layer, a loss, backprop ✓ done
03 microGPT · a whole LLM in ~300 lines · same autograd, + attention ✓ complete
04 LLM from scratch · a 10M-param GPT trained on a laptop, end to end next
05 Transformers & attention · the architecture that made it all work locked