OpenAlice Academy — 03 / microGPT — a whole LLM in ~300 lines

01 / 08

The one-sentence idea

A complete GPT, in one file.

Dataset, tokenizer, autograd, attention, training, generation — ~300 lines of pure Python, zero dependencies. No PyTorch, no NumPy, no CUDA. Karpathy: "This is the full algorithmic content of what is needed. Everything else is just for efficiency."

TRAINS ON

32,033 names

Each name is one "document," wrapped in a BOS token that marks both start and end.

RUNS IN

~1 minute, CPU

1,000 training steps on a laptop. Loss falls from 3.3 (random) to 2.37.

WHY 3.3 → 2.37

ln(27) ≈ 3.30

A model that guesses uniformly over 27 tokens scores ln(27) ≈ 3.30. A loss below that means the model has learned real letter structure.
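The BOS wrapping and the ln(27) baseline can be checked in a few lines. This is a minimal sketch, not the lesson's tokenizer: it assumes the 26 lowercase letters take ids 0 to 25 and BOS takes id 26, and the helper name `encode` is hypothetical.

```python
import math
import string

# Assumption: letters a..z get ids 0..25 and BOS gets id 26.
chars = list(string.ascii_lowercase)
BOS = len(chars)
vocab_size = len(chars) + 1  # 26 letters + BOS = 27

def encode(name):
    """Wrap one name in BOS on both sides and map each letter to its id."""
    return [BOS] + [chars.index(c) for c in name] + [BOS]

tokens = encode("emma")

# Cross-entropy of a model that gives every token probability 1/27.
uniform_loss = -math.log(1.0 / vocab_size)

print(tokens)
print(round(uniform_loss, 2))
```

Any loss below `uniform_loss` means the model is doing better than a uniform guess.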

The gap between this toy and ChatGPT is scale + engineering + post-training — data volume, learned tokenizers, distributed GPUs, RLHF — not a fundamentally different idea. Every model OpenAlice orchestrates is this, scaled. Seeing the 300 lines makes the whole stack legible.

02 / 08

micrograd → microGPT · the assembly line

Walk the pipeline, one piece at a time.

microGPT is built by adding exactly one idea at a time, following Karpathy's train0…train5 ladder. Each stage adds one concept, and the build grows from a frequency table into a transformer.

FIG.02 — THE BUILD · TRAIN0…TRAIN5 · ONE NEW CONCEPT PER STAGE
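To make the starting point of the ladder concrete, here is a minimal sketch of a next-character frequency table. It assumes the first stage counts character pairs (a bigram table), which the text above does not spell out. The three-name corpus and the "." marker standing in for BOS are hypothetical.

```python
from collections import Counter

# Hypothetical three-name corpus standing in for the real dataset.
names = ["emma", "ava", "mia"]
BOS = "."  # assumption: one marker character plays the BOS role

# Count how often each character follows each other character.
counts = Counter()
for name in names:
    seq = [BOS] + list(name) + [BOS]
    for prev, nxt in zip(seq, seq[1:]):
        counts[(prev, nxt)] += 1

def next_char_probs(prev):
    """Normalize one row of the count table into probabilities."""
    row = {nxt: c for (p, nxt), c in counts.items() if p == prev}
    total = sum(row.values())
    return {nxt: c / total for nxt, c in row.items()}

probs_after_m = next_char_probs("m")
```

A table like this has no gradients and no parameters to train, so it serves as a baseline for the learned stages that follow.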


03 / 08

The clever bit · backprop in 30 lines

The autograd engine: every number remembers.

Each number is a Value that records not just its data but how it was computed. Call loss.backward() and the chain rule sweeps the whole graph. This is the same reverse-mode algorithm PyTorch's autograd runs, except that PyTorch works on whole tensors and microGPT works one scalar at a time.

class Value:
    .data          # forward scalar
    .grad          # ∂L/∂this (starts 0)
    ._children     # the Values it came from
    ._local_grads  # d(op)/d(each child)

backward() does two things: a topological sort so every node comes after what it depends on, then a reverse sweep pushing gradient into children — child.grad += local · node.grad.

THE #1 BACKPROP DETAIL

Gradients use +=, never =. When a value feeds several ops, gradients from every path must be summed. That's the multivariable chain rule, handled for free by accumulation.

FIG.03 — A WORKED SCALAR · FORWARD → BACKWARD

// a=2, b=3 · L = (a·b)²
p = a·b = 6
L = p² = 36
// backward: chain rule
∂L/∂p = 2p = 12
∂L/∂a = 12·b = 36
∂L/∂b = 12·a = 24
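The worked scalar can be reproduced with a small engine. This is an illustrative sketch that follows the field names listed above, not the lesson's actual source. It supports only the operations the example needs: add, multiply, and a constant power.

```python
class Value:
    """A scalar that remembers how it was computed (illustrative sketch)."""

    def __init__(self, data, children=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._children = children
        self._local_grads = local_grads

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def __pow__(self, k):  # k is a plain Python number
        return Value(self.data ** k, (self,), (k * self.data ** (k - 1),))

    def backward(self):
        # 1) Topological sort: every node is listed after what it depends on.
        topo, seen = [], set()

        def build(v):
            if v not in seen:
                seen.add(v)
                for child in v._children:
                    build(child)
                topo.append(v)

        build(self)
        # 2) Reverse sweep: push gradient into children, accumulating with +=.
        self.grad = 1.0
        for v in reversed(topo):
            for child, local in zip(v._children, v._local_grads):
                child.grad += local * v.grad

a, b = Value(2.0), Value(3.0)
L = (a * b) ** 2
L.backward()

# A value that feeds the same op twice: both paths must be summed.
x = Value(3.0)
y = x * x
y.backward()
```

After the first call, `a.grad` is 36 and `b.grad` is 24, matching the figure. In the second case `x.grad` is 6 only because the two contributions of 3 are added; plain assignment would leave it at 3.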


04 / 08

Where the model routes information

Attention: Q·K → softmax → weighted V.

Each token becomes a Query, a Key, and a Value. A position scores its query against its own key and every earlier key, softmaxes the scores into weights, and pulls a weighted blend of the corresponding values.

FIG.04 — SELF-ATTENTION · CAUSAL · √dₖ-SCALED

QUERY = the picked token · keys ≤ query are visible (causal mask)

ATTENTION WEIGHTS softmax( q·kⱼ / √dₖ )

# score this position's query vs every cached key
logit[t] = Σ q[j]·k_t[j] / √dₖ
# turn scores into a probability over positions
w = softmax(logits)
# weighted blend of the cached values
out = Σ w[t]·V[t]

The √dₖ scaling (here √4 = 2) keeps the dot-products from growing with dimension and saturating the softmax. It is the trick from Attention Is All You Need. Causality is automatic: a position only ever sees the keys cached so far, its own included, so nothing from the future can leak in. Attention is routing: it decides which earlier tokens matter for predicting the next one.
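The three steps (score, softmax, blend) can be sketched for a single head on plain floats. The vectors below are hypothetical, and the function name `attend` is an assumption. The real model works on Value objects and learns q, k, and v through linear projections.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(q, keys, values):
    """One head, one position: score q against every cached key."""
    d_k = len(q)
    logits = [sum(qj * kj for qj, kj in zip(q, k)) / math.sqrt(d_k) for k in keys]
    w = softmax(logits)
    dim = len(values[0])
    out = [sum(w[t] * values[t][j] for t in range(len(values))) for j in range(dim)]
    return w, out

# Hypothetical 4-dim query/key/value vectors for three positions.
qs = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
ks = [[2.0, 0.0, 0.0, 0.0], [0.0, 2.0, 0.0, 0.0], [0.0, 0.0, 2.0, 0.0]]
vs = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]

key_cache, value_cache, weights, outputs = [], [], [], []
for q, k, v in zip(qs, ks, vs):
    key_cache.append(k)      # the cache grows by one entry per position
    value_cache.append(v)
    w, out = attend(q, key_cache, value_cache)
    weights.append(w)
    outputs.append(out)
```

No explicit mask appears anywhere: position t gets t + 1 weights because only t + 1 keys exist in the cache when it runs.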

n_head = 4 · head_dim = 4 · + residual

05 / 08

Attention mixes · the MLP thinks

The transformer block: a quartet.

Four parts make the block. RMSNorm keeps magnitudes sane, attention routes across positions, the MLP thinks per position, and residuals carry gradients straight back. microGPT runs exactly one block; frontier models stack a hundred-plus.

RMSNORM

magnitudes sane

scale = (mean(xᵢ²) + ε)^−½; xᵢ ← xᵢ·scale. No parameters — just keeps activations well-behaved.

ATTENTION

routes across

Mixes information between positions — the only place tokens talk to each other. Then + residual.

MLP

thinks within

16→64→relu→16, each position independent. The nonlinear per-position compute. Then + residual.

FIG.05 — ONE BLOCK · x → +attn(rmsnorm x) → +mlp(rmsnorm x) · RESIDUALS DRAWN STRAIGHT
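The RMSNorm formula and the residual wiring in the caption can be sketched on plain floats. The `eps` value and the `block` helper are assumptions for illustration. The attention and MLP sublayers are passed in as functions so the wiring stays visible.

```python
def rmsnorm(x, eps=1e-5):
    """Rescale x so its mean square is about 1; no learned parameters."""
    mean_sq = sum(xi * xi for xi in x) / len(x)
    scale = (mean_sq + eps) ** -0.5
    return [xi * scale for xi in x]

def block(x, attn, mlp):
    """x -> x + attn(rmsnorm(x)) -> x + mlp(rmsnorm(x))."""
    x = [a + b for a, b in zip(x, attn(rmsnorm(x)))]
    x = [a + b for a, b in zip(x, mlp(rmsnorm(x)))]
    return x

x = [3.0, -4.0, 0.0, 0.0]
y = rmsnorm(x)
mean_sq_after = sum(yi * yi for yi in y) / len(y)

# With sublayers that output zeros, the residual path returns x unchanged.
zeros = lambda v: [0.0] * len(v)
passthrough = block(x, zeros, zeros)
```

The pass-through case shows why residuals help training: even if a sublayer contributes nothing, the input and its gradient still travel straight through the additions.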

THE MLP · EXPAND 4× THEN CONTRACT

x = linear(x, mlp_fc1)          # 16 → 64
x = [xi.relu() for xi in x]
x = linear(x, mlp_fc2)          # 64 → 16
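A runnable version of the same three lines, on plain floats rather than Value objects, might look like the sketch below. The bias-free `linear` helper, the `matrix` initializer, and its standard deviation are assumptions, since the lesson does not show them.

```python
import random

random.seed(0)
n_embd = 16

def matrix(nout, nin, std=0.08):
    # Assumption: small Gaussian init; the std value is illustrative only.
    return [[random.gauss(0.0, std) for _ in range(nin)] for _ in range(nout)]

def linear(x, w):
    """Matrix-vector product with no bias: one output per row of w."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

mlp_fc1 = matrix(4 * n_embd, n_embd)  # 16 -> 64
mlp_fc2 = matrix(n_embd, 4 * n_embd)  # 64 -> 16

def mlp(x):
    h = linear(x, mlp_fc1)
    h = [max(0.0, hi) for hi in h]    # relu on plain floats
    return linear(h, mlp_fc2)

x = [random.gauss(0.0, 1.0) for _ in range(n_embd)]
hidden = linear(x, mlp_fc1)
out = mlp(x)
n_mlp_params = len(mlp_fc1) * len(mlp_fc1[0]) + len(mlp_fc2) * len(mlp_fc2[0])
```

Under these assumptions the two matrices hold 2,048 weights between them, and `mlp` never looks at any other position.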

TRAINING · ADAM + CROSS-ENTROPY

m = β₁·m + (1−β₁)·g        # momentum
v = β₂·v + (1−β₂)·g²       # variance
m̂ = m/(1−β₁ᵗ⁺¹)
v̂ = v/(1−β₂ᵗ⁺¹)            # bias-correct
p −= lr_t · m̂ / (√v̂ + ε)

β₁=0.85 · β₂=0.99 · lr=0.01 with linear decay to zero. Real Adam (Kingma & Ba), not a toy.
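The update rule can be exercised on a single scalar parameter. This sketch uses the hyperparameters quoted above. The `eps` value, the exact decay formula `lr * (1 - t / num_steps)`, and the toy objective f(p) = p² are assumptions for illustration.

```python
import math

def adam_step(p, g, m, v, t, lr=0.01, beta1=0.85, beta2=0.99,
              eps=1e-8, num_steps=1000):
    """One Adam update for a single scalar parameter; t counts from 0."""
    lr_t = lr * (1 - t / num_steps)       # assumed form of the linear decay
    m = beta1 * m + (1 - beta1) * g       # momentum
    v = beta2 * v + (1 - beta2) * g * g   # variance
    m_hat = m / (1 - beta1 ** (t + 1))    # bias correction
    v_hat = v / (1 - beta2 ** (t + 1))
    p = p - lr_t * m_hat / (math.sqrt(v_hat) + eps)
    return p, m, v

# Toy objective f(p) = p^2, whose gradient is 2p.
p, m, v = 1.0, 0.0, 0.0
p_after_first, _, _ = adam_step(p, 2 * p, m, v, t=0)

for t in range(1000):
    p, m, v = adam_step(p, 2 * p, m, v, t)
```

On the first step the bias correction cancels the (1−β) factors, so the parameter moves by almost exactly `lr` regardless of the gradient's size. That normalization is what separates Adam from plain gradient descent.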

06 / 08

Inference · the same loop that drives ChatGPT

Sampling & temperature.

Run the forward pass → get 27 logits → divide by temperature → softmax → draw a token weighted by those probabilities → feed it back. T < 1 sharpens the distribution (safe, repetitive); T > 1 flattens it (diverse, more typos).

FIG.06 — LIVE TEMPERATURE · softmax(logits / T) OVER A SAMPLE VOCAB


probs = softmax( logits / T )
token = random.choices(vocab, weights=probs)[0]  # choices returns a list
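The effect of T is easiest to see by pushing the same logits through three temperatures. The four-token vocab and the logit values below are hypothetical, and `sample` is an illustrative helper name.

```python
import math
import random

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def sample(logits, vocab, T, rng):
    """Divide logits by T, softmax, then draw one token by probability."""
    probs = softmax([l / T for l in logits])
    token = rng.choices(vocab, weights=probs)[0]
    return token, probs

vocab = ["a", "b", "c", "d"]     # hypothetical 4-token vocab
logits = [2.0, 1.0, 0.5, -1.0]   # hypothetical logits
rng = random.Random(0)           # seeded so runs are repeatable

token, probs_cold = sample(logits, vocab, T=0.5, rng=rng)
_, probs_neutral = sample(logits, vocab, T=1.0, rng=rng)
_, probs_hot = sample(logits, vocab, T=2.0, rng=rng)
```

The top token's probability is highest at T = 0.5 and lowest at T = 2.0, while the ranking of the tokens never changes. Temperature reshapes the distribution without reordering it.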


NEXT-TOKEN DISTRIBUTION · same logits, reshaped by T

07 / 08

Same algorithm · different scale

microGPT vs. a frontier LLM.

Side by side, the difference is stark in size and identical in idea. The algorithm is the same; the gap is data, engineering, and post-training.

FIG.07 — THE NUMBERS THAT MAKE IT CLICK

dimension	microGPT	a frontier LLM
lines of code	~300, no deps	millions, many frameworks
parameters	4,192	10¹¹ – 10¹²
vocab	27 chars	~100k learned subwords
config	n_embd 16 · n_head 4 · n_layer 1	thousands × tens × 100+
data	32,033 names	trillions of tokens
training	~1 min · MacBook CPU	months · huge GPU fleets
loss	3.3 → 2.37	—
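The 4,192 figure can be reproduced by adding up weight-matrix shapes. The sketch assumes block_size = 16, an output head separate from the token embedding, and no bias terms. None of these is stated in the table; they are choices that happen to match its total.

```python
vocab_size, n_embd, n_layer = 27, 16, 1
block_size = 16  # assumption: not given in the table above

mlp_hidden = 4 * n_embd
shapes = {
    "token embedding": (vocab_size, n_embd),
    "position embedding": (block_size, n_embd),
    "output head": (vocab_size, n_embd),  # assumption: not tied to the embedding
}
for layer in range(n_layer):
    # Four square projections for attention: query, key, value, output.
    for name in ("attn_q", "attn_k", "attn_v", "attn_out"):
        shapes[f"layer{layer}.{name}"] = (n_embd, n_embd)
    shapes[f"layer{layer}.mlp_fc1"] = (mlp_hidden, n_embd)
    shapes[f"layer{layer}.mlp_fc2"] = (n_embd, mlp_hidden)

total = sum(rows * cols for rows, cols in shapes.values())
print(total)
```

Half of the total sits in the two MLP matrices, which is typical of transformer blocks with a 4× expansion.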

HONEST CAVEAT

microGPT teaches pretraining only. No SFT, no RLHF, no tool use — the things that turn a base model into an assistant. Generating names ≠ language. The point is to see the machine, not to be good.

WHY THIS MATTERS FOR OPENALICE

Context window = block_size. KV cache = why long contexts cost memory. Temperature = Alice's per-chat diversity knob. You can now reason about the stack because you've seen the toy version.
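The memory claim follows from simple counting: each layer caches one key vector and one value vector per token. The sketch below assumes plain multi-head attention where both vectors have total width n_embd, and block_size = 16 for the toy. The larger configuration is hypothetical and exists only to show the linear growth.

```python
def kv_cache_values(n_layer, n_embd, n_tokens):
    """Cached numbers: one key and one value vector per layer per token."""
    return 2 * n_layer * n_embd * n_tokens

# Toy model: 1 layer, width 16, and an assumed 16-token context.
toy = kv_cache_values(n_layer=1, n_embd=16, n_tokens=16)

# Hypothetical large config, not taken from any real model.
big_short = kv_cache_values(n_layer=100, n_embd=8192, n_tokens=1_000)
big_long = kv_cache_values(n_layer=100, n_embd=8192, n_tokens=100_000)
```

The count scales linearly with context length, so a context 100 times longer needs a cache 100 times larger.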

08 / 08 — you made it

You just built an LLM.

An autograd engine. Attention. One block. Adam. Temperature sampling. The same loop, at enormous scale, is what drives ChatGPT — over ~100k subwords instead of 27 chars. You now hold the core algorithm.

01 Neural network from scratch · a neuron, a layer, a loss, backprop ✓ done

03 microGPT · a whole LLM in ~300 lines · same autograd, + attention ✓ complete

04 LLM from scratch · a 10M-param GPT trained on a laptop, end to end next

05 Transformers & attention · the architecture that made it all work locked

