openalicelabs / academy
COURSE ARCH-00 LESSON 00 · 04 TOPIC LLM FROM SCRATCH EST. READ ~13 MIN
OPENALICE LABORATORIES · EDUCATION PATH · ARCHITECTURE 00 · 04

A real GPT, trained on a laptop.

microGPT showed the math is small. This shows the engineering is reachable: a ~10.8M-parameter GPT, written component by component in PyTorch, trained on 1 MB of Shakespeare to coherent text in ~45 minutes — every knob (context, heads, warmup, temperature) implemented by your own hand.

FIG.00 — THE BUILD ARC
FIG.0A — ONE OBJECTIVE · predict the next token, append it, repeat

The whole model is one function: given a window of past tokens, output a probability over the next one. Train that by showing it text and asking it to guess. Sample from it in a loop and it writes.

MODEL: ~10.8M-param char-level GPT (GPT-2 lineage)
CONFIG: n_layer=6 · n_head=6 · n_embd=384 · block=256
DATA: ~1 MB Tiny-Shakespeare · vocab = 65 chars
HARDWARE: one laptop GPU, tried in order MPS → CUDA → CPU
RUN: ~45 min on an M3 Pro · PyTorch + NumPy + tqdm
01 / 06
Part 1 · the bijection · text → integers

Start with a character vocabulary.

A model never sees text — it sees integers. The simplest tokenizer maps every distinct character to an ID. Type below and watch the char-vocab grow, encode, and round-trip exactly.

FIG.01 — LIVE CHAR VOCAB · ENCODE / DECODE
[interactive widget: encode(str) → char IDs, with readouts for unique chars, sequence length, and round-trip check]
text = open(data).read()
chars = sorted(set(text))
vocab_size = len(chars)                      # 65 for Tiny-Shakespeare
stoi = {c: i for i, c in enumerate(chars)}   # char -> id
itos = {i: c for i, c in enumerate(chars)}   # id -> char (needed by decode)
encode = lambda s: [stoi[c] for c in s]
decode = lambda ids: "".join(itos[i] for i in ids)

Why characters and not subword BPE? With 65 characters there are only 65² = 4,225 possible bigrams, so on a ~1 MB corpus each bigram that occurs tends to appear many times: dense statistics a tiny model can actually learn. Swap in GPT-2's 50,257-token BPE and most pairs are too rare to estimate.
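The density argument can be checked by counting. The sketch below uses a made-up stand-in string rather than the real Tiny-Shakespeare file, so the printed counts are illustrative only; the variable names are assumptions, not the workshop's code.

```python
from collections import Counter

# Hypothetical stand-in corpus; the lesson itself uses ~1 MB of Tiny-Shakespeare.
text = "to be, or not to be, that is the question. " * 50

chars = sorted(set(text))
vocab_size = len(chars)
possible_bigrams = vocab_size ** 2            # upper bound on distinct character pairs

# Count every adjacent character pair that actually occurs.
bigram_counts = Counter(zip(text, text[1:]))
pairs_total = len(text) - 1
mean_count = pairs_total / len(bigram_counts)  # average observations per seen bigram

print(f"vocab={vocab_size}, possible bigrams={possible_bigrams}")
print(f"seen bigrams={len(bigram_counts)}, mean count per seen bigram={mean_count:.1f}")
print(f"65-char vocab: {65 ** 2:,} pairs; 50,257-token vocab: {50257 ** 2:,} pairs")
```

The last line shows the scale difference: a few thousand possible pairs for characters against roughly 2.5 billion for the BPE vocabulary.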

The reported gap is stark: BPE training loss stalls around 6.3 on Shakespeare, while char-level reaches ~1.5. The cost is ~3× longer sequences, a trade that pays off only because the data is small.

02 / 06
Part 2 · the transformer · by hand

Stack six identical blocks.

Token IDs become vectors, get a position added, then flow through 6 identical blocks of attention + MLP, each wrapped in pre-norm and a residual highway. Hover the stack to trace where the ~10.8M parameters live.

FIG.02 — GPT FORWARD PASS · ids → logits

SELECTED COMPONENT: token + position embed
APPROX. PARAMETERS: ~123K
# one block: pre-norm + residual
x = x + attn(LayerNorm(x))  # mix across positions
x = x + mlp(LayerNorm(x))   # think per position

Embeddings. Two tables, token (wte, 65×384) and learned absolute position (wpe, 256×384), are summed. The token table is weight-tied to the output head, which forces input and output representations to be consistent (and saves 65×384 ≈ 25K params).
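As a rough check on where the ~10.8M parameters live, the arithmetic can be written out. This is a sketch under one stated assumption: every Linear and LayerNorm carries a bias term (the source does not say either way; dropping biases changes the total only slightly). `block_params` is a helper name invented for this example.

```python
# Config from the lesson: n_layer=6, n_embd=384, block=256, vocab=65
n_layer, n_embd, block_size, vocab_size = 6, 384, 256, 65

def block_params(d, bias=True):
    """Parameters in one transformer block (assumption: biases enabled everywhere)."""
    b = 1 if bias else 0
    attn = d * 3 * d + b * 3 * d      # fused Q, K, V projection
    attn += d * d + b * d             # attention output projection
    mlp = d * 4 * d + b * 4 * d       # Linear(d -> 4d)
    mlp += 4 * d * d + b * d          # Linear(4d -> d)
    norms = 2 * (d + b * d)           # two LayerNorms: weight (+ bias)
    return attn + mlp + norms

wte = vocab_size * n_embd             # token embedding table
wpe = block_size * n_embd             # position embedding table
blocks = n_layer * block_params(n_embd)
final_ln = 2 * n_embd                 # final LayerNorm weight + bias
total = wte + wpe + blocks + final_ln # the head is tied to wte, so it adds nothing

print(f"embeddings: {wte + wpe:,}")
print(f"blocks:     {blocks:,}")
print(f"total:      {total:,}")
```

Almost all of the parameters sit in the six blocks; the two embedding tables together are only about 1% of the model.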

MLP. Per block: Linear(384→1536) → GELU → Linear(1536→384), the canonical 4× expansion. GELU is a smooth gate (no hard zero cutoff) that helps gradients flow.

Pre-norm (LayerNorm before each sublayer, GPT-2 style) stabilizes deep training; the residual highway lets gradients reach early layers undecayed. After 6 blocks: a final LayerNorm, then the tied head → 65 logits per position.
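The pre-norm residual pattern can be sketched on a single vector in plain Python. The learned scale and shift of LayerNorm are omitted, and `toy_sublayer` is a hypothetical stand-in for attention or the MLP, so this illustrates the wiring rather than the workshop's actual module.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one vector to zero mean and unit variance (learned scale/shift omitted)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def pre_norm_residual(x, sublayer):
    """x + sublayer(LayerNorm(x)): the sublayer sees normalized input, the skip path does not."""
    update = sublayer(layer_norm(x))
    return [xi + ui for xi, ui in zip(x, update)]

# Hypothetical toy sublayer standing in for attention or the MLP.
toy_sublayer = lambda h: [0.1 * v for v in h]

x = [1.0, 2.0, 3.0, 4.0]
y = pre_norm_residual(x, toy_sublayer)

# With a sublayer that outputs zeros, the block is exactly the identity.
identity = pre_norm_residual(x, lambda h: [0.0] * len(h))
print(y)
print(identity)
```

The identity case is the point of the residual highway: the skip path carries `x` through untouched, so each block only has to learn an additive correction.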

03 / 06
Inside a block · the mixing operation

Causal attention, visualized.

Attention lets each position read from earlier ones — but only earlier. Position i may attend to 0…i, never the future. Drag the query position and watch the causal mask and softmax weights light up.

FIG.03 — ATTENTION WEIGHTS · ROW = WHO QUERY i READS
attn = softmax((Q @ Kᵀ) / √d)  # d=64
out = attn @ V

One linear layer projects each vector to Q, K, V, then the 384 dims reshape into 6 heads of 64. Each head scores how much query i wants each key j, softmaxes those scores into weights, and mixes the values.

The 1/√d scale keeps dot products from growing with dimension and saturating the softmax. Causality is requested with is_causal=True, which lets PyTorch's scaled_dot_product_attention use a fused kernel (FlashAttention where the backend supports it): faster and more memory-frugal than materializing a triangular mask.

Heads run in parallel on separate 64-dim slices, so each is free to specialize: one might track which vowels follow consonants, another line-break patterns. Their outputs concatenate back to 384 and pass through an output projection.
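For one head, the whole operation fits in a few lines of plain Python. This is a minimal sketch on made-up toy vectors (3 positions, head dimension 2) that restricts each query to keys 0…i explicitly, rather than the fused PyTorch path the lesson uses.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]   # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(Q, K, V):
    """Single-head causal attention over lists of vectors (one vector per position)."""
    d = len(Q[0])
    out, weights = [], []
    for i, q in enumerate(Q):
        # Score query i against keys 0..i only; later keys are never considered.
        scores = [sum(a * b for a, b in zip(q, K[j])) / math.sqrt(d) for j in range(i + 1)]
        w = softmax(scores) + [0.0] * (len(K) - i - 1)   # zero weight on the future
        weights.append(w)
        out.append([sum(w[j] * V[j][c] for j in range(len(V))) for c in range(len(V[0]))])
    return out, weights

# Hypothetical toy inputs: 3 positions, head dimension 2.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out, weights = causal_attention(Q, K, V)

for row in weights:
    print([round(w, 3) for w in row])
```

The printed weight matrix is lower-triangular, and position 0 can only read itself, so its output is exactly its own value vector.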

04 / 06
Part 3 · making it learn stably

Watch the loss curve — and the overfit cliff.

The objective is next-token cross-entropy; the extras (AdamW, grad-clip, warmup→cosine LR) are what make it train, not just exist. Run training and watch val loss fall, then climb again as the model starts memorizing.

FIG.04 — TRAINING CURVE · TRAIN vs VAL LOSS
[interactive readouts: step 0 / 5000 · train loss · val loss · learning rate · best val @ step]
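import math

# LR: linear warmup, then cosine decay
def get_lr(step, max_lr, min_lr, max_steps, warmup=100):
    if step < warmup:
        return max_lr * step / warmup               # ramp linearly from 0 to max_lr
    p = (step - warmup) / (max_steps - warmup)      # decay progress, 0 → 1
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * p))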

AdamW (lr=1e-3, wd=0.01), grad-clip at global norm 1.0 to cap spikes, and warmup→cosine: big steps early to explore, small steps late to refine. Each step sees 64×256 = 16,384 supervised next-char targets — that density is why tiny models learn fast.
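Clipping at a global norm means one shared scale factor for all gradients, not a per-tensor cap. The sketch below shows that idea on plain lists; it follows the same approach as torch.nn.utils.clip_grad_norm_ but is not that function, and the gradient values are made up.

```python
import math

def clip_by_global_norm(grads, max_norm=1.0):
    """Scale all gradients by one shared factor so their combined L2 norm is at most max_norm."""
    global_norm = math.sqrt(sum(g * g for tensor in grads for g in tensor))
    scale = min(1.0, max_norm / (global_norm + 1e-6))   # never scale up, only down
    clipped = [[g * scale for g in tensor] for tensor in grads]
    return clipped, global_norm

# Hypothetical gradients for two parameter tensors, flattened to lists.
grads = [[3.0, 0.0], [0.0, 4.0]]            # combined norm is 5.0
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for t in clipped for g in t))

print(norm_before, round(norm_after, 4))
```

Because every gradient is multiplied by the same factor, the update direction is preserved; only its length is capped.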

The honest punchline: best val loss (~1.57–1.64) lands around step 1,500–2,500. After that the 10.8M model overfits 1 MB and memorizes — the right move is early stopping, not all 5,000 steps. The overfit cliff is itself the lesson in data-vs-parameters.
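Early stopping amounts to remembering the best validation loss and giving up after it fails to improve for a while. The sketch below is one simple way to express that; the `patience` parameter and the loss readings are invented for illustration and are not results from the workshop.

```python
def best_checkpoint(val_losses, eval_interval, patience=3):
    """Return (best_step, best_loss, stop_step) for a sequence of periodic val losses."""
    best_loss, best_step, bad_evals = float("inf"), None, 0
    for i, loss in enumerate(val_losses):
        step = i * eval_interval
        if loss < best_loss:
            # A real loop would save a checkpoint here.
            best_loss, best_step, bad_evals = loss, step, 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return best_step, best_loss, step    # stop: val loss stopped improving
    return best_step, best_loss, None

# Made-up val-loss readings every 500 steps, shaped like an overfitting curve.
val_losses = [4.20, 2.10, 1.75, 1.62, 1.60, 1.64, 1.71, 1.80, 1.92, 2.05]
best_step, best_loss, stop_step = best_checkpoint(val_losses, eval_interval=500)

print(best_step, best_loss, stop_step)
```

With these invented numbers the loop keeps the step-2000 checkpoint and stops at step 3500, well short of the full schedule.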

05 / 06
Part 4 · from predictor to writer

Two dials reshape the output.

A next-token model becomes a writer by sampling in a loop. Temperature and top-k act on the logits before softmax — runtime knobs, no retraining. Slide them and watch the next-char distribution morph.

FIG.05 — NEXT-CHAR DISTRIBUTION · TEMPERATURE & TOP-K
[interactive sliders: TEMP = 0.8 · TOP-K = 10]
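import torch
import torch.nn.functional as F

logits = logits[:, -1, :] / temperature           # last position only, then rescale
if top_k:                                         # keep the k highest logits
    v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
    logits[logits < v[:, -1:]] = float("-inf")    # mask everything below the k-th
probs = F.softmax(logits, dim=-1)
idx = torch.multinomial(probs, num_samples=1)     # sample, not argmax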

Temperature rescales logits. T→0 → greedy & repetitive; T=1 → the raw distribution; T>1 flattens it, lifting rare tokens (creative but incoherent). The workshop's sweet spot is T ≈ 0.7–0.9.
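The effect of the temperature dial is easy to see on a handful of numbers. This sketch uses hypothetical logits and a hand-written softmax; it only illustrates the rescaling, not the workshop's sampling code.

```python
import math

def softmax_with_temperature(logits, temperature):
    scaled = [v / temperature for v in logits]   # divide before softmax
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]     # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0, -1.0]   # hypothetical next-char logits
for T in (0.1, 0.8, 1.0, 1.5):
    probs = softmax_with_temperature(logits, T)
    print(T, [round(p, 3) for p in probs])
```

As T falls, probability mass concentrates on the top logit; as T rises, it spreads toward the tail.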

Top-k masks all but the k highest logits to −∞, truncating the unreliable long tail before sampling (k≈40 for the 65-char vocab). And we sample, not argmax — greedy decoding loops; sampling respects confidence while keeping variety.
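Top-k truncation followed by sampling can be sketched without any tensor library. The logits and the function name `top_k_sample` below are hypothetical, and the seeded generator is there only to make the example repeatable.

```python
import math
import random

def top_k_sample(logits, k, temperature=1.0, rng=random):
    """Sample an index from the k highest logits; everything else gets probability 0."""
    scaled = [v / temperature for v in logits]
    kth = sorted(scaled, reverse=True)[min(k, len(scaled)) - 1]   # k-th largest value
    masked = [v if v >= kth else float("-inf") for v in scaled]
    m = max(masked)
    exps = [math.exp(v - m) for v in masked]                       # exp(-inf) is 0.0
    total = sum(exps)
    probs = [e / total for e in exps]
    idx = rng.choices(range(len(probs)), weights=probs, k=1)[0]    # sample, not argmax
    return idx, probs

logits = [3.0, 2.5, 1.0, 0.2, -1.0, -2.0]   # hypothetical logits over a 6-token vocab
rng = random.Random(0)
idx, probs = top_k_sample(logits, k=3, temperature=0.8, rng=rng)

print(idx, [round(p, 3) for p in probs])
```

Tokens outside the top k end up with probability exactly zero, so the sampler can never pick from the long tail no matter how high the temperature is set.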

FIG.5B — AUTOREGRESSIVE WRITER · predict → append → repeat (toy bigram on Shakespeare stats)
temperature follows the slider above · context cropped to block_size
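The predict → append → repeat loop is the same whether the predictor is a GPT or a lookup table. The sketch below uses a toy bigram counter as a stand-in for the model, with a made-up mini corpus and a hypothetical `block_size` of 8, to show the loop and the context crop.

```python
import random
from collections import Counter, defaultdict

def train_bigram(text):
    """Count next-char frequencies for each char: a toy stand-in for the GPT."""
    table = defaultdict(Counter)
    for a, b in zip(text, text[1:]):
        table[a][b] += 1
    return table

def generate(table, prompt, n_new, block_size=8, rng=random):
    out = list(prompt)
    for _ in range(n_new):
        context = out[-block_size:]              # crop to the model's window
        counts = table[context[-1]]              # a bigram model only uses the last char
        if not counts:                           # dead end: no successor ever observed
            break
        chars, weights = zip(*counts.items())
        nxt = rng.choices(chars, weights=weights, k=1)[0]   # predict (sample)
        out.append(nxt)                                      # append, then repeat
    return "".join(out)

corpus = "to be, or not to be, that is the question"   # hypothetical mini corpus
table = train_bigram(corpus)
sample = generate(table, prompt="t", n_new=40, rng=random.Random(0))
print(sample)
```

Swapping the table lookup for a forward pass over the cropped context gives the real generation loop; nothing else in the structure changes.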
06 / 06
Parts 5–6 · feel the scaling laws

Three sizes — and what's missing.

Part 5 wires it together; Part 6 scales model and data on the same laptop so you feel the curve. The three preset configs come straight from nanoGPT's lineage.

TINY · ~0.5M

2L / 2H / 128D

~5 min. Learns letter shapes and spacing, babbles non-words. The fastest "is my loop even working?" check.

SMALL · ~4M

4L / 4H / 256D

~20 min. Real words and short phrases emerge; the grammar starts to feel Shakespearean.

MEDIUM · ~10M ★

6L / 6H / 384D

~45 min. Coherent (if meaningless) Shakespeare-flavored verse. The default — nanoGPT's char demo config, made literal.

This workshop builds… · …but a 2026 LLM uses · Why the gap matters
Learned absolute positions · RoPE (rotary) · relative positions generalize to longer context
Vanilla LayerNorm · RMSNorm · cheaper, no mean-subtraction needed
GELU MLP · SwiGLU / gated MLP · gating buys quality per parameter
Dense attention · GQA / MQA, sliding-window · shrinks the KV-cache for long contexts
Single-host fp32-ish · mixed precision, FSDP / tensor-parallel · the difference between a laptop and a cluster

IT IS A LEARNING ARTIFACT, NOT A PRODUCT

~10.8M params on 1 MB produces grammatical-looking Shakespeare noise — no semantics, no facts, no instruction-following. Nothing here transfers to "deploy a chatbot." The value isn't the babbler; it's that you implemented every knob by hand.

Architecturally this is a 2019/2022-era GPT-2, not a modern stack — but that's exactly why it's a clear teaching object. Char-level caps the ceiling on purpose; don't read its weakness as a Transformer weakness.

THE HIGHEST-LEVERAGE LESSON

The "real-trainer extras" — AdamW, grad-clip, warmup-cosine, pre-norm, residuals, top-k — are precisely what microGPT keeps minimal. They convert a bare algorithm into something numerically stable that samples coherently. That conversion is the whole point.

Sampling is a runtime dial, not retraining: the same weights are "boring" at T=0.1 or "unhinged" at T=1.5. That is the exact lever an orchestrator — or an LLM council member — tunes per call.

00 · 04 — you made it

You trained an LLM.

Char vocab, the forward pass, causal attention, a stable training loop, sampling, and scale. A working GPT is a laptop-hour away — and you wrote every line. You now understand a model as a system, not a vending machine.

00·02 Math for ML · the vectors, matrices and gradients you actually use ✓ done
00·03 microGPT · the whole algorithm in ~200 dependency-free lines ✓ done
00·04 LLM from Scratch · a real 10M-param GPT, trained end to end ✓ complete
01·01 Attention & Transformers · the operation that changed everything, in depth next
Next · 01 · 01

Attention & Transformers →

You wired attention into a working model. Now slow down and really understand query, key, value — the operation underneath all of it.
