openalicelabs / academy
COURSE LLM-101 · LESSON 01 / 01 · EST. READ ~11 MIN · v0.1
OPENALICE LABORATORIES · EDUCATION PATH · RUNG 01

Every word
looks at
every word.

One architecture ate machine learning. GPT, Claude, Llama — all transformers. Their engine is a single idea: let each token attend to every other token, in parallel, and pull in what it needs. Finish this page and self-attention stops being magic.

FIG.00 — RECURSIVE MOTIF
FIG.0A — A SENTENCE ATTENDING TO ITSELF · EVERY TOKEN, EVERY OTHER TOKEN · ONE STEP

No recurrence, no one-word-at-a-time. All-pairs, all at once. Any token can reach any other in a single hop — distance is free. That is why transformers parallelize on GPUs and why they scaled to the models we have today.

FIG.01 — SCALED DOT-PRODUCT ATTENTION · Q·Kᵀ → ÷√dₖ → SOFTMAX → ·V
QUERY q — "here's what I'm looking for"
KEY k — "here's what I contain / match on"
VALUE v — "here's the info I'll hand over"
CORE softmax(Q·Kᵀ / √dₖ) · V
01 / 07
The idea, before any math

The word "it" looks back.

Read this: "The animal didn't cross the street because it was too tired." When you hit "it", your brain instantly knows it means animal, not street. You looked back and pulled in the relevant word. That act is attention.

FIG.02 — COREFERENCE · "IT" RESOLVES TO "ANIMAL"

Before 2017, sequences were typically read by a recurrent neural network (RNN): word by word, left to right, squeezing everything into one running memory. That design has two fatal flaws:

PROBLEM · 01

Strictly sequential

You can't touch word 50 until words 1–49 are done. That kills GPU parallelism.

PROBLEM · 02

Long-range memory leaks

By word 50, the signal from word 1 has passed through 49 update steps, and most of it has faded.

"Attention Is All You Need" (2017) made a radical bet: throw out recurrence entirely. Let every word look at every word, all at once. The title is the whole thesis.

A useful mental model: a soft dictionary lookup. Each word emits a Query (what it wants), a Key (what it offers), and a Value (the actual information). "it" sends a query that matches the key of "animal", so it pulls in "animal"'s value. That's the entire trick — the rest is making it precise.
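The soft dictionary lookup can be sketched in a few lines of plain Python. This is only an illustration: the 2-dimensional keys, values, and the query for "it" are hand-picked toy numbers (an assumption, not learned weights), and `soft_lookup` is a hypothetical helper name. Unlike a real dictionary, every value contributes a little; the best-matching key simply contributes the most.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtracting the max keeps exp() from overflowing
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def soft_lookup(query, keys, values):
    """Return (weights, blend): every value contributes, scaled by key match."""
    weights = softmax([dot(query, k) for k in keys])
    dim = len(values[0])
    blend = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, blend

# Hand-picked toy vectors, purely illustrative (a real model learns these).
tokens = ["animal", "street", "it"]
keys = [[4.0, 0.0], [0.0, 4.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
query_it = [2.0, 0.0]  # the query from "it" points along the "animal" key

weights, blend = soft_lookup(query_it, keys, values)
best = tokens[weights.index(max(weights))]
print(best, [round(w, 3) for w in weights])
```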

02 / 07
Three roles for every token

Query, Key, Value.

From each token's embedding, three learned matrices project it into three roles. W_Q, W_K, W_V (together with the output projection W_O that appears in multi-head attention) hold everything an attention layer learns.

A token is a vector x ∈ ℝ^d. Multiply it by three weight matrices and you get its query, key and value. In self-attention, Q, K, V all come from the same sequence — the words attend to each other.

Q = X · W_Q  queries
K = X · W_K  keys
V = X · W_V  values
d_model = 512 · PER HEAD = 64 · LEARNED: W_Q, W_K, W_V
FIG.03 — ONE TOKEN PROJECTED INTO Q · K · V
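As a rough sketch of the three projections, the snippet below multiplies a small embedding matrix X by three weight matrices. The sizes are toy values chosen for readability (the lesson's figures are d_model = 512 and 64 per head), the weights are random stand-ins for learned ones, and `matmul` and `random_matrix` are hypothetical helpers written here to avoid any dependency.

```python
import random

def matmul(a, b):
    """Multiply an (n x m) matrix by an (m x p) matrix, both stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def random_matrix(rows, cols, rng, scale=0.1):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
n, d_model, d_k = 4, 8, 2  # toy sizes: 4 tokens, 8-dim embeddings, 2-dim roles

X = random_matrix(n, d_model, rng, scale=1.0)  # one row per token embedding
W_Q = random_matrix(d_model, d_k, rng)         # random stand-ins for learned weights
W_K = random_matrix(d_model, d_k, rng)
W_V = random_matrix(d_model, d_k, rng)

Q = matmul(X, W_Q)  # what each token is looking for
K = matmul(X, W_K)  # what each token offers to match on
V = matmul(X, W_V)  # what each token hands over

print(len(Q), "tokens,", len(Q[0]), "dims per query")
```

All three outputs come from the same X, which is what makes this self-attention rather than attention between two different sequences.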
03 / 07
The centerpiece · self-attention, live

Pick a word. Watch it attend.

This is the whole core equation, drawn. Hover or click a token — it becomes the query. Arcs flow to every other token (the keys), thickness = attention weight, and the output is the weighted blend of all the values. Toggle the heads to see each one specialize.

Attention(Q,K,V) = softmax( Q·Kᵀ / √dₖ ) · V
FIG.04 — SELF-ATTENTION · QUERY → KEYS → WEIGHTED VALUES
Head: pick one, or blend all
Attention weights from the selected query
Hover a token to make it the query. Each bar is softmax(q·kⱼ/√dₖ) — how much of token j's value gets blended into the output. The bars always sum to 1.00.
Full attention matrix (n × n)
Row i is the query token and column j is the key token, so entry (i, j) is how much token i attends to token j. The bright diagonal-ish pattern is each token mixing in its context. Your selected query is the highlighted row.
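The core equation, softmax(Q·Kᵀ / √dₖ) · V, can be written out step by step. The sketch below uses plain Python lists and tiny hand-made Q, K, V matrices (assumed values, chosen only so the numbers are easy to inspect); `attention` returns both the output and the n × n weight matrix that the figure visualizes.

```python
import math

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax(row):
    m = max(row)  # stabilizes exp() without changing the result
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention; returns (output, weights)."""
    d_k = len(K[0])
    scores = matmul(Q, transpose(K))                        # n x n raw similarities
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]
    weights = [softmax(row) for row in scaled]              # each row sums to 1
    return matmul(weights, V), weights                      # weighted blend of values

# Toy inputs: 3 tokens, d_k = 2 (illustrative values only).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

out, weights = attention(Q, K, V)
for row in weights:
    print([round(w, 3) for w in row])
```

Because every output row is a weighted average of the rows of V, each output value stays between the smallest and largest entry of the corresponding V column.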
04 / 07
The one constant that isn't magic

Why divide by √dₖ?

A dot product of two dₖ-dim vectors with independent, zero-mean, unit-variance components has variance ≈ dₖ. With dₖ = 64 the standard deviation is √64 = 8, so raw scores routinely swing by ±8 or more, which is too big for softmax. Drag dₖ below and watch the softmax saturate into a near-hard argmax, where gradients die.

FIG.05 — SOFTMAX, UNSCALED vs ÷√dₖ
var(q·k) ≈ dₖ  →  std ≈ √dₖ
÷ √dₖ  →  var back to ≈ 1
THE FAILURE MODE

Large scores push softmax to a near-hard argmax — one weight ≈ 1, the rest ≈ 0. There the gradient is ≈ 0 and the layer stops learning. Scaling keeps softmax soft and gradients alive.
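Both claims can be checked numerically. The sketch below first estimates the variance of q·k for random unit-variance vectors with dₖ = 64, then compares softmax on a set of hypothetical raw scores (hand-picked to have a spread of about 8) against the same scores divided by √dₖ. It uses the fact that the derivative of a softmax weight p with respect to its own score is p · (1 − p), which shrinks toward 0 as p approaches 1.

```python
import math
import random

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot_product_variance(d_k, trials=4000, seed=0):
    """Empirical variance of q.k for q, k with independent N(0, 1) components."""
    rng = random.Random(seed)
    dots = []
    for _ in range(trials):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        dots.append(sum(a * b for a, b in zip(q, k)))
    mean = sum(dots) / trials
    return sum((x - mean) ** 2 for x in dots) / trials

d_k = 64
var_raw = dot_product_variance(d_k)  # lands near d_k
var_scaled = var_raw / d_k           # dividing scores by sqrt(d_k) divides variance by d_k

raw_scores = [8.0, 0.0, -8.0, 4.0]   # hypothetical unscaled scores
scaled_scores = [s / math.sqrt(d_k) for s in raw_scores]
p_raw, p_scaled = softmax(raw_scores), softmax(scaled_scores)

# Sensitivity of the top weight to its own score: p * (1 - p).
grad_raw = max(p_raw) * (1 - max(p_raw))
grad_scaled = max(p_scaled) * (1 - max(p_scaled))

print(f"variance: raw ~{var_raw:.1f}, scaled ~{var_scaled:.2f}")
print(f"top weight: raw {max(p_raw):.3f}, scaled {max(p_scaled):.3f}")
print(f"gradient:   raw {grad_raw:.3f}, scaled {grad_scaled:.3f}")
```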

05 / 07
Many relationships at once

Multi-head: run it in parallel.

One attention gives one way to relate words. But a token often needs several relationships at once: "it" needs both its referent (animal) and its state (tired). So run h heads in parallel, each with its own W_Q, W_K, W_V, each in a smaller subspace, then concatenate and mix with W_O.

MultiHead(Q,K,V) =
  Concat(head₁,…,head_h) · W_O

head_i = Attention(Q·W_Qⁱ, K·W_Kⁱ, V·W_Vⁱ)
h = 8 HEADS · d_model = 512 · dₖ = dᵥ = d_model / h = 64

Because each head works in a 64-dim slice, 8 heads cost about the same as one full-width head. Empirically they specialize — one tracks syntax, one coreference, one the previous token. Try the head toggles back in the visualizer to feel it.
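A minimal multi-head sketch, under a few assumptions: sizes are toy values (d_model = 8, h = 2), weights are random stand-ins, and the per-head matrices are stored as column slices of one d_model × d_model matrix per role, which is a common layout and equivalent to separate per-head matrices. The last lines check the cost claim at the lesson's sizes: 8 heads of width 64 use exactly as many Q/K/V projection weights as one head of width 512.

```python
import math
import random

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in matmul(Q, K_T)]
    return matmul(weights, V)

def multi_head(X, W_Q, W_K, W_V, W_O, h):
    d_k = len(X[0]) // h
    Q, K, V = matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V)
    heads = []
    for i in range(h):
        cols = slice(i * d_k, (i + 1) * d_k)  # head i's own subspace
        heads.append(attention([r[cols] for r in Q],
                               [r[cols] for r in K],
                               [r[cols] for r in V]))
    # Concat(head_1, ..., head_h): join the heads' outputs token by token.
    concat = [[v for head in heads for v in head[t]] for t in range(len(X))]
    return matmul(concat, W_O)  # W_O mixes information across heads

rng = random.Random(0)

def rand(rows, cols):
    return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]

n, d_model, h = 3, 8, 2
X = rand(n, d_model)
W_Q, W_K, W_V, W_O = (rand(d_model, d_model) for _ in range(4))
out = multi_head(X, W_Q, W_K, W_V, W_O, h)

# Projection weights at the lesson's sizes: 8 narrow heads vs one full-width head.
heads_params = 8 * 3 * 512 * 64
full_width_params = 3 * 512 * 512
print(len(out), len(out[0]), heads_params == full_width_params)
```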

FIG.06 — 8 HEADS · EACH A SUBSPACE · CONCAT → W_O
06 / 07
The real thing

The transformer block.

A single attention layer isn't a transformer. The block wraps it: attention (tokens communicate) → a wide feed-forward net (each token thinks) → each wrapped in a residual + LayerNorm. Stack it N times and you have a model.

COMMUNICATE

Multi-head self-attention

Tokens exchange information — content-based routing. The mixing step.

COMPUTE

Position-wise FFN

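A 2-layer MLP (4× wider, d_ff = 2048) applied to each token alone. It holds roughly two thirds of a block's weights (about 8·d_model² versus 4·d_model² for attention), and it is where most knowledge lives.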

STABILIZE

Residual + LayerNorm

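x + Sublayer(x) with a LayerNorm: applied after the sum in the original paper (post-norm), applied to the sublayer's input in most modern models (pre-norm). A gradient highway so deep stacks actually train.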

FIG.07 — ONE BLOCK · STACKED N TIMES
# one block, pre-norm (modern)
x = x + MultiHeadAttn(LayerNorm(x), mask)
x = x + FFN(LayerNorm(x))

# FFN: the "thinking"
FFN(x) = max(0, x·W₁+b₁) · W₂ + b₂
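The pseudocode above can be turned into a small runnable sketch. Several simplifications are assumed to keep it short: the attention sublayer is a single causal head with identity projections (Q = K = V = its input) standing in for MultiHeadAttn, LayerNorm has no learned gain or bias, and all weights are random. The function names are hypothetical; only the two residual lines in `block` mirror the pseudocode directly.

```python
import math
import random

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def layer_norm(x, eps=1e-5):
    """Normalize each token vector to zero mean and unit variance."""
    out = []
    for row in x:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out

def causal_self_attention(x):
    """Single-head stand-in for MultiHeadAttn, with Q = K = V = x."""
    d = len(x[0])
    out = []
    for i, q in enumerate(x):
        past = x[: i + 1]  # token i sees itself and earlier tokens only
        w = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in past])
        out.append([sum(wj * kj[c] for wj, kj in zip(w, past)) for c in range(d)])
    return out

def ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied to each token independently."""
    hidden = [[max(0.0, v + b) for v, b in zip(row, b1)] for row in matmul(x, W1)]
    return [[v + b for v, b in zip(row, b2)] for row in matmul(hidden, W2)]

def add(a, b):
    return [[u + v for u, v in zip(ra, rb)] for ra, rb in zip(a, b)]

def block(x, W1, b1, W2, b2):
    x = add(x, causal_self_attention(layer_norm(x)))  # communicate
    x = add(x, ffn(layer_norm(x), W1, b1, W2, b2))    # compute
    return x

rng = random.Random(0)

def rand(rows, cols):
    return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]

n, d_model = 4, 6
d_ff = 4 * d_model  # the "4x wider" hidden layer
x = rand(n, d_model)
W1, b1 = rand(d_model, d_ff), [0.0] * d_ff
W2, b2 = rand(d_ff, d_model), [0.0] * d_model
y = block(x, W1, b1, W2, b2)
print(len(y), len(y[0]))
```

The output has the same shape as the input, which is what allows the block to be stacked N times.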
THE THREE USES OF ATTENTION

1. Encoder self-attention — bidirectional.
2. Decoder masked self-attention — causal, can't see the future.
3. Cross-attention — decoder queries, encoder keys/values.

Modern LLMs (GPT, Claude, Llama) are decoder-only: just masked self-attention + FFN, stacked dozens to 100+ times, then a vocab-sized softmax for the next token.

FIG.08 — CAUSAL MASK · A TOKEN MAY ONLY ATTEND TO ITSELF AND THE PAST

For autoregressive generation, predicting word 5 while peeking at word 6 is cheating. So the upper-triangle of the score matrix is set to −∞ before softmax — those weights become 0. This single trick turns a transformer into a left-to-right language model.
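The masking step is small enough to show directly. In this sketch the 3 × 3 score matrix holds made-up numbers; `causal_mask` (a hypothetical helper) replaces every entry above the diagonal with −∞, and because exp(−∞) is 0, softmax assigns those positions exactly zero weight.

```python
import math

NEG_INF = float("-inf")

def softmax(row):
    m = max(row)  # the diagonal entry is always finite, so m is finite
    exps = [math.exp(s - m) for s in row]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

def causal_mask(scores):
    """Keep entry (i, j) only when j <= i; future positions become -inf."""
    n = len(scores)
    return [[scores[i][j] if j <= i else NEG_INF for j in range(n)] for i in range(n)]

# Made-up scores for a 3-token sequence (row = query, column = key).
scores = [
    [0.5, 1.0, 2.0],
    [1.5, 0.2, 0.3],
    [0.1, 0.9, 0.4],
]

weights = [softmax(row) for row in causal_mask(scores)]
for row in weights:
    print([round(w, 3) for w in row])
```

The first token can only attend to itself, so its row collapses to a single weight of 1; the last token still spreads its weight over the whole prefix.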

Quadratic cost. Every token attends to every token → compute and memory scale as O(n²) in length n. Double the context, quadruple the cost — the bottleneck behind FlashAttention and every long-context trick.
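The arithmetic behind "double the context, quadruple the cost" is easy to tabulate. This sketch counts entries in the n × n score matrix for a single layer, assuming the matrix is stored in full with 2 bytes per entry and 8 heads; those figures and the helper names are illustrative assumptions, not measurements of any particular model.

```python
def attention_score_entries(n, heads=1):
    """Entries in the n x n score matrix for one layer, stored in full."""
    return heads * n * n

def attention_score_bytes(n, heads=1, bytes_per_entry=2):
    return attention_score_entries(n, heads) * bytes_per_entry

for n in (1024, 2048, 4096):
    mib = attention_score_bytes(n, heads=8) / 2**20
    print(f"n={n:5d}  entries/head={n * n:>10,d}  8 heads at 2 bytes: {mib:7.1f} MiB")
```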
07 / 07 — you made it

You just built
attention.

Q, K, V. A scaled dot product. A softmax. A weighted sum of values. Multi-head, residual, FFN, mask. That stack — repeated dozens of times — is exactly what lets an LLM predict the next word. You now hold the engine of modern AI.

01 · Attention & Transformers · Q/K/V, scaled dot-product, multi-head, the block · ✓ complete
02 · Tokenization · what a "word" even is to the model · BPE from scratch · next
03 · Positional encoding · how the model knows word order · sinusoids & RoPE · locked
04 · microGPT · assemble a whole LLM from the pieces you now understand · locked
NEXT · LESSON 02
Tokenization
WHAT A "WORD" REALLY IS