One architecture ate machine learning. GPT, Claude, Llama — all transformers. Their engine is a single idea: let each token attend to every other token, in parallel, and pull in what it needs. Finish this page and self-attention stops being magic.
No recurrence, no one-word-at-a-time. All-pairs, all at once. Any token can reach any other in a single hop — distance is free. That is why transformers parallelize on GPUs and why they scaled to the models we have today.
Read this: "The animal didn't cross the street because it was too tired." When you hit "it", your brain instantly knows it means animal, not street. You looked back and pulled in the relevant word. That act is attention.
Before 2017, sequences were read by an RNN — word by word, left to right, squeezing everything into one running memory. Two fatal flaws:
1. It is sequential. You can't touch word 50 until words 1–49 are done. That kills GPU parallelism.
2. It forgets. By word 50, the signal from word 1 has passed through 49 update steps and has mostly faded.
"Attention Is All You Need" (2017) made a radical bet: throw out recurrence entirely. Let every word look at every word, all at once. The title is the whole thesis.
A useful mental model: a soft dictionary lookup. Each word emits a Query (what it wants), a Key (what it offers), and a Value (the actual information). "it" sends a query that matches the key of "animal", so it pulls in "animal"'s value. That's the entire trick — the rest is making it precise.
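The soft lookup can be sketched in a few lines of NumPy. The vectors below are hand-picked toy numbers, not learned values, and the names `keys`, `values` and `query_it` are assumptions made for this illustration.

```python
import numpy as np

def softmax(scores):
    # Subtract the max for numerical stability; the result is unchanged.
    e = np.exp(scores - scores.max())
    return e / e.sum()

# Toy, hand-picked vectors (illustrative assumptions, not learned values).
keys = np.array([
    [4.0, 0.0],   # key offered by "animal"
    [0.0, 4.0],   # key offered by "street"
])
values = np.array([
    [1.0, 0.0],   # information carried by "animal"
    [0.0, 1.0],   # information carried by "street"
])
query_it = np.array([1.0, 0.0])   # what "it" is looking for

weights = softmax(keys @ query_it)   # how well each key matches the query
output = weights @ values            # blend of the values, weighted by the match

print(weights)   # the first weight (for "animal") dominates
print(output)
```

Unlike a hard dictionary lookup, every value contributes a little; the best-matching key simply contributes the most.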
From each token's embedding, three learned matrices project it into three roles. W_Q, W_K, W_V hold everything a single attention head learns; all of its knowledge lives in these weights.
A token is a vector x ∈ ℝ^d. Multiply it by the three weight matrices and you get its query, key and value: q = x·W_Q, k = x·W_K, v = x·W_V. In self-attention, Q, K, V all come from the same sequence, so the words attend to each other.
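As a rough sketch of the projection step, the snippet below stacks the token vectors as rows of a matrix `X` and applies the three matrices at once. The sizes and the random weights are assumptions for illustration; in a trained model W_Q, W_K and W_V are learned.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, d_k = 5, 16, 8   # illustrative sizes, not from any real model

X = rng.normal(size=(n_tokens, d_model))   # one row per token embedding

# Random stand-ins for the learned projection matrices.
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q = X @ W_Q   # what each token wants
K = X @ W_K   # what each token offers
V = X @ W_V   # the information each token carries

print(Q.shape, K.shape, V.shape)
```

All three come from the same `X`, which is what makes this self-attention.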
This is the whole core equation, Attention(Q, K, V) = softmax(QKᵀ / √dₖ)·V, drawn. Hover or click a token and it becomes the query. Arcs flow to every other token (the keys), thickness = attention weight, and the output is the weighted blend of all the values. Toggle the heads to see each one specialize.
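In code, the picture corresponds to scaled dot-product attention, softmax(QKᵀ / √dₖ)·V. The following is a minimal single-head sketch with random inputs; the function name `attention` and the sizes are assumptions for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Row-wise softmax, shifted by the row max for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n, n): every query against every key
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights          # weighted blend of the values

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))

out, weights = attention(Q, K, V)
print(weights.round(2))   # row i holds the arc thicknesses for query token i
```

Row i of `weights` plays the role of the arcs leaving token i in the visualizer.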
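A dot product of two dₖ-dimensional vectors whose components are independent, with zero mean and unit variance, has variance dₖ. With dₖ=64 the standard deviation is √64 = 8, so raw scores routinely swing by ±8 or more, which is too big for softmax. The fix is to divide the scores by √dₖ, which brings the variance back to 1. Drag dₖ below and watch the unscaled softmax saturate into a near-hard argmax, where gradients die.
Large scores push softmax to a near-hard argmax: one weight ≈ 1, the rest ≈ 0. There the gradient is ≈ 0 and the layer stops learning. Dividing by √dₖ keeps softmax soft and gradients alive.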
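A quick numerical check of the variance argument, assuming query and key components are independent standard normal draws (an idealization; real activations need not look like this):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d_k = 64

# 10,000 random query/key pairs with zero-mean, unit-variance components.
q = rng.normal(size=(10_000, d_k))
k = rng.normal(size=(10_000, d_k))
dots = (q * k).sum(axis=1)

raw_std = dots.std()                       # close to sqrt(64) = 8
scaled_std = (dots / np.sqrt(d_k)).std()   # close to 1

# Treat the first 8 dot products as one row of attention scores.
scores = dots[:8]
raw_peak = softmax(scores).max()                      # largest weight, unscaled
scaled_peak = softmax(scores / np.sqrt(d_k)).max()    # largest weight, scaled

print(raw_std, scaled_std)
print(raw_peak, scaled_peak)
```

The unscaled row puts more of its weight on a single key than the scaled row does, which is the saturation the scaling is meant to prevent.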
One attention gives one way to relate words. But "it" needs both what it refers to and its state. So run h heads in parallel, each with its own W_Q, W_K, W_V, each in a smaller subspace — then concatenate and mix with W_O.
With d_model = 512 and h = 8, each head works in a 64-dim slice (512 / 8), so 8 heads cost about the same as one full-width head. Empirically, heads often specialize: one tracks syntax, another coreference, another the previous token. Try the head toggles back in the visualizer to feel it.
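A minimal multi-head sketch follows. It assumes d_model = 512 and 8 heads, and it stores the per-head W_Q, W_K, W_V stacked side by side in one 512×512 matrix each, which is equivalent to keeping 8 separate 512×64 matrices. The weights are random stand-ins and the function name is an assumption for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, n_heads):
    n, d_model = X.shape
    d_head = d_model // n_heads

    def split(M):
        # (n, d_model) -> (n_heads, n, d_head): one slice per head.
        return M.reshape(n, n_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split(X @ W_Q), split(X @ W_K), split(X @ W_V)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)     # (n_heads, n, n)
    weights = softmax(scores, axis=-1)
    heads = weights @ V                                     # (n_heads, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)   # concatenate the heads
    return concat @ W_O                                     # mix them with W_O

rng = np.random.default_rng(0)
n_tokens, d_model, n_heads = 6, 512, 8
X = rng.normal(size=(n_tokens, d_model))
W_Q, W_K, W_V, W_O = (
    rng.normal(size=(d_model, d_model)) / np.sqrt(d_model) for _ in range(4)
)

out = multi_head_attention(X, W_Q, W_K, W_V, W_O, n_heads)
print(out.shape)
```

Each head computes its own 6×6 weight matrix over a 64-dim slice, so the total work in the projections matches a single full-width head.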
A single attention layer isn't a transformer. The block wraps it: attention (tokens communicate) → a wide feed-forward net (each token thinks) → each wrapped in a residual + LayerNorm. Stack it N times and you have a model.
Attention: tokens exchange information through content-based routing. The mixing step.
Feed-forward: a 2-layer MLP (4× wider than the model, d_ff=2048 for d_model=512) applied to each token independently. Much of the model's knowledge is thought to live here.
Residual + LayerNorm: LayerNorm(x + Sublayer(x)). A gradient highway so deep stacks actually train.
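Put together, one block might look like the sketch below. It makes several simplifying assumptions: a single attention head, ReLU in the feed-forward net, LayerNorm without its learned gain and bias, normalization applied after the residual add, and small random weights. The names `transformer_block`, `init` and `params` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalize each token vector; learned gain and bias are omitted here.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def self_attention(x, W_Q, W_K, W_V):
    Q, K, V = x @ W_Q, x @ W_K, x @ W_V
    weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
    return weights @ V

def feed_forward(x, W1, b1, W2, b2):
    # Two linear maps with a ReLU in between, applied to each token alone.
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

def transformer_block(x, p):
    x = layer_norm(x + self_attention(x, p["W_Q"], p["W_K"], p["W_V"]))   # tokens communicate
    x = layer_norm(x + feed_forward(x, p["W1"], p["b1"], p["W2"], p["b2"]))  # each token thinks
    return x

rng = np.random.default_rng(0)
n_tokens, d_model, d_ff = 4, 64, 256   # d_ff is 4x d_model, scaled down for the demo

def init(*shape):
    return rng.normal(size=shape) / np.sqrt(shape[0])

params = {
    "W_Q": init(d_model, d_model), "W_K": init(d_model, d_model),
    "W_V": init(d_model, d_model),
    "W1": init(d_model, d_ff), "b1": np.zeros(d_ff),
    "W2": init(d_ff, d_model), "b2": np.zeros(d_model),
}

x = rng.normal(size=(n_tokens, d_model))
y = transformer_block(x, params)
print(y.shape)
```

Because input and output have the same shape, blocks can be stacked by feeding `y` into the next block.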
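The original encoder-decoder transformer uses attention in three places:
1. Encoder self-attention: bidirectional, every token sees every other token.
2. Decoder masked self-attention: causal, so a token can't see the future.
3. Cross-attention: queries come from the decoder, keys and values from the encoder.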
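Cross-attention differs from self-attention only in where Q, K and V come from. A small sketch, assuming 7 encoder tokens, 3 decoder tokens, and random stand-in weights:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d_model = 16
encoder_out = rng.normal(size=(7, d_model))   # 7 source tokens
decoder_x = rng.normal(size=(3, d_model))     # 3 target tokens so far

# Random stand-ins for learned projections.
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_model)) for _ in range(3))

Q = decoder_x @ W_Q     # queries come from the decoder
K = encoder_out @ W_K   # keys come from the encoder
V = encoder_out @ W_V   # values come from the encoder

weights = softmax(Q @ K.T / np.sqrt(d_model))   # (3, 7): target tokens over source tokens
out = weights @ V                               # (3, 16): one vector per decoder token

print(weights.shape, out.shape)
```

The weight matrix is rectangular here: each decoder token spreads its attention over all encoder tokens.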
Modern LLMs (GPT, Claude, Llama) are decoder-only: just masked self-attention + FFN, stacked dozens to 100+ times, then a vocab-sized softmax for the next token.
For autoregressive generation, predicting word 5 while peeking at word 6 is cheating. So every entry above the diagonal of the score matrix (row i, column j with j > i, where a token would attend to a later one) is set to −∞ before softmax, and those weights become exactly 0. This single trick turns a transformer into a left-to-right language model.
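The mask itself is a few lines. This sketch uses random scores for a 5-token sequence; the variable names are assumptions for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

n = 5
rng = np.random.default_rng(0)
scores = rng.normal(size=(n, n))   # stand-in for Q K^T / sqrt(d_k)

# True strictly above the diagonal: positions where row i would see column j > i.
future = np.triu(np.ones((n, n), dtype=bool), k=1)

masked = np.where(future, -np.inf, scores)   # block the future before softmax
weights = softmax(masked, axis=-1)           # exp(-inf) = 0, so blocked weights vanish

print(weights.round(2))   # lower-triangular: token i only attends to tokens 0..i
```

The diagonal stays unmasked, so every row keeps at least one finite score and the softmax is always well defined; the first token can only attend to itself.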
Q, K, V. A scaled dot product. A softmax. A weighted sum of values. Multi-head, residual, FFN, mask. That stack — repeated dozens of times — is exactly what lets an LLM predict the next word. You now hold the engine of modern AI.