OpenAlice Academy — 01 / Attention & Transformers

01 / 07

The idea, before any math

The word "it" looks back.

Read this: "The animal didn't cross the street because it was too tired." When you hit "it", your brain instantly knows it means animal, not street. You looked back and pulled in the relevant word. That act is attention.

FIG.02 — COREFERENCE · "IT" RESOLVES TO "ANIMAL"

Before 2017, sequences were typically read by a recurrent neural network (RNN): word by word, left to right, squeezing everything into one running memory. Two fatal flaws:

PROBLEM · 01

Strictly sequential

You can't touch word 50 until words 1–49 are done. That kills GPU parallelism.

PROBLEM · 02

Long-range memory leaks

By word 50, the signal from word 1 has passed through 49 update steps — mostly gone.

"Attention Is All You Need" (2017) made a radical bet: throw out recurrence entirely. Let every word look at every word, all at once. The title is the whole thesis.

A useful mental model: a soft dictionary lookup. Each word emits a Query (what it wants), a Key (what it offers), and a Value (the actual information). "it" sends a query that matches the key of "animal", so it pulls in "animal"'s value. That's the entire trick — the rest is making it precise.
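The soft lookup can be sketched in a few lines of plain Python. Everything here is illustrative: the 2-d vectors are made-up toy numbers rather than learned values, and `soft_lookup` is a hypothetical helper name. A hard dictionary returns exactly one value; the soft version returns a blend, weighted by how well the query matches each key.

```python
import math

def soft_lookup(query, keys, values):
    """Blend all values, weighted by a softmax over query-key dot products."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)                          # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    blended = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, blended

# Toy vectors (assumed for illustration): entry 0 is "animal", entry 1 is "street".
keys = [[4.0, 0.0], [0.0, 4.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
query_it = [2.0, 0.0]   # the query from "it" points along the "animal" key

weights, blended = soft_lookup(query_it, keys, values)
print(weights)   # nearly all of the weight lands on "animal"
print(blended)   # so the output is almost exactly "animal"'s value
```

Nothing is ever fully excluded: "street" still contributes a tiny amount, which is what makes the lookup soft and differentiable.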

02 / 07

Three roles for every token

Query, Key, Value.

From each token's embedding, three learned matrices project it into three roles. W_Q, W_K, W_V hold everything a single attention head learns; all of its knowledge lives in these weights.

A token is a vector x ∈ ℝ^d; stacking the n token vectors as rows gives the matrix X. Multiply X by three weight matrices and you get every token's query, key and value. In self-attention, Q, K, V all come from the same sequence, so the words attend to each other.

Q = X · W_Q queries
K = X · W_K keys
V = X · W_V values
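As a rough sketch, the three projections are just three matrix multiplies. The example below assumes NumPy, uses random numbers as stand-ins for both the embeddings and the learned weights, and borrows the sizes quoted in this lesson (d_model = 512, 64 per head).

```python
import numpy as np

rng = np.random.default_rng(0)

n, d_model, d_k = 5, 512, 64           # 5 tokens; sizes quoted in this lesson
X = rng.normal(size=(n, d_model))      # stand-in for the token embeddings, one row per token

# Random stand-ins for the learned projection matrices.
W_Q = rng.normal(size=(d_model, d_k)) / np.sqrt(d_model)
W_K = rng.normal(size=(d_model, d_k)) / np.sqrt(d_model)
W_V = rng.normal(size=(d_model, d_k)) / np.sqrt(d_model)

Q = X @ W_Q   # one query row per token
K = X @ W_K   # one key row per token
V = X @ W_V   # one value row per token

print(Q.shape, K.shape, V.shape)   # (5, 64) (5, 64) (5, 64)
```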

d_model = 512 · per head = 64 · learned: W_Q, W_K, W_V

FIG.03 — ONE TOKEN PROJECTED INTO Q · K · V

03 / 07

The centerpiece · self-attention, live

Pick a word. Watch it attend.

This is the whole core equation, drawn. Hover or click a token — it becomes the query. Arcs flow to every other token (the keys), thickness = attention weight, and the output is the weighted blend of all the values. Toggle the heads to see each one specialize.

Attention(Q,K,V) = softmax( Q·Kᵀ / √dₖ ) · V
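The equation translates almost symbol for symbol into NumPy. This is a minimal single-head sketch with random inputs, not an optimized implementation; the helper names `softmax` and `attention` are chosen here for illustration.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # stability: softmax is shift-invariant
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # (n, n): every query against every key
    weights = softmax(scores, axis=-1)    # each row is a distribution over the keys
    return weights @ V, weights           # weighted blend of the values

rng = np.random.default_rng(0)
n, d_k = 6, 64
Q, K, V = (rng.normal(size=(n, d_k)) for _ in range(3))

out, weights = attention(Q, K, V)
print(out.shape)               # (6, 64)
print(weights.sum(axis=-1))    # every row sums to 1
```

Row i of `weights` is what the bar chart shows when token i is the query.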

FIG.04 — SELF-ATTENTION · QUERY → KEYS → WEIGHTED VALUES

Hover a token to make it the query. Each bar is one softmax weight, the softmax over j of q·kⱼ/√dₖ: how much of token j's value gets blended into the output. The bars always sum to 1.00.

Full attention matrix (n × n)

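Row i is query i, column j is key j; entry (i, j) is how much token i attends to token j, and each row sums to 1. The bright, roughly diagonal pattern shows each token drawing context from itself and its neighbors. Your selected query is the highlighted row.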

04 / 07

The one constant that isn't magic

Why divide by √dₖ?

A dot product of two dₖ-dimensional vectors whose components are independent with zero mean and unit variance has variance dₖ. With dₖ = 64 the standard deviation is 8, so scores routinely swing by ±8 or more: too big. Drag dₖ below and watch the softmax saturate into a near-hard argmax, where gradients die.

FIG.05 — SOFTMAX, UNSCALED vs ÷√dₖ

var(q·k) ≈ dₖ → std ≈ √dₖ
÷ √dₖ → var back to ≈ 1
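You can check the variance claim empirically. The sketch below uses only the standard library and assumes the idealized setting from the text: independent components with zero mean and unit variance. The sample count is an arbitrary choice, so expect the estimates to land near 64 and 1 rather than exactly on them.

```python
import random
import statistics

random.seed(0)
d_k = 64
num_samples = 4000   # arbitrary; more samples give a tighter estimate

def random_dot(d):
    """Dot product of two random d-dim vectors with N(0, 1) components."""
    return sum(random.gauss(0.0, 1.0) * random.gauss(0.0, 1.0) for _ in range(d))

raw = [random_dot(d_k) for _ in range(num_samples)]
scaled = [s / d_k ** 0.5 for s in raw]     # the ÷ √dₖ step

var_raw = statistics.pvariance(raw)        # should come out close to d_k
var_scaled = statistics.pvariance(scaled)  # should come out close to 1
print(round(var_raw, 1), round(var_scaled, 2))
```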

THE FAILURE MODE

Large scores push softmax to a near-hard argmax — one weight ≈ 1, the rest ≈ 0. There the gradient is ≈ 0 and the layer stops learning. Scaling keeps softmax soft and gradients alive.
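To see the failure mode in numbers, compare one softmax with and without the scaling. The four scores below are hand-picked illustrative values on the ±8 scale discussed above. The quantity pᵢ(1 − pᵢ) is the diagonal of the softmax Jacobian, and no off-diagonal entry is larger in magnitude, so its maximum bounds how much gradient can flow back through the softmax.

```python
import math

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def max_self_gradient(p):
    # d p_i / d z_i = p_i * (1 - p_i), the largest-magnitude Jacobian entries
    return max(p_i * (1.0 - p_i) for p_i in p)

d_k = 64
raw_scores = [8.0, -3.0, 1.0, -6.0]    # illustrative unscaled scores

unscaled = softmax(raw_scores)
scaled = softmax([s / math.sqrt(d_k) for s in raw_scores])

print(max(unscaled), max_self_gradient(unscaled))   # near one-hot, gradient near 0
print(max(scaled), max_self_gradient(scaled))       # still soft, healthy gradient
```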

05 / 07

Many relationships at once

Multi-head: run it in parallel.

One attention head gives one way to relate words. But "it" needs both what it refers to (the animal) and its state (too tired). So run h heads in parallel, each with its own W_Q, W_K, W_V, each in a smaller subspace, then concatenate and mix with W_O.

MultiHead(Q,K,V) = Concat(head₁,…,head_h) · W_O

head_i = Attention(Q·W_Qⁱ, K·W_Kⁱ, V·W_Vⁱ)
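Here is one way the two formulas might look in NumPy for self-attention, where Q, K and V all come from the same X. The weights are random stand-ins, the per-head matrices are stored as stacked arrays of shape (h, d_model, dₖ) purely for convenience, and the Python loop over heads is for readability; real implementations batch the heads into one tensor operation.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V

def multi_head(X, W_Q, W_K, W_V, W_O):
    """W_Q, W_K, W_V: (h, d_model, d_k) stacks; W_O: (h * d_k, d_model)."""
    heads = [attention(X @ wq, X @ wk, X @ wv)          # head_i, shape (n, d_k)
             for wq, wk, wv in zip(W_Q, W_K, W_V)]
    return np.concatenate(heads, axis=-1) @ W_O          # Concat(...) · W_O

rng = np.random.default_rng(0)
n, d_model, h = 4, 512, 8
d_k = d_model // h                                        # 64

X = rng.normal(size=(n, d_model))
W_Q, W_K, W_V = (rng.normal(size=(h, d_model, d_k)) / np.sqrt(d_model)
                 for _ in range(3))
W_O = rng.normal(size=(h * d_k, d_model)) / np.sqrt(d_model)

out = multi_head(X, W_Q, W_K, W_V, W_O)
print(out.shape)   # (4, 512)
```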

h = 8 heads · d_model = 512 · dₖ = dᵥ = 64

Because each head works in a 64-dim slice (512 / 8), 8 heads cost about the same as one full-width head. Empirically, heads often specialize: some track syntax, some coreference, some simply the previous token. Try the head toggles back in the visualizer to feel it.
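The "about the same cost" claim is easy to verify with back-of-the-envelope arithmetic. The sketch counts only the Q/K/V projection weights and the multiply-adds for the score matrix; the sequence length n = 1000 is an arbitrary example value.

```python
d_model, h = 512, 8
d_k = d_model // h   # 64

# Projection weights (W_Q, W_K, W_V): h narrow heads vs one full-width head.
multi_head_params = 3 * h * d_model * d_k
single_head_params = 3 * d_model * d_model

# Multiply-adds for the score matrix Q·Kᵀ over n tokens.
n = 1000
multi_head_score_macs = h * n * n * d_k
single_head_score_macs = n * n * d_model

print(multi_head_params, single_head_params)   # 786432 786432
print(multi_head_score_macs == single_head_score_macs)   # True
```

Splitting the width across heads rearranges the work rather than adding to it, because h · dₖ = d_model.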

FIG.06 — 8 HEADS · EACH A SUBSPACE · CONCAT → W_O

06 / 07

The real thing

The transformer block.

A single attention layer isn't a transformer. The block wraps it: attention (tokens communicate), then a wide feed-forward net (each token thinks), with each of the two wrapped in a residual connection plus LayerNorm. Stack it N times and you have a model.

COMMUNICATE

Multi-head self-attention

Tokens exchange information — content-based routing. The mixing step.

COMPUTE

Position-wise FFN

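A 2-layer MLP (hidden width 4× d_model, d_ff = 2048) applied to each token independently. It holds most of a block's parameters and is widely thought to store much of the model's factual knowledge.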

STABILIZE

Residual + LayerNorm

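x + Sublayer(x), with LayerNorm applied either after the sum (post-norm, as in the original paper) or to the sublayer's input (pre-norm, the modern default). A gradient highway so deep stacks actually train.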

FIG.07 — ONE BLOCK · STACKED N TIMES

# one block, pre-norm (modern)
x = x + MultiHeadAttn(LayerNorm(x), mask)
x = x + FFN(LayerNorm(x))

# FFN: the "thinking" step
FFN(x) = max(0, x·W₁+b₁) · W₂ + b₂
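The pseudocode above can be turned into a small runnable sketch. Several simplifications are assumed here to keep it short: a single attention head stands in for MultiHeadAttn (so there is no W_O), LayerNorm has no learned gain or bias, the sizes are toy values, and all weights are random. The `params` dictionary and function names are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token vector; the learned gain and bias are omitted here.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x, W_Q, W_K, W_V, mask=None):
    Q, K, V = x @ W_Q, x @ W_K, x @ W_V
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)   # blocked positions get weight 0
    return softmax(scores) @ V

def ffn(x, W1, b1, W2, b2):
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2  # max(0, x·W1+b1)·W2 + b2

def block(x, p, mask=None):
    x = x + self_attention(layer_norm(x), p["W_Q"], p["W_K"], p["W_V"], mask)
    x = x + ffn(layer_norm(x), p["W1"], p["b1"], p["W2"], p["b2"])
    return x

rng = np.random.default_rng(0)
n, d_model, d_ff = 4, 16, 64          # toy sizes; d_ff is 4x d_model as in the lesson
params = {
    "W_Q": rng.normal(size=(d_model, d_model)) * 0.1,
    "W_K": rng.normal(size=(d_model, d_model)) * 0.1,
    "W_V": rng.normal(size=(d_model, d_model)) * 0.1,
    "W1": rng.normal(size=(d_model, d_ff)) * 0.1,
    "b1": np.zeros(d_ff),
    "W2": rng.normal(size=(d_ff, d_model)) * 0.1,
    "b2": np.zeros(d_model),
}
causal = np.tril(np.ones((n, n), dtype=bool))   # token i may see tokens 0..i

x = rng.normal(size=(n, d_model))
y = block(x, params, mask=causal)
print(y.shape)   # (4, 16): same shape in and out, so blocks can be stacked
```

Because the output has the same shape as the input, stacking N blocks is just calling `block` N times with N different parameter sets.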

THE THREE USES OF ATTENTION

1. Encoder self-attention — bidirectional.
2. Decoder masked self-attention — causal, can't see the future.
3. Cross-attention — decoder queries, encoder keys/values.

Modern LLMs (GPT, Claude, Llama) are decoder-only: just masked self-attention + FFN, stacked dozens to 100+ times, then a vocab-sized softmax for the next token.

FIG.08 — CAUSAL MASK · A TOKEN MAY ONLY ATTEND TO ITSELF AND THE PAST

For autoregressive generation, predicting word 5 while peeking at word 6 is cheating. So every entry above the diagonal of the score matrix (the strict upper triangle) is set to −∞ before softmax, which makes those weights exactly 0. This single trick turns a transformer into a left-to-right language model.
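A minimal NumPy sketch of the mask, using random numbers as stand-in scores for a 5-token sequence:

```python
import numpy as np

n = 5
rng = np.random.default_rng(0)
scores = rng.normal(size=(n, n))     # stand-in for Q·Kᵀ / √dₖ

# True where attention is allowed: column j <= row i (itself and the past).
allowed = np.tril(np.ones((n, n), dtype=bool))
masked = np.where(allowed, scores, -np.inf)

# Row-wise softmax; exp(-inf) is exactly 0, so future positions drop out.
masked = masked - masked.max(axis=-1, keepdims=True)
weights = np.exp(masked)
weights = weights / weights.sum(axis=-1, keepdims=True)

print(np.round(weights, 2))   # lower-triangular: zeros above the diagonal
```

The first token can only attend to itself, so its row is a single 1 followed by zeros.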

Quadratic cost. Every token attends to every token, so compute and memory for the score matrix scale as O(n²) in sequence length n. Double the context, quadruple the cost. This is the bottleneck that FlashAttention and most long-context tricks attack.
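A quick calculation makes the scaling concrete. The numbers below count one n × n score matrix for one head in one layer, and the 2 bytes per entry is an assumption (16-bit floats); real totals multiply by the number of heads and layers.

```python
def score_matrix_entries(n):
    """Number of query-key pairs for a sequence of n tokens."""
    return n * n

def score_matrix_bytes(n, bytes_per_entry=2):   # 2 bytes assumes 16-bit floats
    return score_matrix_entries(n) * bytes_per_entry

for n in (1_024, 2_048, 4_096):
    print(n, score_matrix_entries(n), score_matrix_bytes(n))

ratio = score_matrix_entries(2_048) / score_matrix_entries(1_024)
print(ratio)   # 4.0: double the context, quadruple the cost
```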

07 / 07 — you made it

You just built attention.

Q, K, V. A scaled dot product. A softmax. A weighted sum of values. Multi-head, residual, FFN, mask. That stack — repeated dozens of times — is exactly what lets an LLM predict the next word. You now hold the engine of modern AI.

01 Attention & Transformers · Q/K/V, scaled dot-product, multi-head, the block (complete)

02 Tokenization · what a "word" even is to the model · BPE from scratch (next)

03 Positional encoding · how the model knows word order · sinusoids & RoPE (locked)

04 microGPT · assemble a whole LLM from the pieces you now understand (locked)


openalicelabs