Attention is order-blind. Shuffle the words and the math gives the exact same answer — it sees a bag, not a sequence. Positional encoding is the fix: a way to stamp each token with where it is, so "dog bites man" stops meaning the same thing as "man bites dog". The modern stamp is RoPE — and it's just rotation.
loading…
Take a token's query/key vector and split it into 2D pairs. For a token at position m, spin each pair by an angle m·θ. Position 0 doesn't turn; position 5 turns five times as far. The meaning is in the vector; the position is in the angle.
Self-attention is a weighted sum over all tokens. A sum doesn't care about order — shuffle the inputs and you just shuffle the outputs, the values are identical. Press shuffle and watch the attention output for the word "it" stay byte-for-byte the same no matter where the words sit.
The dot products, the softmax weights, the value blend — none of them reference a token's index. So "dog bites man" and "man bites dog" produce the identical set of token representations. For a language model, that's a catastrophe.
We have to inject order from the outside. The only question left is how — and that question is this entire lesson.
"Attention Is All You Need" gave every position its own pattern by stacking sine and cosine waves at many frequencies — fast waves for the low dimensions, slow ones for the high. Each position gets a unique, smoothly-varying fingerprint. Drag the position and read one off.
The leftmost dimensions oscillate fast — they flip between adjacent positions, encoding fine local order. The rightmost dimensions barely move across the whole sequence — they encode coarse, long-range position. Together: a binary-clock-like code where every position is distinct.
It's fixed, not learned — no parameters — and the wavelengths form a geometric progression from 2π up to about 10000·2π. This vector is then simply added to the token's embedding before the first attention layer.
Adding a position vector has a wart: it mixes "what a token means" with "where it is" inside the same numbers, and it encodes absolute position — yet language mostly cares about relative distance ("the adjective two words back"). RoPE's insight: don't add anything. Rotate the query and key by an angle proportional to position.
A fixed vector is added to the embedding. Simple, but absolute, and it entangles content with position. Struggles to extrapolate past the trained length.
The vector's length (its meaning) is untouched; only its angle changes with position. Applied to Q and K, so it shows up only inside attention.
Rotate the query at m and the key at n, take their dot product — the angles subtract, leaving a function of m − n only.
A 2D rotation by angle mθ is the matrix [cos −sin; sin cos]. It never scales — the norm is identical before and after. So RoPE cannot distort how strongly two tokens match on content; it only adds a clean, position-dependent phase.
That length-preservation is exactly why you can apply it to Q and K without retraining the rest of the network into chaos. Meaning in, meaning out — plus a where.
Here is RoPE itself. A vector is split into 2D pairs; pair k spins at its own frequency θₖ = 10000^(−2k/d). Drag the position and every dial turns by position × θₖ — fast dials race, slow dials crawl. That bouquet of angles is the position code.
Watch the dials: the first pair (high frequency) sweeps all the way around as you move just a few steps — it resolves nearby order. The last pair barely twitches over the whole window — it carries long-range position.
Same geometric spread of frequencies as the sinusoids — but now applied as a rotation of Q and K instead of an addition to the embedding. Hit Walk the sequence to feel position scrolling through the dials.
This is the payoff. Rotate query q at position m and key k at position n, then dot them. The two rotations combine into one rotation by (m − n) — the absolute positions cancel. Drag both tokens and watch the score depend only on the gap between them.
Slide both sliders so the gap stays the same — say m=2,n=8 then m=10,n=16 — and the score barely moves: it's pinned to m − n = −6, not to where the pair sits in the sequence. The dot at the curve's marker is exactly that score.
This is relative position for free — no extra parameters, no bias table. And because a far-apart pair has spun the high-frequency dials many times, distant tokens naturally decorrelate: a soft, built-in distance decay.
RoPE is one of several ways to teach a transformer about order. They trade off extrapolation, cost, and simplicity.
| Scheme | How position enters | Abs / Rel | Extrapolates? | Used by |
|---|---|---|---|---|
| Sinusoidal | fixed waves added to embedding | absolute | weakly | orig. Transformer (2017) |
| Learned absolute | a trainable vector per position | absolute | no — hard cap at trained length | BERT, GPT-2 |
| ALiBi | linear distance penalty on scores | relative | yes, gracefully | BLOOM, MPT |
| RoPE | rotate Q and K by m·θ | relative (via rotation) | yes — with scaling tricks | LLaMA, Mistral, Qwen, DeepSeek, Gemma |
A model trained to position 2048 has never seen the angles for position 8000. The fast dials are at wildly unfamiliar rotations, so attention degrades. RoPE extrapolates better than learned tables — but not for free.
The fixes are clever angle hacks: NTK / linear scaling and YaRN stretch or re-base the frequencies so a model trained short can be served long. This is the engine behind almost every "128k context" headline you've read.
Position isn't a layer you can swap at will — the base frequency 10000 is baked in at pre-training. RoPE touches only Q and K, never V, so values stay pure content. And it adds essentially zero parameters: it's geometry, not weights.
Position is still unsolved-feeling: sinusoids → learned → relative → ALiBi → RoPE → YaRN, and length-extrapolation remains finicky. The field hasn't found the last word — it's found the current best one.
Attention is order-blind. Sinusoids stamped position by addition; RoPE stamps it by rotation — preserving meaning, encoding relative distance for free, and extrapolating further. You now know exactly why "dog bites man" can mean something to a transformer at all. You hold the model's sense of where.
Now that every token knows where it sits, watch them all look at each other at once. Q, K, V, softmax — the whole engine.