openalicelabs / academy
COURSE ARCH-01 LESSON 01 · 04 TOPIC POSITIONAL · RoPE EST. READ ~12 MIN
OPENALICE LABORATORIES · EDUCATION PATH · ARCHITECTURE 01 · 04

Where each token sits.

Attention is order-blind. Shuffle the words and the math gives the exact same answer — it sees a bag, not a sequence. Positional encoding is the fix: a way to stamp each token with where it is, so "dog bites man" stops meaning the same thing as "man bites dog". The modern stamp is RoPE — and it's just rotation.

FIG.00 — ROTATE BY POSITION
FIG.0A — THE WHOLE IDEA · a token's vector is ROTATED by an angle = position × frequency

Take a token's query/key vector and split it into 2D pairs. For a token at position m, spin each pair by an angle m·θ. Position 0 doesn't turn; position 5 turns five times as far. The meaning is in the vector; the position is in the angle.

PROBLEM: self-attention is permutation-equivariant (order-blind)
2017 FIX: ADD a fixed sinusoid to each embedding
MODERN FIX: RoPE, which ROTATES Q and K by position
KEY PROPERTY: q·k ends up depending on RELATIVE distance (m − n)
USED BY: LLaMA, GPT-NeoX, Mistral, Qwen, DeepSeek, Gemma
01 / 06
The problem · permutation equivariance

Attention sees a bag, not a line.

Self-attention is a weighted sum over all tokens. A sum doesn't care about order: shuffle the inputs and you just shuffle the outputs, while the values stay identical. Press shuffle and watch the attention output for the word "it" stay the same no matter where the words sit.

FIG.01 — SHUFFLE THE SENTENCE · ATTENTION OUTPUT UNCHANGED
SEQUENCE (no positions) — the query token is highlighted

ATTENTION OUTPUT FOR "it" · ‖z‖
VECTOR z (rounded)
CHANGED BY SHUFFLE?
Attention(Q,K,V) = softmax( Q·Kᵀ / √d ) · V   // a weighted SUM over every token
// permute the tokens P …
Attn(PX) = P · Attn(X)   // same set, just reordered

The dot products, the softmax weights, the value blend — none of them reference a token's index. So "dog bites man" and "man bites dog" produce the identical set of token representations. For a language model, that's a catastrophe.
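The shuffle experiment can be reproduced in a few lines. This is a minimal NumPy sketch, not the lesson's own figure code: the function name `self_attention`, the random weights, and the sizes are all assumptions chosen for illustration. It checks that permuting the input rows permutes the output rows and changes nothing else.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention with no positional information."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V

rng = np.random.default_rng(0)
n_tokens, d = 5, 8                      # illustrative sizes (assumption)
X = rng.normal(size=(n_tokens, d))      # stand-in token embeddings
Wq, Wk, Wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

perm = rng.permutation(n_tokens)        # a random shuffle of the tokens
out = self_attention(X, Wq, Wk, Wv)
out_shuffled = self_attention(X[perm], Wq, Wk, Wv)

# Attn(PX) == P · Attn(X): same rows, just reordered
equivariant = np.allclose(out_shuffled, out[perm])
print(equivariant)
```

The comparison uses `np.allclose` rather than exact equality because floating-point sums can differ in the last bits when the summation order changes.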

We have to inject order from the outside. The only question left is how — and that question is this entire lesson.

02 / 06
The 2017 answer · fixed sinusoids

The original fix: a fingerprint of waves.

"Attention Is All You Need" gave every position its own pattern by stacking sine and cosine waves at many frequencies — fast waves for the low dimensions, slow ones for the high. Each position gets a unique, smoothly-varying fingerprint. Drag the position and read one off.

FIG.02 — THE POSITIONAL-ENCODING MATRIX · pos × dimension
pos 0 pos 63
SAMPLED ROW — PE(pos = 6) across dimensions
PE(pos, 2i)   = sin( pos / 10000^(2i/d) )
PE(pos, 2i+1) = cos( pos / 10000^(2i/d) )

The leftmost dimensions oscillate fast — they flip between adjacent positions, encoding fine local order. The rightmost dimensions barely move across the whole sequence — they encode coarse, long-range position. Together: a binary-clock-like code where every position is distinct.

It's fixed, not learned (no parameters), and the wavelengths form a geometric progression from 2π up to about 10000·2π. This vector is then simply added to the token's embedding before the first attention layer.
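The two PE formulas translate almost directly into NumPy. The sketch below assumes an even model dimension `d`; the function name `sinusoidal_pe` and the 64 × 16 size are illustrative choices, not something the lesson specifies.

```python
import numpy as np

def sinusoidal_pe(n_positions, d, base=10000.0):
    """Build the (n_positions, d) sinusoidal table; d is assumed even."""
    pos = np.arange(n_positions)[:, None]   # column of positions
    i = np.arange(d // 2)[None, :]          # one index per sin/cos pair
    angles = pos / base ** (2 * i / d)      # pos / 10000^(2i/d)
    pe = np.empty((n_positions, d))
    pe[:, 0::2] = np.sin(angles)            # even dimensions
    pe[:, 1::2] = np.cos(angles)            # odd dimensions
    return pe

pe = sinusoidal_pe(64, 16)
row = pe[6]                                 # the fingerprint of position 6
print(np.round(row, 3))
```

In the additive scheme the table is used as `embeddings + pe[:n_tokens]`, where `embeddings` is the (hypothetical) matrix of token embeddings for a sequence of `n_tokens` tokens.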

03 / 06
The shift in thinking · absolute → relative

Stop adding. Start rotating.

Adding a position vector has a wart: it mixes "what a token means" with "where it is" inside the same numbers, and it encodes absolute position — yet language mostly cares about relative distance ("the adjective two words back"). RoPE's insight: don't add anything. Rotate the query and key by an angle proportional to position.

ADDITIVE (2017)

position bolted on

A fixed vector is added to the embedding. Simple, but absolute, and it entangles content with position. Struggles to extrapolate past the trained length.

ROTARY · RoPE ★

position in the angle

The vector's length (its meaning) is untouched; only its angle changes with position. Applied to Q and K, so it shows up only inside attention.

THE PAYOFF

relative for free

Rotate the query at m and the key at n, take their dot product — the angles subtract, leaving a function of m − n only.

FIG.03 — A ROTATION PRESERVES LENGTH · only the direction turns

A 2D rotation by angle φ is the matrix [cos φ −sin φ; sin φ cos φ]. It never scales: the norm is identical before and after. So RoPE cannot distort how strongly two tokens match on content; it only adds a clean, position-dependent phase.

That length-preservation is exactly why you can apply it to Q and K without throwing the rest of the network into chaos. Meaning in, meaning out, plus a where.
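The length-preservation claim is easy to check numerically. In this small sketch the helper `rotation`, the test vector, and the angle 1.234 are arbitrary illustrative choices.

```python
import numpy as np

def rotation(phi):
    """2D rotation matrix for angle phi (radians)."""
    return np.array([[np.cos(phi), -np.sin(phi)],
                     [np.sin(phi),  np.cos(phi)]])

x = np.array([3.0, 4.0])            # a pair with norm 5
x_rot = rotation(1.234) @ x         # same pair, turned by 1.234 rad

norm_before = np.linalg.norm(x)
norm_after = np.linalg.norm(x_rot)
print(norm_before, norm_after)      # equal up to rounding
```

The underlying reason is that the matrix is orthogonal: `rotation(phi).T @ rotation(phi)` is the identity for any `phi`.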

04 / 06
The core mechanism · live rotary dials

Spin the RoPE dial.

Here is RoPE itself. A vector is split into 2D pairs; pair k spins at its own frequency θₖ = 10000^(−2k/d). Drag the position and every dial turns by position × θₖ — fast dials race, slow dials crawl. That bouquet of angles is the position code.

FIG.04 — d/2 ROTARY DIALS · each pair spins at its own rate
m = 0 m = 32
position m = 0
// rotate pair k of vector x at position m
θₖ = 10000^(−2k/d)      // the dial's speed
angle = m · θₖ          // turns with position
[x'₀] = [cos(angle)  −sin(angle)] [x₀]
[x'₁]   [sin(angle)   cos(angle)] [x₁]

Watch the dials: the first pair (high frequency) sweeps all the way around as you move just a few steps — it resolves nearby order. The last pair barely twitches over the whole window — it carries long-range position.

Same geometric spread of frequencies as the sinusoids, but now applied as a rotation of Q and K instead of an addition to the embedding. Hit "Walk the sequence" to feel position scrolling through the dials.
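The dials can be written as a short function. This is a sketch under stated assumptions: it pairs adjacent dimensions (x₀, x₁), (x₂, x₃), and so on, whereas some implementations pair the first half of the vector with the second half; the names `rope_frequencies` and `apply_rope` are made up for this example.

```python
import numpy as np

def rope_frequencies(d, base=10000.0):
    """theta_k = base^(-2k/d) for k = 0 .. d/2 - 1 (d assumed even)."""
    k = np.arange(d // 2)
    return base ** (-2.0 * k / d)

def apply_rope(x, m, base=10000.0):
    """Rotate each adjacent pair of x by the angle m * theta_k."""
    angle = m * rope_frequencies(x.shape[-1], base)
    cos, sin = np.cos(angle), np.sin(angle)
    x0, x1 = x[0::2], x[1::2]               # first / second element of each pair
    out = np.empty_like(x, dtype=float)
    out[0::2] = x0 * cos - x1 * sin
    out[1::2] = x0 * sin + x1 * cos
    return out

x = np.array([1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0])
print(rope_frequencies(8))                  # dial speeds: 1, 0.1, 0.01, 0.001
print(np.round(apply_rope(x, m=5), 3))      # fast pair has turned far, slow pair barely
```

With d = 8 the frequencies fall on powers of ten, which makes the fast-versus-slow spread easy to read off: at position 5 the first pair has turned 5 radians and the last only 0.005.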

05 / 06
Why it's clever · the relative-distance property

The score only knows how far apart.

This is the payoff. Rotate query q at position m and key k at position n, then dot them. The two rotations combine into one rotation by (m − n) — the absolute positions cancel. Drag both tokens and watch the score depend only on the gap between them.

FIG.05 — q(m)·k(n) DEPENDS ONLY ON m − n
query m = 4
key n = 10

ABSOLUTE POSITIONS: m = 4 · n = 10
RELATIVE DISTANCE m − n: −6
ATTENTION SCORE q·k
⟨R(m)q , R(n)k⟩ = ⟨q , R(n−m)k⟩
// absolute m, n cancel → only the gap survives

Slide both sliders so the gap stays the same, say m=2, n=8 then m=10, n=16, and the score does not change (beyond floating-point rounding): it's pinned to m − n = −6, not to where the pair sits in the sequence. The dot at the curve's marker is exactly that score.

This is relative position for free — no extra parameters, no bias table. And because a far-apart pair has spun the high-frequency dials many times, distant tokens naturally decorrelate: a soft, built-in distance decay.
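The relative-distance property can be verified with random vectors. This sketch treats each 2D pair as one complex number, so the rotation becomes a multiplication by exp(i·m·θₖ); that is an equivalent way of writing the same rotation, chosen here for brevity. The helper names `rope` and `score` and the vector size 16 are assumptions for illustration.

```python
import numpy as np

def rope(x, m, base=10000.0):
    """Rotate adjacent pairs of x by m * theta_k, via complex multiplication."""
    d = x.shape[-1]
    theta = base ** (-2.0 * np.arange(d // 2) / d)
    z = x[0::2] + 1j * x[1::2]          # each pair as a complex number
    z = z * np.exp(1j * m * theta)      # turn pair k by m * theta_k
    out = np.empty(d)
    out[0::2], out[1::2] = z.real, z.imag
    return out

def score(q, k, m, n):
    """Attention logit between a query at position m and a key at position n."""
    return float(rope(q, m) @ rope(k, n))

rng = np.random.default_rng(0)
q, k = rng.normal(size=16), rng.normal(size=16)

s_a = score(q, k, 2, 8)      # gap m - n = -6
s_b = score(q, k, 10, 16)    # same gap, shifted 8 positions along
s_c = score(q, k, 2, 9)      # gap m - n = -7
print(s_a, s_b, s_c)         # first two agree, third differs
```

Shifting both positions by the same amount multiplies the query and the key by the same phase, which cancels in the dot product; only the difference of the two angles is left.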

06 / 06
The family · and the open problems

The family, and why context length is hard.

RoPE is one of several ways to teach a transformer about order. They trade off extrapolation, cost, and simplicity.

Scheme | How position enters | Abs / Rel | Extrapolates? | Used by
Sinusoidal | fixed waves added to embedding | absolute | weakly | orig. Transformer (2017)
Learned absolute | a trainable vector per position | absolute | no (hard cap at trained length) | BERT, GPT-2
ALiBi | linear distance penalty on scores | relative | yes, gracefully | BLOOM, MPT
RoPE | rotate Q and K by m·θ | relative (via rotation) | yes, with scaling tricks | LLaMA, Mistral, Qwen, DeepSeek, Gemma
WHY CAN'T IT JUST READ A LONGER BOOK?

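A model trained to position 2048 has never seen the angles for position 8000. The fast dials have wrapped around many times during training, but the slow dials never complete a full turn inside the training window, so at long range they land at rotations the model has never seen, and attention degrades. RoPE extrapolates better than learned tables, but not for free.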

The fixes are clever angle hacks: linear scaling (position interpolation), NTK-aware scaling, and YaRN stretch or re-base the frequencies so a model trained short can be served long. This is the engine behind many of the "128k context" headlines you've read.
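Two of these hacks fit in a few lines. The sketch below is an illustration under assumptions, not any particular model's implementation: linear scaling divides positions by the length ratio, and the NTK-aware variant uses the commonly quoted re-basing rule base' = base · scale^(d/(d−2)). YaRN blends per-frequency adjustments and is not shown. The sizes (d = 64, 2048 → 8192) are arbitrary.

```python
import numpy as np

def rope_theta(d, base=10000.0):
    """Per-pair frequencies theta_k = base^(-2k/d)."""
    return base ** (-2.0 * np.arange(d // 2) / d)

d = 64
trained_len, target_len = 2048, 8192
scale = target_len / trained_len            # serve 4x longer than trained
positions = np.arange(target_len)

theta = rope_theta(d)
plain_angles = np.outer(positions, theta)   # unscaled: exceeds the trained range

# Linear scaling: squeeze positions so every angle stays inside the trained range
linear_angles = np.outer(positions / scale, theta)

# NTK-aware scaling: raise the base so slow dials stretch and fast dials stay put
ntk_base = 10000.0 * scale ** (d / (d - 2))
ntk_theta = rope_theta(d, base=ntk_base)
ntk_angles = np.outer(positions, ntk_theta)

print(ntk_theta[0] / theta[0])              # fastest dial: ratio 1
print(ntk_theta[-1] / theta[-1])            # slowest dial: ratio 1/scale
```

The design difference is visible in the two printed ratios: linear scaling slows every dial by the same factor, while re-basing leaves the fastest dial (fine local order) untouched and slows only the low-frequency end.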

THE STICKY TRUTHS

Position isn't a layer you can swap at will: the base frequency (10000 in the original formulation) is baked in at pre-training. RoPE touches only Q and K, never V, so values stay pure content. And it adds no parameters: it's geometry, not weights.
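To make "only Q and K, never V" concrete, here is a minimal single-head attention with RoPE, again assuming the adjacent-pair convention. The names `rope_rows` and `rope_attention`, the random weights, and the sizes are illustrative assumptions. It also shows that order now matters: reversing the tokens no longer just reorders the outputs.

```python
import numpy as np

def rope_rows(X, base=10000.0):
    """Rotate row m of X as position m (adjacent pairs, even width)."""
    n, d = X.shape
    theta = base ** (-2.0 * np.arange(d // 2) / d)
    angle = np.arange(n)[:, None] * theta[None, :]
    cos, sin = np.cos(angle), np.sin(angle)
    out = np.empty_like(X)
    out[:, 0::2] = X[:, 0::2] * cos - X[:, 1::2] * sin
    out[:, 1::2] = X[:, 0::2] * sin + X[:, 1::2] * cos
    return out

def rope_attention(X, Wq, Wk, Wv):
    Q = rope_rows(X @ Wq)                   # position enters here...
    K = rope_rows(X @ Wk)                   # ...and here
    V = X @ Wv                              # values are left unrotated
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) / np.sqrt(8) for _ in range(3))

perm = np.array([4, 3, 2, 1, 0])            # reverse the sentence
out = rope_attention(X, Wq, Wk, Wv)
out_shuffled = rope_attention(X[perm], Wq, Wk, Wv)
order_matters = not np.allclose(out_shuffled, out[perm])
print(order_matters)
```

Note that `rope_rows` has no trainable arrays: the only inputs are the base and the row index, which is why the scheme adds no weights.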

Position is still unsolved-feeling: sinusoids → learned → relative → ALiBi → RoPE → YaRN, and length-extrapolation remains finicky. The field hasn't found the last word — it's found the current best one.

01 · 04 — you made it

You can read a position.

Attention is order-blind. Sinusoids stamped position by addition; RoPE stamps it by rotation — preserving meaning, encoding relative distance for free, and extrapolating further. You now know exactly why "dog bites man" can mean something to a transformer at all. You hold the model's sense of where.

Next · 01 · 05

Attention & the Transformer →

Now that every token knows where it sits, watch them all look at each other at once. Q, K, V, softmax — the whole engine.
