OpenAlice Academy — 01 · 04 / Positional Encoding (RoPE)

01 / 06

The problem · permutation equivariance

Attention sees a bag, not a line.

Self-attention is a weighted sum over all tokens, and a sum doesn't care about order: shuffle the inputs and you just shuffle the outputs; the values are identical. Press shuffle and watch the attention output for the word "it" stay the same (up to floating-point rounding) no matter where the words sit.

FIG.01 — SHUFFLE THE SENTENCE · ATTENTION OUTPUT UNCHANGED

Interactive panel: the sequence (no positions) with the query token highlighted. Readouts: the attention output for "it" (‖z‖ and the rounded vector z), and whether the shuffle changed it.

Attention(Q,K,V) = softmax( Q·Kᵀ / √d ) · V    // a weighted SUM over every token
// permute the tokens P …
Attn(PX) = P · Attn(X)    // same set, just reordered

The dot products, the softmax weights, the value blend — none of them reference a token's index. So "dog bites man" and "man bites dog" produce the identical set of token representations. For a language model, that's a catastrophe.
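The claim Attn(PX) = P · Attn(X) can be checked numerically. The following is a minimal NumPy sketch, not the lesson's own widget code: the single-head `self_attention` helper, the random weights `Wq`, `Wk`, `Wv`, and the sizes are all assumptions chosen for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # Single-head attention with no positional information at all.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V

rng = np.random.default_rng(0)
n_tokens, d = 5, 8                      # hypothetical sizes
X = rng.normal(size=(n_tokens, d))      # stand-in token embeddings
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

perm = rng.permutation(n_tokens)        # the shuffle P
out = self_attention(X, Wq, Wk, Wv)
out_shuffled = self_attention(X[perm], Wq, Wk, Wv)

# Shuffling the inputs only shuffles the output rows in the same way.
equivariant = np.allclose(out_shuffled, out[perm])
print(equivariant)
```

The comparison uses `np.allclose` rather than exact equality because floating-point sums can differ in the last bits when the order of terms changes.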

We have to inject order from the outside. The only question left is how — and that question is this entire lesson.

02 / 06

The 2017 answer · fixed sinusoids

The original fix: a fingerprint of waves.

"Attention Is All You Need" gave every position its own pattern by stacking sine and cosine waves at many frequencies — fast waves for the low dimensions, slow ones for the high. Each position gets a unique, smoothly-varying fingerprint. Drag the position and read one off.

FIG.02 — THE POSITIONAL-ENCODING MATRIX · pos × dimension

Slider: pos 0 to pos 63. Sampled row: PE(pos = 6) across dimensions.

PE(pos, 2i)   = sin( pos / 10000^(2i/d) )
PE(pos, 2i+1) = cos( pos / 10000^(2i/d) )

The leftmost dimensions oscillate fast — they flip between adjacent positions, encoding fine local order. The rightmost dimensions barely move across the whole sequence — they encode coarse, long-range position. Together: a binary-clock-like code where every position is distinct.

It's fixed, not learned — no parameters — and the wavelengths form a geometric progression from 2π up to about 10000·2π. This vector is then simply added to the token's embedding before the first attention layer.
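As a rough illustration of the two formulas above, here is a small NumPy sketch that builds the whole pos × dimension matrix. The function name `sinusoidal_pe`, the sizes (64 positions, 16 dimensions), and the zero-valued `token_embeddings` are assumptions made for this example only.

```python
import numpy as np

def sinusoidal_pe(n_positions, d, base=10000.0):
    # One row per position, one column per embedding dimension.
    pos = np.arange(n_positions)[:, None]      # shape (n_positions, 1)
    i = np.arange(d // 2)[None, :]             # shape (1, d/2)
    angles = pos / base ** (2 * i / d)         # pos / 10000^(2i/d)
    pe = np.zeros((n_positions, d))
    pe[:, 0::2] = np.sin(angles)               # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)               # odd dimensions: cosine
    return pe

pe = sinusoidal_pe(64, 16)

# The encoding is added to the embeddings before the first attention layer.
token_embeddings = np.zeros((10, 16))          # hypothetical stand-in embeddings
x = token_embeddings + pe[:10]
```

Row `pe[6]` corresponds to the sampled row in the figure: its first entries change quickly from one position to the next, while its last entries stay close to their values at position 0.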

03 / 06

The shift in thinking · absolute → relative

Stop adding. Start rotating.

Adding a position vector has a wart: it mixes "what a token means" with "where it is" inside the same numbers, and it encodes absolute position — yet language mostly cares about relative distance ("the adjective two words back"). RoPE's insight: don't add anything. Rotate the query and key by an angle proportional to position.

ADDITIVE (2017)

position bolted on

A fixed vector is added to the embedding. Simple, but absolute, and it entangles content with position. Struggles to extrapolate past the trained length.

ROTARY · RoPE ★

position in the angle

The vector's length (its meaning) is untouched; only its angle changes with position. Applied to Q and K, so it shows up only inside attention.

THE PAYOFF

relative for free

Rotate the query at m and the key at n, take their dot product — the angles subtract, leaving a function of m − n only.

FIG.03 — A ROTATION PRESERVES LENGTH · only the direction turns

A 2D rotation by angle mθ is the matrix [cos mθ −sin mθ; sin mθ cos mθ]. It never scales: the norm is identical before and after. So RoPE never rescales a query or a key; it only adds a clean, position-dependent phase to each pair.

That length-preservation is exactly why you can apply it to Q and K without retraining the rest of the network into chaos. Meaning in, meaning out — plus a where.
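The length-preservation claim can be verified by hand for a single pair. In the short derivation below, φ is shorthand introduced here for the angle mθ, and (x₀, x₁) is one 2D pair of the vector:

```latex
\begin{aligned}
\begin{bmatrix} x'_0 \\ x'_1 \end{bmatrix}
  &= \begin{bmatrix} \cos\varphi & -\sin\varphi \\ \sin\varphi & \cos\varphi \end{bmatrix}
     \begin{bmatrix} x_0 \\ x_1 \end{bmatrix} \\[4pt]
{x'_0}^2 + {x'_1}^2
  &= (x_0\cos\varphi - x_1\sin\varphi)^2 + (x_0\sin\varphi + x_1\cos\varphi)^2 \\
  &= (x_0^2 + x_1^2)(\cos^2\varphi + \sin^2\varphi) \\
  &= x_0^2 + x_1^2
\end{aligned}
```

The cross terms ∓2·x₀·x₁·cos φ·sin φ cancel, so the squared length of every pair, and therefore of the whole vector, is unchanged.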

04 / 06

The core mechanism · live rotary dials

Spin the RoPE dial.

Here is RoPE itself. A vector is split into 2D pairs; pair k spins at its own frequency θₖ = 10000^(−2k/d). Drag the position and every dial turns by position × θₖ — fast dials race, slow dials crawl. That bouquet of angles is the position code.

FIG.04 — d/2 ROTARY DIALS · each pair spins at its own rate

Slider: position m from 0 to 32 (shown at m = 0).

// rotate pair k of vector x at position m
θₖ = 10000^(−2k/d)     // the dial's speed
angle = m · θₖ         // turns with position
[x'₀]   [cos(angle)  −sin(angle)] [x₀]
[x'₁] = [sin(angle)   cos(angle)] [x₁]

Watch the dials: the first pair (high frequency) sweeps all the way around as you move just a few steps — it resolves nearby order. The last pair barely twitches over the whole window — it carries long-range position.

Same geometric spread of frequencies as the sinusoids — but now applied as a rotation of Q and K instead of an addition to the embedding. Hit Walk the sequence to feel position scrolling through the dials.
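The dial mechanism can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not a production implementation: the helper names `rope_angles` and `apply_rope` are made up here, and the code pairs adjacent dimensions (0,1), (2,3), and so on, whereas some implementations instead pair dimension j with j + d/2.

```python
import numpy as np

def rope_angles(position, d, base=10000.0):
    # One angle per 2D pair: angle_k = position * theta_k.
    k = np.arange(d // 2)
    theta = base ** (-2.0 * k / d)     # theta_k = 10000^(-2k/d)
    return position * theta

def apply_rope(x, position, base=10000.0):
    # Rotate each adjacent pair (x[2k], x[2k+1]) by its own angle.
    ang = rope_angles(position, x.shape[-1], base)
    cos, sin = np.cos(ang), np.sin(ang)
    x0, x1 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x0 * cos - x1 * sin
    out[..., 1::2] = x0 * sin + x1 * cos
    return out

# Every dial starts pointing along its first axis, as at position m = 0.
x = np.array([1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0])
rotated = apply_rope(x, position=3)
dial_angles = rope_angles(3, x.shape[-1])
```

With d = 8, the first dial has turned by 3 radians at position 3, while the last has turned by only a few thousandths of a radian. In a transformer this rotation would be applied to each query and key vector (per head) just before the dot product.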

05 / 06

Why it's clever · the relative-distance property

The score only knows how far apart.

This is the payoff. Rotate query q at position m and key k at position n, then dot them. The two rotations combine into a single rotation whose angle depends only on the gap m − n; the absolute positions cancel. Drag both tokens and watch the score depend only on the gap between them.

FIG.05 — q(m)·k(n) DEPENDS ONLY ON m − n

Sliders: query m = 4, key n = 10. Readouts: absolute positions m = 4 · n = 10; relative distance m − n = −6; attention score q·k.

<R(m)q , R(n)k> = <q , R(n−m)k> // absolute m, n cancel → only the gap survives

Slide both sliders so the gap stays the same (say m=2, n=8, then m=10, n=16) and the score does not move, apart from floating-point rounding: it's pinned to m − n = −6, not to where the pair sits in the sequence. The marker on the curve is exactly that score.

This is relative position for free — no extra parameters, no bias table. And because a far-apart pair has spun the high-frequency dials many times, distant tokens naturally decorrelate: a soft, built-in distance decay.
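The relative-distance property is easy to test numerically. The sketch below repeats a compact, hypothetical `apply_rope` helper (adjacent-pair convention, base 10000) so that it runs on its own; the random q and k and the dimension 16 are arbitrary choices for illustration.

```python
import numpy as np

def apply_rope(x, position, base=10000.0):
    # Rotate each adjacent pair (x[2k], x[2k+1]) by position * theta_k.
    d = x.shape[-1]
    ang = position * base ** (-2.0 * np.arange(d // 2) / d)
    cos, sin = np.cos(ang), np.sin(ang)
    out = np.empty_like(x)
    out[..., 0::2] = x[..., 0::2] * cos - x[..., 1::2] * sin
    out[..., 1::2] = x[..., 0::2] * sin + x[..., 1::2] * cos
    return out

rng = np.random.default_rng(1)
q, k = rng.normal(size=16), rng.normal(size=16)   # arbitrary query and key

def score(m, n):
    # Attention logit between the query at position m and the key at position n.
    return float(apply_rope(q, m) @ apply_rope(k, n))

s_a = score(2, 8)       # gap m - n = -6
s_b = score(10, 16)     # same gap, shifted by 8 positions
s_c = score(4, 10)      # same gap again
s_other = score(2, 9)   # different gap, m - n = -7

same_gap_same_score = np.isclose(s_a, s_b) and np.isclose(s_a, s_c)
print(same_gap_same_score)
```

The three scores with gap −6 agree up to rounding, while the score for gap −7 is generally different.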

06 / 06

The family · and the open problems

The family, and why context length is hard.

RoPE is one of several ways to teach a transformer about order. They trade off extrapolation, cost, and simplicity.

Scheme	How position enters	Abs / Rel	Extrapolates?	Used by
Sinusoidal	fixed waves added to embedding	absolute	weakly	orig. Transformer (2017)
Learned absolute	a trainable vector per position	absolute	no — hard cap at trained length	BERT, GPT-2
ALiBi	linear distance penalty on scores	relative	yes, gracefully	BLOOM, MPT
RoPE	rotate Q and K by m·θ	relative (via rotation)	yes — with scaling tricks	LLaMA, Mistral, Qwen, DeepSeek, Gemma

WHY CAN'T IT JUST READ A LONGER BOOK?

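A model trained to position 2048 has never seen the angles for position 8000. The fast dials have already wrapped around many times during training, so every angle is familiar to them; it is the slow dials that land at rotations they never reached, so attention degrades. RoPE extrapolates better than learned tables, but not for free.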

The fixes are clever angle hacks: linear scaling (position interpolation), NTK-aware scaling, and YaRN stretch or re-base the frequencies so a model trained short can be served long. This is the engine behind many of the "128k context" headlines you've read.
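To make the angle hacks concrete, here is a hedged sketch of the two simplest ones. The head dimension `d = 64`, the target length 8192, and all variable names are assumptions for illustration. The NTK-aware rule shown, base · s^(d/(d−2)), is the commonly cited form and should be treated as an assumption rather than any particular model's recipe. YaRN is more involved and is not implemented here.

```python
import numpy as np

d = 64                                  # hypothetical head dimension
train_len, target_len = 2048, 8192      # trained short, served long
scale = target_len / train_len          # s = 4.0

def thetas(d, base=10000.0):
    # theta_k = base^(-2k/d) for each of the d/2 dials.
    return base ** (-2.0 * np.arange(d // 2) / d)

theta = thetas(d)
max_seen = (train_len - 1) * theta      # largest angle each dial reached in training
raw_8000 = 8000 * theta                 # angles at position 8000 with no fix

# Dials that completed a full turn during training have seen every angle.
wrapped = max_seen >= 2 * np.pi

# Linear scaling (position interpolation): squeeze positions by s.
linear_8000 = (8000 / scale) * theta
in_range_linear = bool(np.all(linear_8000 <= max_seen))

# NTK-aware scaling (assumed rule): enlarge the base instead of the positions.
ntk_base = 10000.0 * scale ** (d / (d - 2))
ntk_8000 = 8000 * thetas(d, ntk_base)

print(wrapped[0], wrapped[-1], in_range_linear)
```

Under this rule the fastest dial keeps its original speed while the slowest is slowed by exactly the factor s, so fine local order is left alone and only the long-range dials are stretched.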

THE STICKY TRUTHS

Position isn't a layer you can swap at will — the base frequency 10000 is baked in at pre-training. RoPE touches only Q and K, never V, so values stay pure content. And it adds essentially zero parameters: it's geometry, not weights.

Position is still unsolved-feeling: sinusoids → learned → relative → ALiBi → RoPE → YaRN, and length-extrapolation remains finicky. The field hasn't found the last word — it's found the current best one.

01 · 04 — you made it

You can read a position.

Attention is order-blind. Sinusoids stamped position by addition; RoPE stamps it by rotation — preserving meaning, encoding relative distance for free, and extrapolating further. You now know exactly why "dog bites man" can mean something to a transformer at all. You hold the model's sense of where.
