openalicelabs / academy
COURSE ARCH-01 LESSON 01 · 02 TOPIC TOKENIZATION EST. READ ~11 MIN
OPENALICE LABORATORIES · EDUCATION PATH · ARCHITECTURE 01 · 02

How text becomes tokens.

A transformer never reads text. It reads a sequence of integers, each one indexing a learned vector. Tokenization is the deterministic, non-neural function that turns a string into those integers, and a surprising share of "the model is dumb" moments trace straight back to it.

FIG.00 — STRING → IDS
FIG.0A — THE ONLY TWO FUNCTIONS · encode(str) → [int] · decode([int]) → str

A string flows in. It is chopped into subword pieces — common words stay whole, rare words split into reusable parts — and each piece becomes a fixed integer ID. That integer is all the network ever sees.

INPUT: a Unicode string, in any language and any script
OUTPUT: a list of integer token IDs
VOCABULARY: ~50k to 200k fixed symbols
RULE OF THUMB: 1 token ≈ 4 chars ≈ 0.75 English words
TRAINED BY: frequency counting, NOT gradient descent
01 / 07
The interface · two functions

A tokenizer is just encode and decode.

The whole contract fits on one line each. Type below and watch any string become tokens and integer IDs — then watch the IDs reconstruct the string, exactly.

FIG.01 — LIVE ENCODE / DECODE ROUND-TRIP

encode(str) → [int]   // text → IDs the model reads
decode([int]) → str   // IDs → text, exactly reversible

After the model emits token IDs, the same tokenizer maps them back to text. The only interesting question in this whole field is: what should the tokens be?

Notice the asymmetry: decode is a trivial lookup-and-concatenate. encode is an iterative greedy reduction — the part with all the cleverness.
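The two-function contract can be sketched with the simplest possible tokenizer: one token per UTF-8 byte and no merges at all. This is an illustrative baseline, not the lesson's tokenizer; the function bodies are assumptions chosen only to show the round trip.

```python
def encode(text: str) -> list[int]:
    """Map a string to integer IDs. Here every UTF-8 byte is its own token."""
    return list(text.encode("utf-8"))


def decode(ids: list[int]) -> str:
    """Map IDs back to text by concatenating each token's bytes."""
    return bytes(ids).decode("utf-8")


ids = encode("héllo")
print(ids)          # the accented letter takes two IDs because it is two bytes
print(decode(ids))  # round-trips to the original string
```

A real BPE tokenizer keeps decode this simple (look up bytes, concatenate) and puts all the extra work into encode.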

02 / 07
Why subwords · the goldilocks zone

Per-word and per-char are both traps.

There are two obvious ways to chop text, per word and per character, and both are bad. Subword tokenization is the pragmatic middle that modern LLM tokenizers have settled on. Drag the granularity and watch the cost flip.

PER WORD

vocabulary explodes

Millions of words across languages, and any unseen word (a typo, a new name) becomes a single <UNK> hole the model can't represent.

SUBWORD ★

the goldilocks zone

Common words stay whole; rare words split into reusable pieces. Small fixed vocab, zero out-of-vocabulary failures, short-ish sequences.

PER CHAR / BYTE

sequences explode

Vocab is tiny and nothing is unknown — but sequences get enormous. Attention is quadratic in length, so you burn context budget and compute.

FIG.02 — SAME SENTENCE · THREE GRANULARITIES · COST FLIPS
Slider: PER CHAR ↔ SUBWORD ↔ PER WORD · readouts: tokens in sequence, vocabulary size, unknown-word risk

Slide to the left and the vocabulary shrinks toward ~100 symbols but the sequence balloons. Slide right and sequences shorten but the vocabulary explodes and unknown-word holes appear.

The middle keeps a small, fixed vocabulary, short sequences, and zero unknowns. That is the entire reason subword tokenization exists.
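The trade-off is easy to measure at the two extremes. The sketch below uses a whitespace split as a stand-in for a word-level tokenizer (an assumption, real word tokenizers handle punctuation differently) and counts how many pairwise attention scores each sequence length implies.

```python
sentence = "Tokenization turns text into integers."

per_word = sentence.split()                # whitespace split stands in for word-level
per_char = list(sentence)                  # one token per character
per_byte = list(sentence.encode("utf-8"))  # one token per UTF-8 byte

for name, toks in [("word", per_word), ("char", per_char), ("byte", per_byte)]:
    n = len(toks)
    # Full self-attention scores every token against every token: n * n entries.
    print(f"{name:>4}: {n:3d} tokens, {n * n:5d} attention scores")
```

Going from words to characters makes this sentence several times longer, and the attention cost grows with the square of that factor.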

03 / 07
The core algorithm · training the merges

Watch Byte-Pair Encoding learn.

BPE training is one loop: count adjacent pairs, find the most frequent one, mint a new token for it, replace every occurrence. Repeat. Press Merge → and watch the top pair fuse, live, on a tiny corpus.

FIG.03 — LIVE BPE TRAINING · TINY CORPUS
CORPUS (word · frequency) — current tokens highlighted
// one BPE training step
1. count every adjacent (A,B) pair × word-freq
2. pick the most frequent pair
3. mint token AB, give it the next free ID
4. replace all A B → AB
repeat until vocab is full

The list of merges, in the order learned, is the model — a deterministic priority list, not a search. IDs 0–255 are raw bytes; learned merges start at 256.
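The training loop above can be sketched in a few lines. The corpus, the function name train_bpe, and the tie-breaking rule (first pair seen wins) are assumptions made for illustration; the figure's own corpus is not reproduced here.

```python
from collections import Counter


def train_bpe(word_freqs: dict[str, int], num_merges: int):
    """Learn BPE merges; returns [((A, B), new_id), ...] in the order learned."""
    # Each word starts as a tuple of raw byte IDs (0-255).
    words = {tuple(w.encode("utf-8")): f for w, f in word_freqs.items()}
    merges = []
    next_id = 256  # first free ID after the 256 raw bytes
    for _ in range(num_merges):
        # 1. count every adjacent pair, weighted by word frequency
        pairs = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        # 2. pick the most frequent pair (ties: first pair seen)
        best = max(pairs, key=pairs.get)
        # 3. mint a new token for it
        merges.append((best, next_id))
        # 4. replace every occurrence of A B with the new ID
        new_words = {}
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(next_id)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            key = tuple(out)
            new_words[key] = new_words.get(key, 0) + freq
        words = new_words
        next_id += 1
    return merges


# Hypothetical tiny corpus: word -> frequency.
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
for (a, b), new_id in train_bpe(corpus, 3):
    print(f"merge ({a}, {b}) -> {new_id}")
```

On this corpus the first merge fuses the bytes for "e" and "s", and the second fuses that new token with "t", so "est" becomes a single token after two steps.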

04 / 07
The GPT-2 / GPT-4 / tiktoken trick

Run BPE over raw bytes, not characters.

What's your base vocabulary when the corpus is all of the internet — every emoji, every script? GPT-2's answer: the 256 byte values. Universal, finite, nothing is ever unknown.

WHY BYTES WIN

Base vocabulary is exactly 256. Any string in any language is guaranteed representable — worst case, byte-by-byte. IDs 0–255 are raw bytes; 256+ are learned merges; special tokens sit above the vocab.

This is why a single emoji or a rare CJK character can cost several tokens: it's 3–4 UTF-8 bytes, and if those byte-pairs weren't frequent enough to earn a merge, they stay split — drawn here as dashed-ink byte chips.
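The byte cost of a character can be checked directly with Python's built-in UTF-8 encoder. The sample characters are arbitrary picks; the point is only that a byte-level tokenizer starts from these byte counts before any merges apply.

```python
samples = ["a", "é", "日", "🍣"]

# Number of UTF-8 bytes per character: the worst-case token cost with no merges.
byte_cost = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch in samples:
    print(repr(ch), byte_cost[ch], "byte(s):", ch.encode("utf-8").hex(" "))
```

ASCII costs one byte, accented Latin letters two, most CJK characters three, and most emoji four.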

FIG.04 — UTF-8 BYTES → BPE MERGES · type non-ASCII
Sample input: hello café 🍣 sushi 日本語 naïve
Readouts: characters, UTF-8 bytes, resulting tokens · raw UTF-8 bytes shown in hex (solid = merged, dashed = lone byte)

05 / 07
Pre-tokenization · the regex split

The split GPT-4 fixed.

Before BPE runs, a regex chops text into chunks — and merges may never cross a chunk boundary. GPT-4's pattern fixed several GPT-2 warts. Type and compare the two splits side by side.

FIG.05 — PRE-TOKENIZATION SPLIT · GPT-2 vs GPT-4 (cl100k)
Panels: GPT-2 split · GPT-4 (cl100k) split, each with a chunk count
Sample inputs: HOW'S 2017 don't stop numbers whitespace

CONTRACTIONS

case-insensitive

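GPT-2 only matched lowercase contractions ('s, 't, 're and so on), so HOW'S split badly. GPT-4's pattern matches them case-insensitively. Curly Unicode apostrophes are still not covered by the contraction rule in either pattern.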

NUMBERS

capped at 1–3 digits

No bespoke token for 2017 or 8675309. A deliberate band-aid for the arithmetic problem — long numbers chunk consistently.

WHITESPACE

better for code

Runs of spaces and newlines split more cleanly, so indented source code tokenizes sanely instead of entangling with the next word.
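The three fixes can be seen with a simplified version of each split pattern. These are ASCII-only approximations written for this sketch: the published patterns use the Unicode classes \p{L} and \p{N}, which need the third-party regex package, so [A-Za-z] and [0-9] are swapped in here to stay within the standard library. The names GPT2_ASCII, GPT4_ASCII and split are hypothetical.

```python
import re

# ASCII-only approximations of the two pre-tokenization patterns (assumption:
# [A-Za-z] and [0-9] stand in for the Unicode classes \p{L} and \p{N}).
GPT2_ASCII = re.compile(
    r"'s|'t|'re|'ve|'m|'ll|'d| ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+(?!\S)|\s+"
)
GPT4_ASCII = re.compile(
    r"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\nA-Za-z0-9]?[A-Za-z]+|[0-9]{1,3}"
    r"| ?[^\sA-Za-z0-9]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"
)


def split(pattern: re.Pattern, text: str) -> list[str]:
    """Chop text into the chunks that BPE merges are not allowed to cross."""
    return pattern.findall(text)


print(split(GPT2_ASCII, "HOW'S 2017"))  # ['HOW', "'", 'S', ' 2017']
print(split(GPT4_ASCII, "HOW'S 2017"))  # ['HOW', "'S", ' ', '201', '7']
```

The uppercase contraction stays in one chunk under the GPT-4 style pattern, and the year is cut into a three-digit piece plus a remainder instead of being one chunk of arbitrary length.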

06 / 07
Build one yourself · live

A real BPE tokenizer. Right here.

This is a genuine byte-level BPE tokenizer running in your browser — no fakes. It trains its own merges on a corpus, then encodes whatever you type. Train it, then play.

FIG.06 — LIVE BYTE-LEVEL BPE · TRAIN → ENCODE
Presets: tokenize ×4 low/new/slow strawberry minbpe toy
Controls: vocab size dial (training) · readouts: vocab size, learned merges, bytes in your text, tokens in your text, compression

Train on the text you're tokenizing and watch the token count drop as merges fuse repeated pieces. Crank the vocab dial up and tokeniz + ation can become a single token. That is BPE earning its keep.
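A minimal byte-level BPE with train, encode, and decode fits in well under a page. This is a sketch in the spirit of the figure, not its actual source: it skips the pre-tokenization regex entirely, and every function name here is an assumption.

```python
def get_pairs(ids):
    """Count adjacent pairs in a list of token IDs."""
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts


def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train(text, vocab_size):
    """Learn merges until the vocabulary reaches vocab_size or nothing is left to merge."""
    ids = list(text.encode("utf-8"))
    merges = {}  # (A, B) -> new_id, in learned order
    for new_id in range(256, vocab_size):
        counts = get_pairs(ids)
        if not counts:
            break
        pair = max(counts, key=counts.get)
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return merges


def encode(text, merges):
    """Replay merges in learned order: always apply the earliest-learned pair present."""
    ids = list(text.encode("utf-8"))
    while len(ids) >= 2:
        pairs = get_pairs(ids)
        pair = min(pairs, key=lambda p: merges.get(p, float("inf")))
        if pair not in merges:
            break
        ids = merge(ids, pair, merges[pair])
    return ids


def decode(ids, merges):
    """Rebuild each token's bytes from the merge list, then concatenate."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (a, b), new_id in merges.items():
        vocab[new_id] = vocab[a] + vocab[b]
    return b"".join(vocab[i] for i in ids).decode("utf-8")


text = "tokenize tokenize tokenize tokenize"
merges = train(text, vocab_size=300)
ids = encode(text, merges)
print(len(text.encode("utf-8")), "bytes ->", len(ids), "tokens")
```

Raising vocab_size allows more merges, which is the same effect as turning the vocab dial up: repeated pieces fuse into fewer, longer tokens.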

07 / 07
BPE & its cousins · and the failure modes

The family, and why LLMs can't spell.

BPE has two famous cousins. They differ mainly in how they pick what to merge or keep, and in how they encode at run time.

Algorithm | Direction | Merge / keep criterion | Encode-time | Used by
BPE | bottom-up merge | most frequent adjacent pair | replay merges in learned order | GPT-2/3/4, RoBERTa, tiktoken
WordPiece | bottom-up merge | max freq(AB)/(freq(A)·freq(B)) | greedy longest-match | BERT, DistilBERT
Unigram | top-down prune | min loss-increase if removed (probabilistic) | Viterbi best (or sampled) path | T5, ALBERT, XLNet

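The difference between the BPE and WordPiece criteria shows up even on a toy corpus. The sketch below scores every adjacent pair both ways; the corpus is hypothetical, and real WordPiece also marks word-internal pieces with a ## prefix, which is left out here for brevity.

```python
from collections import Counter

# Hypothetical toy corpus, already split into single-character symbols.
corpus = (
    [list("hug")] * 10 + [list("pug")] * 5 + [list("pun")] * 12
    + [list("bun")] * 4 + [list("hugs")] * 5
)

unit = Counter(s for word in corpus for s in word)               # freq(A)
pair = Counter(p for word in corpus for p in zip(word, word[1:]))  # freq(AB)

# BPE: the most frequent adjacent pair wins.
bpe_pick = max(pair, key=pair.get)

# WordPiece: the pair with the highest freq(AB) / (freq(A) * freq(B)) wins.
wp_pick = max(pair, key=lambda p: pair[p] / (unit[p[0]] * unit[p[1]]))

print("BPE merges:      ", bpe_pick)  # ('u', 'g')
print("WordPiece merges:", wp_pick)   # ('g', 's')
```

BPE goes for the pair that appears most often, while the WordPiece score favors a pair whose parts rarely occur apart from each other, even if the pair itself is less common.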
WHY CAN'T IT SPELL "STRAWBERRY"?

The model may see straw·berry as 2 opaque tokens (the exact split depends on the tokenizer), not a sequence of letters. The "r"s are fused inside the tokens' embeddings, not laid out to be counted. Counting them requires un-fusing what the tokenizer fused.
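A toy illustration of the gap between what the model receives and what letter-counting needs. The vocabulary entries and IDs below are made up for the example; real IDs and splits depend on the tokenizer.

```python
# Hypothetical vocabulary entries (IDs are invented for illustration).
vocab = {1001: "straw", 1002: "berry"}

ids = [1001, 1002]  # this is all the model receives

# Nothing about the integer 1002 says "contains two r's".
# Letter-level questions need the decoded characters.
text = "".join(vocab[i] for i in ids)
print(text, "has", text.count("r"), "r's")
```

The count is trivial once the text is decoded; the model has to answer the same question from two integers.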

Same root cause: flaky arithmetic (digits chunk inconsistently), the non-English token tax (5–10× more tokens for some scripts), and SolidGoldMagikarp-style glitch tokens: vocab entries that earned a merge in the tokenizer's corpus but rarely or never appeared in the model's training data, so their embeddings stay near random init.

THE STICKY DECISIONS

The tokenizer is frozen at pre-training time — baked into the embedding matrix and output head. You can't swap it without retraining. The big levers are vocab size, what corpus you trained merges on, and the pre-tokenization regex.

Whitespace is load-bearing: GPT-style tokenizers attach the leading space — " dog" is one token, distinct from "dog". A stray trailing space in your prompt can knock generation off-distribution.

01 · 02 — you made it

You built a tokenizer.

Encode and decode. The two bad extremes and the subword middle. BPE training, byte-level, the regex split, the whole family. The integers your tokenizer just produced are exactly what the next stage turns into vectors of meaning. You now hold the model's front door.
