A transformer never reads text. It reads a sequence of integers, each one indexing a learned vector. Tokenization is the fixed, non-neural function that turns a string into those integers, and a surprising share of "the model is dumb" moments trace straight back to it.
A string flows in. It is chopped into subword pieces — common words stay whole, rare words split into reusable parts — and each piece becomes a fixed integer ID. That integer is all the network ever sees.
The whole contract is two functions, encode and decode, and each fits on one line. Type below and watch any string become tokens and integer IDs, then watch the IDs reconstruct the string exactly.
After the model emits token IDs, the same tokenizer maps them back to text. The only interesting question in this whole field is: what should the tokens be?
Notice the asymmetry: decode is a trivial lookup-and-concatenate. encode is an iterative greedy reduction — the part with all the cleverness.
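The asymmetry can be sketched in a few lines. This is a minimal illustration, not any production tokenizer: the `MERGES` table and its three entries are hypothetical, hand-written stand-ins for a learned merge list, and the helper names are made up for this example.

```python
# Hypothetical toy merge table: (left_id, right_id) -> new_id, in learned order.
MERGES = {
    (104, 101): 256,  # "h" + "e"   -> "he"
    (108, 108): 257,  # "l" + "l"   -> "ll"
    (256, 257): 258,  # "he" + "ll" -> "hell"
}

# Every ID maps to the bytes it stands for: 0-255 are raw bytes, 256+ are merges.
VOCAB = {i: bytes([i]) for i in range(256)}
for (left, right), new_id in MERGES.items():
    VOCAB[new_id] = VOCAB[left] + VOCAB[right]


def decode(ids):
    """Look up each ID, concatenate the bytes, read them as UTF-8."""
    return b"".join(VOCAB[i] for i in ids).decode("utf-8", errors="replace")


def encode(text):
    """Start from raw bytes, then repeatedly apply the earliest-learned merge present."""
    ids = list(text.encode("utf-8"))
    while len(ids) >= 2:
        pairs = set(zip(ids, ids[1:]))
        # The lowest new ID was learned first, so it has the highest priority.
        best = min(pairs, key=lambda p: MERGES.get(p, float("inf")))
        if best not in MERGES:
            break  # nothing left to merge
        new_id, out, i = MERGES[best], [], 0
        while i < len(ids):
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == best:
                out.append(new_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
    return ids


ids = encode("hello")
print(ids)          # [258, 111]
print(decode(ids))  # hello
```

Decode never consults the merge order. Encode depends on it entirely, which is why the ordered merge list has to ship with the model.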
There are two obvious ways to chop text, whole words or single characters, and both are bad. Subword tokenization is the pragmatic middle. Drag the granularity and watch the cost flip.
Millions of words across languages, and any unseen word (a typo, a new name) becomes a single <UNK> hole the model can't represent.
Common words stay whole; rare words split into reusable pieces. Small fixed vocab, zero out-of-vocabulary failures, short-ish sequences.
Vocab is tiny and nothing is unknown — but sequences get enormous. Attention is quadratic in length, so you burn context budget and compute.
Slide to the left and the vocabulary shrinks toward ~100 symbols but the sequence balloons. Slide right and sequences shorten but the vocabulary explodes and unknown-word holes appear.
The middle keeps a small, fixed vocabulary and short sequences and zero unknowns. That is the entire reason subword tokenization exists.
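The two extremes are easy to reproduce by hand. The sketch below assumes a hypothetical four-word vocabulary (`WORD_VOCAB`) and a made-up example sentence. It only illustrates the trade-off and does not measure any real tokenizer.

```python
# Hypothetical tiny word-level vocabulary, for illustration only.
WORD_VOCAB = {"tokenization", "of", "the", "model"}


def word_level(text):
    """One token per whitespace-separated word; unseen words collapse to <UNK>."""
    return [w if w.lower() in WORD_VOCAB else "<UNK>" for w in text.split()]


def char_level(text):
    """One token per character; nothing is unknown, but the sequence is long."""
    return list(text)


text = "Tokenization of unbelievableness"
words = word_level(text)
chars = char_level(text)

print(words)                   # ['Tokenization', 'of', '<UNK>']
print(len(words), len(chars))  # short sequence vs. long sequence

# Attention cost grows with the square of sequence length.
print((len(chars) / len(words)) ** 2)
```

The word-level split is short but loses the rare word entirely. The character-level split loses nothing but pays for it in length, and that cost is squared under attention.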
BPE training is one loop: count adjacent pairs, mint a new token for the most frequent one, replace every occurrence. Repeat. Press Merge → and watch the top pair fuse, live, on a tiny corpus.
The list of merges, in the order learned, is the model — a deterministic priority list, not a search. IDs 0–255 are raw bytes; learned merges start at 256.
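The training loop can be sketched as follows. This is a minimal version under stated assumptions: it trains on one raw string with no pre-tokenization, stops early once no pair repeats, and breaks ties by first occurrence. The function names are illustrative and not taken from any library.

```python
from collections import Counter


def merge_pair(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`, left to right."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Return the ordered merge list and the corpus re-encoded with it."""
    ids = list(text.encode("utf-8"))  # base vocabulary: byte values 0-255
    merges = []                       # ordered: earlier entries have higher priority
    for step in range(num_merges):
        counts = Counter(zip(ids, ids[1:]))
        if not counts:
            break
        pair, freq = counts.most_common(1)[0]
        if freq < 2:
            break  # assumption: a pair seen once is not worth a token
        new_id = 256 + step
        ids = merge_pair(ids, pair, new_id)
        merges.append((pair, new_id))
    return merges, ids


merges, ids = train_bpe("aaabdaaabac", num_merges=3)
print(merges[0])  # ((97, 97), 256)
print(len(ids))   # 5
```

The returned `merges` list is everything encode needs later: replaying it in order on new text reproduces the same segmentation rules.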
What's your base vocabulary when the corpus is all of the internet — every emoji, every script? GPT-2's answer: the 256 byte values. Universal, finite, nothing is ever unknown.
Base vocabulary is exactly 256. Any string in any language is guaranteed representable — worst case, byte-by-byte. IDs 0–255 are raw bytes; 256+ are learned merges; special tokens sit above the vocab.
This is why a single emoji or a rare CJK character can cost several tokens: it's 3–4 UTF-8 bytes, and if those byte-pairs weren't frequent enough to earn a merge, they stay split — drawn here as dashed-ink byte chips.
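The byte cost is easy to check directly. The sketch below prints the UTF-8 bytes behind a few characters, using a hypothetical helper named `byte_ids`. These byte values are what a byte-level tokenizer starts from before any merges apply.

```python
def byte_ids(text):
    """Base-vocabulary IDs for a string: its UTF-8 bytes, each in 0-255."""
    return list(text.encode("utf-8"))


for ch in ["a", "é", "中", "🙂"]:
    ids = byte_ids(ch)
    print(repr(ch), len(ids), ids)

# Round trip: the bytes alone are enough to recover the character.
print(bytes(byte_ids("🙂")).decode("utf-8"))
```

An ASCII letter is one byte, an accented Latin letter two, a common CJK character three, and this emoji four. Without merges covering those bytes, each one is a separate token.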
Before BPE runs, a regex chops text into chunks — and merges may never cross a chunk boundary. GPT-4's pattern fixed several GPT-2 warts. Type and compare the two splits side by side.
GPT-2 only matched lowercase 's 't 're, so HOW'S split badly. GPT-4 handles uppercase and Unicode apostrophes.
GPT-4's pattern caps digit runs at three, so there is no bespoke token for 2017 or 8675309. It is a deliberate band-aid for the arithmetic problem: long numbers chunk consistently.
In GPT-4's pattern, runs of spaces and newlines split more cleanly, so indented source code tokenizes sanely instead of entangling with the next word.
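The contraction and digit differences can be reproduced with simplified patterns. These are ASCII-only approximations written for this sketch, not the published patterns: the real ones use Unicode classes such as `\p{L}` and `\p{N}`, which Python's built-in `re` module does not support. The names `GPT2_LIKE`, `GPT4_LIKE`, and `chunks` are hypothetical.

```python
import re

# ASCII-only approximations (assumption): letters are [A-Za-z], digits are [0-9].
GPT2_LIKE = re.compile(
    r"'s|'t|'re|'ve|'m|'ll|'d"                    # lowercase contractions only
    r"| ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+"    # words, digit runs of any length, punctuation
    r"|\s+(?!\S)|\s+"                             # whitespace
)
GPT4_LIKE = re.compile(
    r"(?i:'s|'t|'re|'ve|'m|'ll|'d)"               # contractions, case-insensitive
    r"|[^\r\nA-Za-z0-9]?[A-Za-z]+"                # word with optional leading symbol or space
    r"|[0-9]{1,3}"                                # digit runs capped at three
    r"| ?[^\sA-Za-z0-9]+[\r\n]*"                  # punctuation, absorbing trailing newlines
    r"|\s*[\r\n]+|\s+(?!\S)|\s+"                  # newlines and other whitespace
)


def chunks(pattern, text):
    """Split text into the chunks BPE is allowed to merge within."""
    return pattern.findall(text)


print(chunks(GPT2_LIKE, "HOW'S"))    # ['HOW', "'", 'S']
print(chunks(GPT4_LIKE, "HOW'S"))    # ['HOW', "'S"]
print(chunks(GPT2_LIKE, "8675309"))  # ['8675309']
print(chunks(GPT4_LIKE, "8675309"))  # ['867', '530', '9']
```

Because merges never cross a chunk boundary, these splits bound what BPE can learn. A digit run capped at three means no merge can ever produce a token longer than three digits.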
This is a genuine byte-level BPE tokenizer running in your browser — no fakes. It trains its own merges on a corpus, then encodes whatever you type. Train it, then play.
Train on the text you're tokenizing and watch the token count drop as merges fuse repeated pieces. Crank the vocab dial up and tokeniz + ation can become a single token. That is BPE earning its keep.
BPE has two famous cousins. They differ mainly in how they pick what to merge or keep, and in how they encode once training is done.
| Algorithm | Direction | Merge / keep criterion | Encode-time | Used by |
|---|---|---|---|---|
| BPE | bottom-up merge | most frequent adjacent pair | replay merges in learned order | GPT-2/3/4, RoBERTa, tiktoken |
| WordPiece | bottom-up merge | max freq(AB)/(freq(A)·freq(B)) | greedy longest-match | BERT, DistilBERT |
| Unigram | top-down prune | min loss-increase if removed (probabilistic) | Viterbi best (or sampled) path | T5, ALBERT, XLNet |
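The first two rows of the table differ only in the scoring function, which a toy comparison makes concrete. The corpus below is hypothetical (pre-split words with made-up frequencies) and the function names are illustrative. The WordPiece score follows the formula in the table.

```python
from collections import Counter


def pair_and_unit_counts(words):
    """Count adjacent pairs and individual symbols, weighted by word frequency."""
    pairs, units = Counter(), Counter()
    for symbols, freq in words.items():
        for s in symbols:
            units[s] += freq
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs, units


def bpe_choice(words):
    """BPE: merge the most frequent adjacent pair."""
    pairs, _ = pair_and_unit_counts(words)
    return max(pairs, key=pairs.get)


def wordpiece_choice(words):
    """WordPiece: merge the pair maximizing freq(AB) / (freq(A) * freq(B))."""
    pairs, units = pair_and_unit_counts(words)
    return max(pairs, key=lambda p: pairs[p] / (units[p[0]] * units[p[1]]))


# Hypothetical corpus: word (as a tuple of symbols) -> frequency.
corpus = {("t", "h", "e"): 10, ("t", "h", "a", "t"): 5, ("q", "i"): 2}

print(bpe_choice(corpus))        # ('t', 'h')
print(wordpiece_choice(corpus))  # ('q', 'i')
```

BPE rewards raw frequency, so the common pair wins. The WordPiece score divides by how often each part appears on its own, so a rare pair whose parts never occur apart can outrank it.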
The model sees straw·berry as 2 opaque tokens — not a sequence of letters. The "r"s are fused inside the token's embedding, not laid out to be counted. Counting them requires un-fusing what the tokenizer fused.
Same root cause: flaky arithmetic (digits chunk inconsistently), the non-English token tax (5–10× more tokens for some scripts), and SolidGoldMagikarp glitch tokens — vocab entries that were never trained, so their embeddings stay near random init.
The tokenizer is frozen at pre-training time — baked into the embedding matrix and output head. You can't swap it without retraining. The big levers are vocab size, what corpus you trained merges on, and the pre-tokenization regex.
Whitespace is load-bearing: GPT-style tokenizers attach the leading space — " dog" is one token, distinct from "dog". A stray trailing space in your prompt can knock generation off-distribution.
Encode and decode. The granularity trade-off. BPE training, byte-level, the regex split, the whole family. The integers your tokenizer just produced are exactly what the next stage turns into vectors of meaning. You now hold the model's front door.
Your token IDs are just indices. Now turn each one into a learned vector of meaning the network can actually reason over.