OpenAlice Academy — 01 · 02 / Tokenization

01 / 07

The interface · two functions

A tokenizer is just encode and decode.

The whole contract fits on one line each. Type below and watch any string become tokens and integer IDs — then watch the IDs reconstruct the string, exactly.

FIG.01 — LIVE ENCODE / DECODE ROUND-TRIP


encode(str) → [int]   // text → IDs the model reads
decode([int]) → str   // IDs → text, exactly reversible

After the model emits token IDs, the same tokenizer maps them back to text. The only interesting question in this whole field is: what should the tokens be?

Notice the asymmetry: decode is a trivial lookup-and-concatenate. encode is an iterative greedy reduction — the part with all the cleverness.
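The asymmetry can be sketched in a few lines of Python. This is a minimal illustration, not any production tokenizer: the two-entry `MERGES` table is hand-written and hypothetical, and the helper names (`build_vocab`, `encode`, `decode`) are made up for this example. It assumes byte-level IDs, with 0–255 as raw bytes and merges numbered from 256.

```python
# Hypothetical hand-written merge table: "l"+"o" -> 256, then 256+"w" -> 257.
MERGES = {(108, 111): 256, (256, 119): 257}

def build_vocab(merges):
    """Map every ID to its bytes: 0-255 are raw bytes, merges concatenate."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (a, b), new_id in sorted(merges.items(), key=lambda kv: kv[1]):
        vocab[new_id] = vocab[a] + vocab[b]
    return vocab

def encode(text, merges=MERGES):
    """Start from UTF-8 bytes, then repeatedly apply the earliest-learned merge."""
    ids = list(text.encode("utf-8"))
    while len(ids) >= 2:
        pairs = set(zip(ids, ids[1:]))
        best = min(pairs, key=lambda p: merges.get(p, float("inf")))
        if best not in merges:
            break  # no learned merge applies any more
        new_id, out, i = merges[best], [], 0
        while i < len(ids):
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == best:
                out.append(new_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
    return ids

def decode(ids, merges=MERGES):
    """Look up each ID's bytes, concatenate, and decode as UTF-8."""
    vocab = build_vocab(merges)
    return b"".join(vocab[i] for i in ids).decode("utf-8")

ids = encode("lower")   # [257, 101, 114]: "low", "e", "r"
text = decode(ids)      # "lower"
```

All of the looping lives in `encode`; `decode` is a single join over a lookup table.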

02 / 07

Why subwords · the goldilocks zone

Per-word and per-char are both traps.

There are two obvious ways to chop text, per word and per character (or byte), and both are bad. Subword tokenization is the pragmatic middle that modern LLM tokenizers settle on. Drag the granularity and watch the cost flip.

PER WORD

vocabulary explodes

Millions of words across languages, and any unseen word (a typo, a new name) becomes a single <UNK> hole the model can't represent.

SUBWORD ★

the goldilocks zone

Common words stay whole; rare words split into reusable pieces. The vocabulary stays small and fixed, sequences stay short-ish, and with a base alphabet that covers every input (such as raw bytes) nothing is out-of-vocabulary.

PER CHAR / BYTE

sequences explode

Vocab is tiny and nothing is unknown — but sequences get enormous. Attention is quadratic in length, so you burn context budget and compute.

FIG.02 — SAME SENTENCE · THREE GRANULARITIES · COST FLIPS

Slider: PER CHAR ↔ SUBWORD ↔ PER WORD. Readouts: tokens in sequence, vocabulary size, unknown-word risk.

Slide to the left and the vocabulary shrinks toward ~100 symbols but the sequence balloons. Slide right and sequences shorten but the vocabulary explodes and unknown-word holes appear.

The middle keeps a small, fixed vocabulary, short sequences, and zero unknowns. That is the entire reason subword tokenization exists.
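To put rough numbers on the trade-off, here is a small sketch that measures sequence length for one sentence at the two extremes. The function names and the example sentence are illustrative assumptions, and the quadratic cost is the simple n² scaling of self-attention, ignoring constants.

```python
def granularity_stats(sentence):
    """Sequence length under three naive granularities."""
    return {
        "per_word": len(sentence.split()),          # whitespace-separated words
        "per_char": len(sentence),                  # Unicode code points
        "per_byte": len(sentence.encode("utf-8")),  # raw UTF-8 bytes
    }

def relative_attention_cost(n_tokens, baseline_tokens):
    """Self-attention compares every token with every other: cost grows as n^2."""
    return (n_tokens / baseline_tokens) ** 2

stats = granularity_stats("Tokenization turns text into integers")
cost = relative_attention_cost(stats["per_char"], stats["per_word"])
print(stats, round(cost, 1))
```

For this sentence the per-character sequence is about seven times longer than the per-word one, so attention over it costs roughly fifty times more. What the sketch does not show is the other side of the trade: the per-word vocabulary needed to cover real text.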

03 / 07

The core algorithm · training the merges

Watch Byte-Pair Encoding learn.

BPE training is one loop: count every adjacent pair, pick the most frequent one, mint a new token for it, and replace every occurrence. Repeat. Press Merge → and watch the top pair fuse, live, on a tiny corpus.

FIG.03 — LIVE BPE TRAINING · TINY CORPUS

CORPUS (word · frequency) — current tokens highlighted


// one BPE training step
1. count every adjacent (A,B) pair × word-freq
2. pick the most frequent pair
3. mint token AB, give it the next free ID
4. replace all A B → AB
repeat until vocab is full

The list of merges, in the order learned, is the model — a deterministic priority list, not a search. IDs 0–255 are raw bytes; learned merges start at 256.
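The training step above can be sketched in Python as follows. This is a minimal byte-level version under stated assumptions: the word-frequency corpus is a made-up toy, the names `merge_pair` and `train_bpe` are hypothetical, and ties between equally frequent pairs are broken by whichever pair was counted first, which is one possible convention rather than a standard.

```python
from collections import Counter

def merge_pair(tokens, pair, new_id):
    """Replace every adjacent occurrence of `pair` in `tokens` with `new_id`."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return tuple(out)

def train_bpe(corpus, num_merges):
    """corpus maps word -> frequency; returns (merges, vocab)."""
    words = {tuple(w.encode("utf-8")): f for w, f in corpus.items()}
    vocab = {i: bytes([i]) for i in range(256)}  # IDs 0-255 are raw bytes
    merges = {}                                  # (A, B) -> new ID, in learned order
    for _ in range(num_merges):
        # 1. count every adjacent pair, weighted by word frequency
        counts = Counter()
        for tokens, freq in words.items():
            for pair in zip(tokens, tokens[1:]):
                counts[pair] += freq
        if not counts:
            break  # every word is already a single token
        # 2. pick the most frequent pair (first-counted wins ties)
        best = max(counts, key=counts.get)
        # 3. mint token AB with the next free ID
        new_id = len(vocab)
        merges[best] = new_id
        vocab[new_id] = vocab[best[0]] + vocab[best[1]]
        # 4. replace all A B -> AB
        words = {merge_pair(t, best, new_id): f for t, f in words.items()}
    return merges, vocab

merges, vocab = train_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 4)
learned = [vocab[256 + i] for i in range(4)]  # [b"es", b"est", b"lo", b"low"]
```

The returned `merges` dictionary preserves insertion order, so it doubles as the priority list that encoding replays.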

04 / 07

The GPT-2 / GPT-4 / tiktoken trick

Run BPE over raw bytes, not characters.

What's your base vocabulary when the corpus is all of the internet — every emoji, every script? GPT-2's answer: the 256 byte values. Universal, finite, nothing is ever unknown.

WHY BYTES WIN

Base vocabulary is exactly 256. Any string in any language is guaranteed representable — worst case, byte-by-byte. IDs 0–255 are raw bytes; 256+ are learned merges; special tokens sit above the vocab.

This is why a single emoji or a rare CJK character can cost several tokens: it's 3–4 UTF-8 bytes, and if those byte-pairs weren't frequent enough to earn a merge, they stay split — drawn here as dashed-ink byte chips.
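The byte counts are easy to check directly. A short sketch using only the standard `str.encode` method (the helper name `utf8_bytes` is an assumption for this example):

```python
def utf8_bytes(ch):
    """Return the UTF-8 byte values that a byte-level tokenizer starts from."""
    return list(ch.encode("utf-8"))

# "e", "é" (U+00E9), "日" (U+65E5), and the sushi emoji (U+1F363)
samples = ["e", "\u00e9", "\u65e5", "\U0001F363"]
byte_counts = [len(utf8_bytes(ch)) for ch in samples]  # [1, 2, 3, 4]

for ch in samples:
    print(repr(ch), [hex(b) for b in utf8_bytes(ch)])
```

Without any learned merges, each of those bytes is its own token, so the emoji alone costs four.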

FIG.04 — UTF-8 BYTES → BPE MERGES · type non-ASCII

hello café 🍣 sushi 日本語 naïve

Panel: raw UTF-8 bytes (hex) · solid = merged · dashed = lone byte. Readouts: characters, UTF-8 bytes, resulting tokens.

05 / 07

Pre-tokenization · the regex split

The split GPT-4 fixed.

Before BPE runs, a regex chops text into chunks — and merges may never cross a chunk boundary. GPT-4's pattern fixed several GPT-2 warts. Type and compare the two splits side by side.

FIG.05 — PRE-TOKENIZATION SPLIT · GPT-2 vs GPT-4 (cl100k)

Panels: GPT-2 split and GPT-4 (cl100k) split, each with a chunk count. Sample inputs: HOW'S 2017 · don't stop · numbers · whitespace.

CONTRACTIONS

case-insensitive

GPT-2 only matched lowercase 's 't 're, so HOW'S split badly. GPT-4 handles uppercase and Unicode apostrophes.

NUMBERS

capped at 1–3 digits

Digit runs are split into chunks of at most three, so there is no bespoke token for 2017 or 8675309. A deliberate band-aid for the arithmetic problem: long numbers chunk consistently.

WHITESPACE

better for code

Runs of spaces and newlines split more cleanly, so indented source code tokenizes sanely instead of entangling with the next word.
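The three differences can be reproduced with Python's standard `re` module. Treat the patterns below as simplified, ASCII-only approximations written for this sketch: the real GPT-2 and cl100k patterns use Unicode property classes such as `\p{L}` and `\p{N}`, which `re` does not support, and the names `GPT2_LIKE`, `GPT4_LIKE`, and `pre_split` are hypothetical.

```python
import re

# ASCII-only approximation of the GPT-2 split: lowercase contractions only,
# digit runs of any length, and a leading space glued to the next chunk.
GPT2_LIKE = re.compile(
    r"'s|'t|'re|'ve|'m|'ll|'d"
    r"| ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+"
    r"|\s+(?!\S)|\s+"
)

# ASCII-only approximation of the cl100k split: case-insensitive contractions,
# digit runs capped at three, and separate handling of newlines.
GPT4_LIKE = re.compile(
    r"'(?i:[sdmt]|ll|ve|re)"
    r"|[^\r\nA-Za-z0-9]?[A-Za-z]+|[0-9]{1,3}"
    r"| ?[^\sA-Za-z0-9]+[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+"
)

def pre_split(pattern, text):
    """Chop text into chunks; BPE merges never cross a chunk boundary."""
    return pattern.findall(text)

old = pre_split(GPT2_LIKE, "HOW'S 2017")  # ["HOW", "'", "S", " 2017"]
new = pre_split(GPT4_LIKE, "HOW'S 2017")  # ["HOW", "'S", " ", "201", "7"]
```

Both splits are lossless: joining the chunks gives back the input. Only the boundaries move, and with them which merges are allowed to form.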

06 / 07

Build one yourself · live

A real BPE tokenizer. Right here.

This is a genuine byte-level BPE tokenizer running in your browser — no fakes. It trains its own merges on a corpus, then encodes whatever you type. Train it, then play.

FIG.06 — LIVE BYTE-LEVEL BPE · TRAIN → ENCODE

Presets: tokenize ×4 · low/new/slow · strawberry · minbpe toy. Panels: tokens (hover any chip for its byte content + ID); training vocab-size dial (shown at 300). Readouts: vocab size (256 before training), learned merges (0 before training), your text in bytes, your text in tokens, compression.

Train on the text you're tokenizing and watch the token count drop as merges fuse repeated pieces. Crank the vocab dial up and tokeniz + ation can become a single token. That is BPE earning its keep.

07 / 07

BPE & its cousins · and the failure modes

The family, and why LLMs can't spell.

BPE has two famous cousins. They differ mainly in how they pick what to merge or keep, and in how they encode with the result.

Algorithm	Direction	Merge / keep criterion	Encode-time	Used by
BPE	bottom-up merge	most frequent adjacent pair	replay merges in learned order	GPT-2/3/4, RoBERTa, tiktoken
WordPiece	bottom-up merge	max freq(AB)/(freq(A)·freq(B))	greedy longest-match	BERT, DistilBERT
Unigram	top-down prune	min loss-increase if removed (probabilistic)	Viterbi best (or sampled) path	T5, ALBERT, XLNet
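The two bottom-up criteria in the table can disagree on the same counts. A small sketch, with toy frequencies invented for illustration and hypothetical function names:

```python
def bpe_pick(pair_freq):
    """BPE: merge the most frequent adjacent pair."""
    return max(pair_freq, key=pair_freq.get)

def wordpiece_pick(pair_freq, token_freq):
    """WordPiece-style: merge the pair with the highest freq(AB) / (freq(A) * freq(B))."""
    def score(pair):
        a, b = pair
        return pair_freq[pair] / (token_freq[a] * token_freq[b])
    return max(pair_freq, key=score)

# Made-up counts: "th" is common, but "t" and "h" are common on their own too;
# "qu" is rarer, yet "q" almost never appears without "u".
pair_freq = {("t", "h"): 50, ("q", "u"): 20}
token_freq = {"t": 200, "h": 100, "q": 20, "u": 60}

bpe_choice = bpe_pick(pair_freq)                          # ("t", "h")
wordpiece_choice = wordpiece_pick(pair_freq, token_freq)  # ("q", "u")
```

Raw frequency rewards pairs that are simply common; the normalized score rewards pairs whose parts rarely occur apart.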

WHY CAN'T IT SPELL "STRAWBERRY"?

The model sees straw·berry as 2 opaque tokens — not a sequence of letters. The "r"s are fused inside the token's embedding, not laid out to be counted. Counting them requires un-fusing what the tokenizer fused.

Same root cause: flaky arithmetic (digits chunk inconsistently), the non-English token tax (5–10× more tokens for some scripts), and SolidGoldMagikarp glitch tokens — vocab entries that were never trained, so their embeddings stay near random init.

THE STICKY DECISIONS

The tokenizer is frozen at pre-training time — baked into the embedding matrix and output head. You can't swap it without retraining. The big levers are vocab size, what corpus you trained merges on, and the pre-tokenization regex.

Whitespace is load-bearing: GPT-style tokenizers attach the leading space — " dog" is one token, distinct from "dog". A stray trailing space in your prompt can knock generation off-distribution.

01 · 02 — you made it

You built
a tokenizer.

Encode and decode. The two bad extremes and the subword middle. BPE training, byte-level, the regex split, the whole family. The integers your tokenizer just produced are exactly what the next stage turns into vectors of meaning. You now hold the model's front door.

