OpenAlice Academy — 01 · 03 / Embeddings

01 / 07

The core object · one lookup table

It's all just one matrix of floats.

The whole "semantic space" lives in a single big table — the embedding matrix E, shape [vocab × dim]. Row i is the vector for token i. Pick a token and watch its row light up — that's the entire lookup.

FIG.01 — LOOKUP IS A ONE-HOT × E = "GRAB ONE ROW"

embed(id) = E[id]            // grab row `id`
          = onehot(id) · E   // a matmul, secretly an index

A one-hot vector is all zeros except a single 1 at position id. Multiply it by E and you select exactly one row — which is why people say the embedding layer is a matmul that's secretly a lookup.
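A minimal NumPy sketch of that equivalence. The vocabulary size, dimension, and token id are made-up toy values, and E is filled with random numbers, as it would be before training.

```python
import numpy as np

vocab, dim = 10, 4                      # toy sizes, chosen for illustration
rng = np.random.default_rng(0)
E = rng.normal(size=(vocab, dim))       # embedding matrix, random at init

def embed(token_id: int) -> np.ndarray:
    """Lookup: grab row `token_id` of E."""
    return E[token_id]

def embed_via_matmul(token_id: int) -> np.ndarray:
    """Same result the long way: a one-hot vector times E."""
    onehot = np.zeros(vocab)
    onehot[token_id] = 1.0
    return onehot @ E

token_id = 7
same = np.allclose(embed(token_id), embed_via_matmul(token_id))
```

Frameworks implement the index form because the matmul touches every row of E just to keep one.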

E starts as random numbers. Nothing about the initial values means anything — meaning is carved into the matrix by backpropagation, the same way every other weight is learned. By the end of training, similar tokens have drifted to similar rows.

02 / 07

Nearness = similarity · the cartographer

Similar things land near each other.

Picture a vast map where every concept is a pin. Real spaces have hundreds of dimensions, but the principle is the same as a paper map — distance carries information. Drag a word and watch its nearest neighbours update live by cosine similarity.

FIG.02 — A 2-D SLICE OF MEANING-SPACE · DRAG ANY PIN

real spaces are 768–4096-D — this is a flattened 2-D slice. drag pet toward finance and watch its neighbours flip.

NEAREST TO cat · by cosine

"cat" and "dog" land close; "cat" and "democracy" land far apart. The model was never told what a cat is — the geometry fell out of predicting which words appear near which.

This is the discrete → continuous jump: symbols have no inherent distance ("apple" isn't between cat and dog), but vectors do. A smooth, differentiable space is exactly what gradient learning needs.
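The nearest-neighbour readout in this section can be sketched in a few lines. The 2-D vectors below are hand-made toy values, not output from a trained model, and the function name `nearest` is hypothetical.

```python
import numpy as np

# Hand-made 2-D toy vectors (illustrative only, not from a trained model).
words = ["cat", "dog", "kitten", "bank", "loan", "democracy"]
X = np.array([
    [0.90, 0.10],
    [0.80, 0.20],
    [0.95, 0.05],
    [0.10, 0.90],
    [0.20, 0.95],
    [-0.70, 0.60],
])

def nearest(word, k=2):
    """Return the k words closest to `word` by cosine similarity."""
    q = X[words.index(word)]
    sims = (X @ q) / (np.linalg.norm(X, axis=1) * np.linalg.norm(q))
    order = np.argsort(-sims)                       # most similar first
    hits = [(words[i], float(sims[i])) for i in order if words[i] != word]
    return hits[:k]

neighbours = nearest("cat")
```

With these toy values the two nearest words to "cat" are "kitten" and "dog", and "democracy" comes last.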

03 / 07

Directions = relationships · the famous trick

You can do arithmetic on meaning.

In good embeddings the direction from "man" to "woman" is roughly the same as "king" to "queen." So king − man + woman ≈ queen. Pick an analogy and watch the vector subtraction-then-addition play out, then land on its nearest neighbour.

FIG.03 — ANALOGY ARITHMETIC · a − b + c → nearest neighbour

MODEL'S ANSWER · nearest to the result

Honest caveat: this is cherry-picked. It works on curated examples and usually has to exclude the input words from the answer set, or the nearest vector is just "king" again. Real, but fragile — don't oversell it as general reasoning.
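A small sketch of why the input words get excluded. The 3-D vectors are hand-made and hypothetical (axes loosely "royal", "female", "other"), tuned so the effect shows up; real embeddings are noisier.

```python
import numpy as np

# Hand-made toy vectors (hypothetical, not from a trained model).
vocab = {
    "king":  np.array([0.9, 0.10, 0.3]),
    "queen": np.array([0.7, 0.60, 0.2]),
    "man":   np.array([0.1, 0.10, 0.4]),
    "woman": np.array([0.1, 0.25, 0.4]),
    "apple": np.array([0.0, 0.10, 0.9]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(a, b, c, exclude_inputs=True):
    """Return the word nearest to vocab[a] - vocab[b] + vocab[c] by cosine."""
    target = vocab[a] - vocab[b] + vocab[c]
    banned = {a, b, c} if exclude_inputs else set()
    candidates = [w for w in vocab if w not in banned]
    return max(candidates, key=lambda w: cosine(vocab[w], target))

with_exclusion = analogy("king", "man", "woman")
without_exclusion = analogy("king", "man", "woman", exclude_inputs=False)
```

In this toy setup `with_exclusion` is "queen" while `without_exclusion` is "king": the offset is small next to the vector itself, so the result barely leaves its starting point.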

04 / 07

word2vec · predict the neighbours

Meaning is learned by good company.

The distributional hypothesis: "a word is known by the company it keeps." word2vec slides a window over raw text; words sharing contexts get pushed toward similar vectors. Step the window and watch skip-gram push the real neighbours together and random words apart.

FIG.04 — SKIP-GRAM · CENTER → CONTEXT WINDOW

CORPUS — window highlighted · center vs context

// skip-gram objective
maximize Σ log P(context | center)
P(o|c) = softmax( u_o · v_c )   // dot product score

// negative sampling (cheap):
push real (center, context) dot up ▲
push k random negatives dot down ▼

A full softmax over a 50k vocabulary is expensive, so word2vec uses negative sampling: nudge the real pair's dot product up, and a handful of random words' dot products down. That contrastive shape — real pairs up, random pairs down — comes back at sentence scale later.
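One negative-sampling update can be sketched as follows. This is a bare-bones illustration, not the original word2vec code: the function name `sgns_step`, the learning rate, and the toy sizes are assumptions, and the negatives are fixed by hand rather than sampled from a frequency distribution.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(V, U, center, context, negatives, lr=0.1):
    """One SGD step of skip-gram with negative sampling.

    V holds center vectors, U holds context vectors (both [vocab, dim]).
    Both are updated in place. Returns the loss before the update.
    """
    v = V[center].copy()
    ids = np.array([context] + list(negatives))
    labels = np.zeros(len(ids))
    labels[0] = 1.0                       # real pair = 1, negatives = 0
    probs = sigmoid(U[ids] @ v)           # dot product scores, squashed
    loss = -np.log(probs[0]) - np.sum(np.log(1.0 - probs[1:]))
    err = probs - labels                  # gradient of loss w.r.t. each score
    V[center] -= lr * (err @ U[ids])
    np.add.at(U, ids, -lr * np.outer(err, v))   # safe if ids repeat
    return float(loss)

rng = np.random.default_rng(0)
vocab, dim = 6, 8                         # toy sizes
V = rng.normal(scale=0.1, size=(vocab, dim))
U = rng.normal(scale=0.1, size=(vocab, dim))
center, context, negatives = 0, 1, [2, 3]

before = U[[context] + negatives] @ V[center]
losses = [sgns_step(V, U, center, context, negatives) for _ in range(100)]
after = U[[context] + negatives] @ V[center]
```

Comparing `before` and `after` shows the real pair's dot product rising and the negatives' dot products falling, which is the whole trick.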

GloVe reaches the same geometry from the other side: factorize a global word–word co-occurrence matrix so that wᵢ · wⱼ ≈ log(count). Predictive and count-based converge.
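A rough sketch of the GloVe objective as a weighted least-squares loss. The bias terms and the weighting function (with x_max = 100 and alpha = 0.75) follow the published GloVe formulation rather than anything stated here, and the data is synthetic: the "counts" are generated so that the vectors explain them perfectly.

```python
import numpy as np

def glove_loss(W, W_ctx, b, b_ctx, X, x_max=100.0, alpha=0.75):
    """Weighted squared error between w_i . w_j + biases and log(count)."""
    i, j = np.nonzero(X)                          # observed co-occurrences only
    pred = np.sum(W[i] * W_ctx[j], axis=1) + b[i] + b_ctx[j]
    weight = np.minimum(1.0, (X[i, j] / x_max) ** alpha)
    return float(np.sum(weight * (pred - np.log(X[i, j])) ** 2))

rng = np.random.default_rng(0)
W = rng.normal(size=(5, 3))                       # toy word vectors
W_ctx = rng.normal(size=(5, 3))                   # toy context vectors
b = np.zeros(5)
b_ctx = np.zeros(5)

X = np.exp(W @ W_ctx.T)                           # synthetic "counts"
perfect = glove_loss(W, W_ctx, b, b_ctx, X)       # vectors match log(count)
worse = glove_loss(W + 0.5, W_ctx, b, b_ctx, X)   # perturbed vectors
```

Training GloVe means minimizing this loss over the vectors and biases; `perfect` sits at essentially zero because the dot products already equal log(count).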

05 / 07

The workhorse metric · angle, not distance

Similarity is just the cosine of an angle.

"How similar are these two vectors?" becomes "what's the cosine of the angle between them?" — one cheap operation. Drag the second vector and watch the cosine swing from +1 (same direction) through 0 (unrelated) to −1 (opposite).

FIG.05 — COSINE SIMILARITY · DRAG VECTOR b

cos(a, b) = (a · b) / (‖a‖ · ‖b‖)

READOUTS · cosine +1.00 (identical direction) · angle θ = 0° · dot product a·b · ‖a‖ · ‖b‖

cosine ignores magnitude (often just frequency/length noise) and keeps only direction — which carries the meaning.
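The formula is short enough to write out directly. A minimal sketch with made-up 2-D vectors, showing the three landmark values and that rescaling a vector leaves the cosine unchanged:

```python
import numpy as np

def cosine(a, b):
    """cos(a, b) = (a . b) / (|a| * |b|)"""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 0.0])
same_direction = cosine(a, [3.0, 0.0])    # longer, same direction
orthogonal = cosine(a, [0.0, 2.0])        # right angle
opposite = cosine(a, [-5.0, 0.0])         # flipped

b = np.array([0.6, 0.8])
scale_invariant = np.isclose(cosine(a, b), cosine(10 * a, b))
```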

THE SHORTCUT

normalize → it's a dot

L2-normalize every vector to unit length and cos(a,b) = a·b exactly. Cosine, dot, and Euclidean nearest-neighbour then all rank results identically.
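The reason the rankings agree: for unit vectors, ‖a − b‖² = 2 − 2·(a·b), so smaller distance always means larger dot product. A quick check on random stand-in vectors (the sizes and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
docs = rng.normal(size=(50, 16))          # random stand-ins for embeddings
query = rng.normal(size=16)

def unit(x):
    """L2-normalize along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

docs_u, query_u = unit(docs), unit(query)

cos = (docs @ query) / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))
dot = docs_u @ query_u                    # plain dot on unit vectors
dist = np.linalg.norm(docs_u - query_u, axis=1)

rank_cos = np.argsort(-cos)               # high cosine first
rank_dot = np.argsort(-dot)               # high dot first
rank_dist = np.argsort(dist)              # small distance first
```

All three `rank_*` arrays come out identical here, which is why many vector stores normalize once at write time and then use whichever metric is cheapest.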

AT SCALE

ANN, not brute force

You don't compare against millions of vectors one by one: approximate-nearest-neighbour (ANN) indexes such as HNSW and IVF trade a sliver of recall for orders-of-magnitude speed. pgvector supports both index types.
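To make the recall-for-speed trade concrete, here is a toy IVF-style index in NumPy. It is a deliberately crude sketch: the centroids are random data points rather than k-means centres, the names `ivf_search` and `n_probe` are illustrative, and production libraries are far more refined.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 32))                   # random stand-in vectors
X /= np.linalg.norm(X, axis=1, keepdims=True)     # unit length, so dot = cosine

n_lists = 20
# Crude centroids: random data points (a real index would run k-means).
centroids = X[rng.choice(len(X), n_lists, replace=False)]
assign = np.argmax(X @ centroids.T, axis=1)       # nearest centroid per vector
lists = [np.flatnonzero(assign == c) for c in range(n_lists)]

def brute_force(q, k=5):
    """Exact search: score every vector."""
    return np.argsort(-(X @ q))[:k]

def ivf_search(q, k=5, n_probe=3):
    """Approximate search: score only vectors in the n_probe closest lists."""
    probe = np.argsort(-(centroids @ q))[:n_probe]
    cand = np.concatenate([lists[c] for c in probe])
    return cand[np.argsort(-(X[cand] @ q))[:k]]

q = rng.normal(size=32)
q /= np.linalg.norm(q)

exact = brute_force(q)
approx = ivf_search(q)                            # scans a fraction of the data
recall = len(set(approx.tolist()) & set(exact.tolist())) / len(exact)
```

Raising `n_probe` scans more lists and pushes recall toward 1.0; probing every list reproduces the exact answer at brute-force cost.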

WHY NOT EUCLIDEAN?

direction beats distance

Raw distance is swayed by vector length, which often just tracks token frequency. Cosine throws that away — which is why it's the safe default for semantic text.

06 / 07

BERT, Sentence-BERT, and the bi-encoder

One vector per word is not enough.

word2vec is static: "bank" gets one vector whether you mean a river or money. The big shift — a token's vector should depend on its sentence. Toggle the sense and watch "bank" move.

FIG.06 — STATIC vs CONTEXTUAL · "bank" disambiguates

"sat by the river bank" "deposit it at the bank"

BERT pre-trains by masking ~15% of tokens and predicting them from the context on both sides (the MLM, or masked language modelling, objective). To fill a blank you must build a context-aware representation of every token, so a token's "embedding" is no longer a fixed table row: it's the hidden state after attention has let every token mix with every other.

SENTENCE-BERT · THE BI-ENCODER

Finding the most similar pair among 10,000 sentences with full BERT means scoring every pair: 10,000 × 9,999 / 2 ≈ 50M inferences (~65 hours). SBERT runs BERT once per sentence, mean-pools the token states into one vector, and fine-tunes so cosine reflects meaning. The 65-hour search drops to ~5 seconds, which is the foundation of vector DBs and RAG pipelines.
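The pair arithmetic and the mean-pooling step can be sketched like this. The hidden states are random stand-ins for what BERT would output, and `mean_pool` with an attention mask is one common way to pool, assumed here for illustration.

```python
import numpy as np

n = 10_000
cross_encoder_passes = n * (n - 1) // 2   # one forward pass per sentence pair
bi_encoder_passes = n                     # one forward pass per sentence

def mean_pool(token_states, attention_mask):
    """Average token vectors per sentence, ignoring padding positions.

    token_states: [batch, tokens, dim], attention_mask: [batch, tokens] of 0/1.
    """
    mask = attention_mask[..., None].astype(float)
    summed = (token_states * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)   # avoid divide-by-zero
    return summed / counts

rng = np.random.default_rng(0)
states = rng.normal(size=(2, 4, 3))       # stand-in for BERT hidden states
mask = np.array([[1, 1, 0, 0],            # sentence 1: two real tokens
                 [1, 1, 1, 1]])           # sentence 2: four real tokens
sentence_vectors = mean_pool(states, mask)
```

`cross_encoder_passes` works out to 49,995,000, the "~50M" above, against 10,000 passes for the bi-encoder.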

Modern retrieval embedders are trained with a contrastive InfoNCE loss: pull query + relevant doc together, push many irrelevant docs away. Same shape as word2vec negative sampling, scaled to sentences.
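A minimal InfoNCE sketch, assuming in-batch negatives (each query's positive is the document at the same row, every other row is a negative) and a temperature of 0.05. Both are common choices picked here for illustration, and the vectors are random stand-ins.

```python
import numpy as np

def info_nce(Q, D, temperature=0.05):
    """Mean cross-entropy of each query picking its own document."""
    Q = Q / np.linalg.norm(Q, axis=1, keepdims=True)
    D = D / np.linalg.norm(D, axis=1, keepdims=True)
    logits = Q @ D.T / temperature                    # cosine / temperature
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))        # diagonal = true pairs

rng = np.random.default_rng(0)
Q = rng.normal(size=(8, 16))                          # toy query vectors
D_aligned = Q + 0.01 * rng.normal(size=(8, 16))       # each doc near its query
D_shuffled = np.roll(D_aligned, 1, axis=0)            # pairs deliberately broken

loss_aligned = info_nce(Q, D_aligned)
loss_shuffled = info_nce(Q, D_shuffled)
```

The loss is low when every query sits closest to its own document and high when the pairing is broken, so minimizing it pulls true pairs together and pushes the rest apart.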

07 / 07

The tradeoffs · and the honest caveats

The family, and where it bites.

Embedding flavours differ mostly in context, training signal, and granularity. Pick the simplest one that works.

Axis	Option A	Option B	The tradeoff
Context	Static — word2vec, GloVe	Contextual — BERT, LLMs	static = one vector/word, fast, no sense disambiguation. contextual = sense-aware, needs a forward pass.
Signal	Predictive — word2vec, MLM	Count-based — GloVe	converge to similar geometry; count-based uses global stats, predictive streams local windows.
Granularity	Token embeddings	Sentence / doc embeddings	tokens feed transformers; pooled sentence vectors feed search & RAG. different jobs.
Architecture	Bi-encoder (encode once)	Cross-encoder (encode the pair)	bi-encoder is fast + indexable; cross-encoder is more accurate but needs a forward pass per pair: O(N) per query, O(N²) for all pairs. combo: retrieve, then re-rank.
Dimension d	Small (256)	Large (1536–4096)	bigger = more expressive but more storage/compute. Matryoshka embeddings let you truncate.
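The dimension row is easy to put numbers on. A sketch of the storage arithmetic and of Matryoshka-style truncation, assuming float32 storage; the vectors are random stand-ins, and truncation only preserves quality for models trained with that objective.

```python
import numpy as np

def index_bytes(n_vectors, dim, bytes_per_float=4):
    """Raw storage for the vectors alone (float32 assumed, no index overhead)."""
    return n_vectors * dim * bytes_per_float

def truncate(vecs, d):
    """Keep the first d dimensions, then re-normalize to unit length."""
    cut = vecs[:, :d]
    return cut / np.linalg.norm(cut, axis=1, keepdims=True)

large = index_bytes(1_000_000, 1536)      # bytes for 1M vectors at d = 1536
small = index_bytes(1_000_000, 256)       # bytes for 1M vectors at d = 256

rng = np.random.default_rng(0)
vecs = rng.normal(size=(10, 1536))        # random stand-ins for embeddings
short = truncate(vecs, 256)
```

One million vectors take about 6.1 GB at d = 1536 and about 1.0 GB at d = 256, a 6× saving before any index overhead.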

ANALOGIES ARE CHERRY-PICKED

king − man + woman ≈ queen is real but fragile — it works on curated cases and usually excludes the input words. It is not general reasoning.

Embeddings inherit and amplify bias — the same geometry that makes analogies work also encodes "doctor − man + woman" distortions. And "similar" is under-specified: similar how? topically, sentimentally, stylistically? A space bakes in one notion of similarity, set by its training data.

THE OPERATIONAL HAZARD

Re-embedding your corpus with a new model produces vectors in a different space. You cannot mix old and new vectors in one index — an embedder upgrade means a full re-index. People forget this and silently corrupt retrieval.
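A toy demonstration of the hazard. The "new model" here is simulated by cyclically shifting the coordinates of the old vectors, which keeps the geometry identical but expresses it on different axes. A real upgrade changes the geometry as well, so this is the mildest possible case.

```python
import numpy as np

rng = np.random.default_rng(0)
old = rng.normal(size=(100, 16))                  # "old model" vectors
old /= np.linalg.norm(old, axis=1, keepdims=True)
new = np.roll(old, 1, axis=1)                     # simulated "new model" space

# Within each space, every pairwise similarity is unchanged.
within_ok = np.allclose(old @ old.T, new @ new.T)

# Across spaces, the same document no longer matches itself.
cross = np.sum(old * new, axis=1)                 # 1.0 if spaces were compatible
ids = np.arange(len(old))

# Self-retrieval: does each document find itself as the top hit?
same_space_hits = float(np.mean(np.argmax(new @ new.T, axis=1) == ids))
mixed_hits = float(np.mean(np.argmax(new @ old.T, axis=1) == ids))
```

Each space is fine on its own, but new queries against old vectors fall to roughly chance level, and nothing errors out to tell you.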

Anisotropy: raw BERT token vectors crowd into a narrow cone, so everything looks similar and cosine is poorly calibrated. That is one reason SBERT-style fine-tuning exists. And quality is silent: a bad embedder still returns some nearest neighbour, just the wrong one. Always evaluate on your domain: leaderboard rank ≠ your-domain rank.
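Anisotropy is easy to measure: average the cosine over all pairs. In the sketch below the "cone" is simulated by adding one shared offset to random vectors, and mean-centering is used only as a simple diagnostic. Neither is a claim about how BERT or SBERT behaves internally.

```python
import numpy as np

def mean_pairwise_cosine(X):
    """Average cosine over all distinct pairs. Near 0 means well spread."""
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    S = X @ X.T
    n = len(X)
    return float((S.sum() - np.trace(S)) / (n * (n - 1)))

rng = np.random.default_rng(0)
spread = rng.normal(size=(200, 32))       # isotropic toy vectors
cone = spread + 5.0                       # shared offset crowds them into a cone
centered = cone - cone.mean(axis=0)       # remove the common direction

score_spread = mean_pairwise_cosine(spread)
score_cone = mean_pairwise_cosine(cone)
score_centered = mean_pairwise_cosine(centered)
```

`score_cone` lands above 0.9 even though the underlying vectors are unrelated, which is the "everything looks similar" failure; the spread and centered sets score near zero.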

01 · 03 — you made it

You built a meaning-space.

The embedding matrix. Nearness as similarity, directions as relationships, analogy arithmetic. How word2vec learns it, how cosine measures it, how BERT makes it contextual and SBERT makes it searchable. Every token ID is now a vector the network can reason over — and the next stage is what lets those vectors talk to each other.

01·01 Neural network from scratch · a neuron, a layer, a loss, backprop ✓ done

01·02 Tokenization · text → integers · BPE, byte-level, the split ✓ done

01·03 Embeddings · turning a token ID into a vector of meaning ✓ complete

01·04 Attention & Transformers · letting vectors mix to form context next
