OpenAlice Academy — 02 / Math for ML

01 / 06

Intuition first

Three branches, three questions.

Each branch of math answers one question a neural network constantly asks. That's the whole map. Everything after this is the rigorous version of these three sentences.

LINEAR ALGEBRA

"Transform many numbers at once?"

A vector is a list of numbers — a point or arrow in space. A matrix transforms vectors. The most-run operation on Earth right now is y = W·x.

CALCULUS

"Nudge a knob — better or worse?"

The derivative is slope = sensitivity. The gradient is every derivative at once: a giant arrow pointing toward more error. So we step the opposite way. That's training.

PROBABILITY

"How confident should I be?"

Raw scores ("logits") aren't probabilities. Softmax squashes them into a distribution that sums to 1. Cross-entropy then measures how surprised the model was by the truth.

IN AN LLM

A transformer layer is these three, wired: a matmul (linear algebra) → a softmax over attention scores (probability) → another matmul → a nonlinearity. Training it is one long application of the chain rule (calculus). Remove any branch and there is no LLM.

02 / 06

The atom of everything

The dot product measures alignment.

Two vectors, same length. Multiply matching entries, add them up — one scalar. Big positive = same direction. Zero = perpendicular. Negative = opposing. Drag the two arrows below and watch it move.

FIG.02 — DRAG EITHER ARROWHEAD

a · b = a₁b₁ + a₂b₂ = ‖a‖‖b‖·cos θ

Starting vectors: a = (1.00, 0.30), b = (0.20, 1.00). Live readouts: ‖a‖ · ‖b‖, angle θ, a · b, cosine similarity.

IN AN LLM

Attention is literally a pile of dot products: query · key asks "how aligned is what I'm looking for with what this token offers?" Semantic search (Atlas's vector index) ranks by cosine similarity — a dot product of length-normalized vectors. The alignment intuition you just dragged is why vector search works.
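To make the alignment reading concrete, here is a minimal NumPy sketch that evaluates both sides of the formula for the figure's starting vectors. The variable names are illustrative and not taken from the page's own code.

```python
import numpy as np

# Starting vectors from the figure's readout.
a = np.array([1.00, 0.30])
b = np.array([0.20, 1.00])

dot = float(a @ b)                                      # a1*b1 + a2*b2
norms = float(np.linalg.norm(a) * np.linalg.norm(b))    # ||a|| * ||b||
cos_sim = dot / norms                                   # dot product of length-normalized vectors
theta_deg = float(np.degrees(np.arccos(cos_sim)))       # angle between the arrows

print(f"a . b          = {dot:.3f}")
print(f"cos similarity = {cos_sim:.3f}")
print(f"angle          = {theta_deg:.1f} degrees")
```

The dot product here is positive but well below ‖a‖‖b‖, which matches the picture: the two arrows lean the same way without being close to parallel.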

03 / 06

Dot products in bulk

A matmul is a table of dot products.

Each output entry is one dot product — row i of A with column j of B. A matrix multiply computes a whole table of them at once. Hover a cell to see which row meets which column.

The inner dimensions must match — A is m×n, B is n×p, out is m×p. "Shapes must line up" is the #1 beginner error. GPUs exist primarily to do this one operation blisteringly fast.

Cᵢⱼ = Σₖ Aᵢₖ·Bₖⱼ
// the workhorse layer:
y = W·x + b

Shapes in the figure: A is 2×3, B is 3×2, C is 2×2.

FIG.03 — MATMUL · HOVER A C-CELL

IN AN LLM

Every projection (the Q, K, V projections, the feed-forward blocks, the output head over the whole vocabulary) is a matmul. In a large model these add up to billions of multiply-adds per token. "Think in matrices, not loops" is the single biggest speed lever linear algebra buys you (10–100× on a GPU).
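The "table of dot products" view can be checked directly. The sketch below, using NumPy and random matrices with the figure's shapes, fills C one cell at a time and compares the result with the built-in `@` operator. The helper name `matmul_loops` is made up for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(2, 3))   # m x n
B = rng.normal(size=(3, 2))   # n x p

def matmul_loops(A, B):
    """C[i, j] = sum_k A[i, k] * B[k, j], one dot product per output cell."""
    m, n = A.shape
    n2, p = B.shape
    if n != n2:
        raise ValueError(f"inner dimensions differ: {n} vs {n2}")
    C = np.zeros((m, p))
    for i in range(m):
        for j in range(p):
            C[i, j] = A[i, :] @ B[:, j]   # row i of A with column j of B
    return C

C_loops = matmul_loops(A, B)
C_fast = A @ B                            # the same table, computed in one call

print(C_fast.shape)                       # (2, 2)
print(np.allclose(C_loops, C_fast))       # True
```

Both versions produce the same numbers. The vectorized call is the one that maps onto optimized matmul kernels, which is where the speed difference comes from.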

04 / 06

Slope = sensitivity

The derivative: nudge x, how much moves?

A derivative answers one question: "if I push x by a tiny ε, how much does f change?" Drag the point on the curve: the slope of the tangent line is the derivative at that spot, and with a single input the gradient is just that one number.

FIG.04 — TANGENT = LOCAL SLOPE · DRAG THE POINT

f(x+ε) ≈ f(x) + ε·(df/dx)
// the few you must know cold
d/dx(xⁿ) = n·x^(n−1)
d/dx(eˣ) = eˣ
d/dx σ(x) = σ(x)·(1−σ(x))

The figure plots f(x) = ¼x², whose slope is df/dx = ½x. Live readouts: point x, f(x), slope df/dx.
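One way to build trust in the rules above is to compare each with a finite-difference estimate, which is the "nudge x by a tiny ε" idea turned into code. This is a minimal sketch in plain Python; the test point x = 1.5 and the step size are arbitrary choices, not values from the page.

```python
import math

def numeric_derivative(f, x, eps=1e-6):
    """Central difference: (f(x + eps) - f(x - eps)) / (2 * eps)."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

x = 1.5  # arbitrary test point

# name -> (function, analytic derivative evaluated at x)
checks = {
    "x^3":     (lambda t: t ** 3,        3 * x ** 2),
    "e^x":     (math.exp,                math.exp(x)),
    "sigmoid": (sigmoid,                 sigmoid(x) * (1 - sigmoid(x))),
    "x^2/4":   (lambda t: 0.25 * t ** 2, 0.5 * x),
}

for name, (f, analytic) in checks.items():
    approx = numeric_derivative(f, x)
    print(f"{name:8s} numeric={approx:.6f}  analytic={analytic:.6f}")
```

The two columns should agree to several decimal places. The same comparison is a common way to debug a hand-written gradient.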

THE CHAIN RULE, IN ONE SENTENCE

If A affects B and B affects C, then A's effect on C is (A→B) × (B→C). Multiply the local slopes along the path. A network is a deep composition, so its derivative is a long product of local slopes — that product, computed efficiently, is backprop.
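As a small worked case of multiplying local slopes, take a two-step path chosen purely for illustration: B = A² and C = σ(B). The sketch multiplies the two local slopes and compares the product with a finite-difference estimate of the end-to-end slope.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def inner(a):          # A -> B
    return a * a

def outer(b):          # B -> C
    return sigmoid(b)

a = 0.7                # arbitrary input
b = inner(a)

slope_a_to_b = 2 * a                              # d(a^2)/da
slope_b_to_c = sigmoid(b) * (1 - sigmoid(b))      # sigmoid derivative, evaluated at b
chain = slope_a_to_b * slope_b_to_c               # (A -> B) x (B -> C)

eps = 1e-6
numeric = (outer(inner(a + eps)) - outer(inner(a - eps))) / (2 * eps)

print(f"chain rule: {chain:.6f}  finite difference: {numeric:.6f}")
```

Backprop applies the same product across many layers, reusing intermediate values such as `b` from the forward pass instead of recomputing them.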

05 / 06

Using the gradient to learn

Gradient descent: roll downhill.

The gradient points uphill (toward more loss). So step the opposite way. Click anywhere in the bowl to drop a ball and watch it descend. Then push the learning rate too high and watch it overshoot — exactly how real training diverges to NaN.

FIG.05 — LOSS BOWL · CLICK TO DROP A BALL

θₜ₊₁ = θₜ − η·∇L(θₜ)

Live readouts: step, position θ, loss L(θ), gradient ∇L, status. Default learning rate: η = 0.18.

η is the master knob. Too small → painfully slow. Too large → it overshoots the minimum and the loss diverges. Drag η past ~0.9 and the ball climbs out of the bowl — that's a real network blowing up to NaN, drawn.

IN AN LLM

This loop — forward → loss → gradient → step downhill — is the entire training algorithm. Real models use SGD (estimate the gradient from a random mini-batch, far cheaper) and its descendant Adam (per-parameter adaptive steps + momentum), but the core is still θ ← θ − η·(scaled gradient).
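The update rule is short enough to run by hand. The sketch below assumes a one-dimensional bowl L(θ) = θ², which is a stand-in and not necessarily the surface used in the figure, so the exact learning rate at which it diverges (above 1.0 for this bowl) will differ from the page's.

```python
def descend(theta0, lr, steps=25):
    """Plain gradient descent on the assumed bowl L(theta) = theta**2."""
    theta = theta0
    losses = []
    for _ in range(steps):
        grad = 2.0 * theta            # dL/dtheta for L = theta^2
        theta = theta - lr * grad     # step against the gradient
        losses.append(theta * theta)  # loss after the step
    return theta, losses

for label, lr in [("too small", 0.01), ("page default", 0.18), ("too large", 1.10)]:
    theta, losses = descend(3.0, lr)
    print(f"{label:12s} lr={lr:.2f}  final loss={losses[-1]:.3e}")
```

With the small rate the loss barely moves in 25 steps, with 0.18 it shrinks to nearly zero, and with 1.10 every step lands farther from the minimum than the last, which is the overshoot described above.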

06 / 06

Scores → probabilities

Softmax turns logits into confidence.

Raw scores ("logits") aren't probabilities — some are negative, they don't sum to 1. Softmax exponentiates and normalizes: every output lands in (0,1) and they sum to exactly 1. Drag the logit sliders and watch the distribution react.

softmax(z)ⱼ = e^(zⱼ) / Σₖ e^(zₖ)
// stability — subtract the max first
= e^(zⱼ−m) / Σ e^(zₖ−m), m = max z

Default sliders: z = (2.1, 0.5, −1.0, −0.5), temperature τ = 1.00.

Temperature rescales logits (z/τ) before softmax — low τ sharpens (greedy), high τ flattens (random). It's the dial behind "creative vs. deterministic."
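Here is a minimal NumPy sketch of the stable softmax with a temperature argument, run on the default slider values. The function names and the entropy helper are illustrative assumptions, not the page's own implementation.

```python
import numpy as np

def softmax(z, temperature=1.0):
    """Numerically stable softmax of z / temperature."""
    scaled = np.asarray(z, dtype=float) / temperature
    shifted = scaled - scaled.max()       # subtract the max before exponentiating
    exps = np.exp(shifted)
    return exps / exps.sum()

def entropy_bits(p):
    """Shannon entropy in bits, skipping entries that underflowed to zero."""
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

z = [2.1, 0.5, -1.0, -0.5]                # default slider values from the page

for tau in (0.5, 1.0, 2.0):
    p = softmax(z, tau)
    print(f"tau={tau}: p={np.round(p, 3)}  entropy={entropy_bits(p):.3f} bits")

# Large logits would overflow a naive exp(); the max shift keeps this finite.
print(softmax([1000.0, 999.0, 998.0]))
```

Lower τ concentrates probability on the largest logit and lowers the entropy, while higher τ spreads it out. The max shift changes nothing mathematically because softmax is unaffected by adding a constant to every logit.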

FIG.06 — P(next token) · Σ = 1.000

Live readouts: Σ probabilities = 1.000, argmax (greedy pick) = z₁, entropy (bits).

IN AN LLM

The final layer applies softmax over the entire vocabulary to produce P(next token | context). Attention also uses softmax — over alignment scores — to weight how much each previous token contributes. Sampling a reply means drawing from this exact distribution; temperature is the slider you just dragged.
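The attention use of softmax can be sketched for a single query in a few lines of NumPy. The keys, values, and query below are hand-picked toy numbers, and the division by √d is the usual scaled dot-product convention, which this page does not state, so treat both as assumptions.

```python
import numpy as np

def softmax(z):
    shifted = z - z.max()
    exps = np.exp(shifted)
    return exps / exps.sum()

d = 4                                     # toy key/query dimension (assumed)
keys = np.array([[1.0, 0.0, 0.0, 0.0],    # one key per previous token
                 [0.0, 1.0, 0.0, 0.0],
                 [0.0, 0.0, 1.0, 0.0]])
values = np.array([[1.0, 0.0],            # one value per previous token
                   [0.0, 1.0],
                   [1.0, 1.0]])
query = np.array([0.0, 2.0, 0.0, 0.0])    # built to align with token 1's key

scores = keys @ query / np.sqrt(d)        # one query . key dot product per token
weights = softmax(scores)                 # how much each token contributes
output = weights @ values                 # weighted sum of values

print(np.round(weights, 3))
print(np.round(output, 3))
```

Token 1 gets the largest weight because its key is the only one aligned with the query, yet the other tokens still contribute, since softmax never assigns exactly zero.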

THE PUNCHLINE · WHY THIS PAIRING IS EVERYWHERE

It all collapses to p − y.

Combine softmax and cross-entropy, differentiate the loss w.r.t. the logits, and the softmax Jacobian and the log's derivative cancel almost everything, collapsing to predicted minus true. Over-predicted a class? Push its logit down by the over-prediction. That clean, cheap gradient is a big reason softmax + cross-entropy is the default head on multi-class classifiers and LLMs.

// cross-entropy: −log of the prob you gave the truth
Lᵢ = −log( p_yᵢ )
// differentiate w.r.t. the logits → everything cancels
∂L / ∂zⱼ = pⱼ − yⱼ
// (predicted) − (one-hot truth)
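The p − y result is easy to verify numerically. This sketch, in NumPy with an arbitrarily chosen true class, compares the claimed gradient with a finite-difference estimate of the cross-entropy loss with respect to each logit.

```python
import numpy as np

def softmax(z):
    shifted = z - z.max()
    exps = np.exp(shifted)
    return exps / exps.sum()

def cross_entropy(z, true_class):
    """L = -log(softmax(z)[true_class])."""
    return -np.log(softmax(z)[true_class])

z = np.array([2.1, 0.5, -1.0, -0.5])   # logits (the page's slider defaults)
true_class = 1                         # arbitrary choice for this check
y = np.zeros_like(z)
y[true_class] = 1.0                    # one-hot truth

analytic = softmax(z) - y              # the claimed gradient, p - y

eps = 1e-6
numeric = np.zeros_like(z)
for j in range(len(z)):
    bump = np.zeros_like(z)
    bump[j] = eps
    numeric[j] = (cross_entropy(z + bump, true_class)
                  - cross_entropy(z - bump, true_class)) / (2 * eps)

print(np.round(analytic, 4))
print(np.allclose(analytic, numeric, atol=1e-6))
```

The entry for the true class is negative, so a descent step raises that logit, and every other entry is positive, so those logits get pushed down in proportion to how much probability they were given.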

Linear algebra moved the numbers. Calculus told us which way is downhill. Probability scored them. You now hold the literacy to read essentially every other article in the library — attention, embeddings, quantization, LoRA — they're all variations on these three.

01 Neural network from scratch · a neuron, a layer, a loss, backprop

02 Math for ML · vectors, gradients, softmax: only the parts you use (you are here)

03 microGPT · a whole tiny LLM · same autograd, + attention

04 LLM from scratch · a 10M-param GPT trained on a laptop

NEXT · RUNG 0300 · 03

microGPT — build an LLM from scratch

Same autograd you just learned, plus attention: query·key dot products → softmax → a weighted sum of values. Every piece on this page, now wired into a working transformer.
