openalicelabs / academy
COURSE ML-FOUND · LESSON 00 · 02 · EST. READ ~11 MIN · v0.1
OPENALICE LABORATORIES · EDUCATION PATH · RUNG 02

The math
you actually need.

An LLM is three pieces of high-school-plus math in a trench coat: linear algebra to move numbers in bulk, calculus to find which way is "less wrong," and probability to turn scores into confidence. Understand y = Wx, the chain rule, and softmax — and you understand the load-bearing 80% of what happens inside GPT.

FIG.00 — RECURSIVE MOTIF
FIG.0A — RECURSIVE BINARY TREE · SELF-SIMILAR GRADIENT FLOW · THE CHAIN RULE, DRAWN

A tree that branches into smaller copies of itself. The chain rule has the same shape — a long product of local slopes, each gradient a scaled copy of the one after it, flowing back from the root to every leaf. All of training is this one idea, applied at scale.

FIG.01 — ONE TRANSFORMER LAYER · MATMUL → SOFTMAX → MATMUL · THE THREE BRANCHES, WIRED
MOVE THE NUMBERS · linear algebra — y = W·x (a table of dot products)
FIX THE NUMBERS · calculus — ∇L tells us which way is downhill
SCORE THE NUMBERS · probability — softmax → P(next token)
THE PUNCHLINE · ∂L/∂z = p − y — predicted minus true
01 / 06
Intuition first

Three branches, three questions.

Each branch of math answers one question a neural network constantly asks. That's the whole map. Everything after this is the rigorous version of these three sentences.

LINEAR ALGEBRA

"Transform many numbers at once?"

A vector is a list of numbers — a point or arrow in space. A matrix transforms vectors. The most-run operation on Earth right now is y = W·x.

CALCULUS

"Nudge a knob — better or worse?"

The derivative is slope = sensitivity. The gradient is every derivative at once: a giant arrow pointing toward more error. So we step the opposite way. That's training.

PROBABILITY

"How confident should I be?"

Raw scores ("logits") aren't probabilities. Softmax squashes them into a distribution that sums to 1. Cross-entropy then measures how surprised the model was by the truth.

IN AN LLM

A transformer layer is these three, wired: a matmul (linear algebra) → a softmax over attention scores (probability) → another matmul → a nonlinearity. Training it is one long application of the chain rule (calculus). Remove any branch and there is no LLM.

02 / 06
The atom of everything

The dot product measures alignment.

Two vectors, same length. Multiply matching entries, add them up — one scalar. Big positive = same direction. Zero = perpendicular. Negative = opposing. Drag the two arrows below and watch it move.

FIG.02 — DRAG EITHER ARROWHEAD
a · b = a₁b₁ + a₂b₂ = ‖a‖‖b‖·cos θ
vector a = ( 1.00, 0.30 ) · vector b = ( 0.20, 1.00 )
Readouts: ‖a‖ · ‖b‖, angle θ, a · b, cos similarity
IN AN LLM

Attention is literally a pile of dot products: query · key asks "how aligned is what I'm looking for with what this token offers?" Semantic search (Atlas's vector index) ranks by cosine similarity — a dot product of length-normalized vectors. The alignment intuition you just dragged is why vector search works.
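The alignment idea can be checked by hand. The following is a minimal sketch in plain Python, using the starting vectors shown in FIG.02; the helper names `dot`, `norm`, and `cosine_similarity` are illustrative, not part of any library.

```python
import math

def dot(a, b):
    """Multiply matching entries and add them up."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    """Euclidean length of a vector."""
    return math.sqrt(dot(a, a))

def cosine_similarity(a, b):
    """Dot product of the length-normalized vectors; lies in [-1, 1]."""
    return dot(a, b) / (norm(a) * norm(b))

a = [1.00, 0.30]   # starting vector a from FIG.02
b = [0.20, 1.00]   # starting vector b from FIG.02

cos_theta = cosine_similarity(a, b)
print(dot(a, b))                            # a . b
print(cos_theta)                            # cos(theta)
print(math.degrees(math.acos(cos_theta)))   # theta in degrees
```

Because cosine similarity divides out both lengths, it only reports direction, which is why it suits ranking embeddings of different magnitudes.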

03 / 06
Dot products in bulk

A matmul is a table of dot products.

Each output entry is one dot product — row i of A with column j of B. A matrix multiply computes a whole table of them at once. Hover a cell to see which row meets which column.

The inner dimensions must match: A is m×n, B is n×p, and the output is m×p. Forgetting that shapes must line up is the #1 beginner error. Modern GPUs are built to do this one operation blisteringly fast.

Cᵢⱼ = Σₖ Aᵢₖ·Bₖⱼ
// the workhorse layer:
y = W·x + b
FIG.03 — MATMUL · HOVER A C-CELL
Shapes: A is 2×3 · B is 3×2 · C is 2×2
IN AN LLM

Every projection (the Q, K, V projections, the feed-forward blocks, the output head over the whole vocabulary) is a matmul. For a large model these add up to billions of multiply-adds per token. "Think in matrices, not loops" is the single biggest speed lever linear algebra buys you (10–100× on a GPU).
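To make "a table of dot products" concrete, here is a deliberately slow, loop-based sketch in plain Python with the same 2×3 times 3×2 shapes as FIG.03. The matrix entries and the function names `matmul` and `linear` are made up for illustration; real code would call an optimized library instead.

```python
def matmul(A, B):
    """C[i][j] = sum over k of A[i][k] * B[k][j], for list-of-rows matrices."""
    m, n = len(A), len(A[0])
    n_b, p = len(B), len(B[0])
    if n != n_b:
        raise ValueError(f"inner dimensions differ: {m}x{n} times {n_b}x{p}")
    return [
        [sum(A[i][k] * B[k][j] for k in range(n)) for j in range(p)]
        for i in range(m)
    ]

def linear(W, x, b):
    """The workhorse layer y = W.x + b for a single input vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

A = [[1, 2, 3],
     [4, 5, 6]]        # 2x3
B = [[7, 8],
     [9, 10],
     [11, 12]]         # 3x2

C = matmul(A, B)       # 2x2
print(C)               # [[58, 64], [139, 154]]
print(linear(A, [1, 0, -1], [0.5, 0.5]))   # [-1.5, -1.5]
```

Calling `matmul(A, A)` raises the shape error described above, since 2×3 times 2×3 has mismatched inner dimensions.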

04 / 06
Slope = sensitivity

The derivative: nudge x, how much moves?

A derivative answers one question: "if I push x by a tiny ε, how much does f change?" Drag the point on the curve — the tangent line is the slope, and the slope is the gradient at that spot.

FIG.04 — TANGENT = LOCAL SLOPE · DRAG THE POINT
f(x+ε) ≈ f(x) + ε·(df/dx)
// the few you must know cold
d/dx(xⁿ) = n·x^(n−1)
d/dx(eˣ) = eˣ
d/dx σ(x) = σ(x)·(1−σ(x))
Readouts: point x · f(x) = ¼x² · slope df/dx = ½x
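The "nudge x by a tiny ε" reading can be tested numerically. This sketch compares each closed-form derivative listed above with a central-difference estimate, a slightly more accurate variant of the one-sided nudge; the function names and the test point x = 1.5 are arbitrary choices for illustration.

```python
import math

def f(x):
    return 0.25 * x * x          # the curve in FIG.04

def df(x):
    return 0.5 * x               # its slope, from the power rule

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def numeric_slope(fn, x, eps=1e-6):
    """Central difference: (fn(x + eps) - fn(x - eps)) / (2 * eps)."""
    return (fn(x + eps) - fn(x - eps)) / (2.0 * eps)

x = 1.5
s = sigmoid(x)

# Each line prints the numeric estimate next to the closed-form slope.
print(numeric_slope(f, x), df(x))
print(numeric_slope(math.exp, x), math.exp(x))
print(numeric_slope(sigmoid, x), s * (1.0 - s))
```

The two numbers on each line should agree to several decimal places; a mismatch is the usual sign of a wrong hand-derived gradient.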
THE CHAIN RULE, IN ONE SENTENCE

If A affects B and B affects C, then A's effect on C is (A→B) × (B→C). Multiply the local slopes along the path. A network is a deep composition, so its derivative is a long product of local slopes — that product, computed efficiently, is backprop.
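As a small worked instance of multiplying local slopes, the sketch below chains three steps (an affine map, a square, a sigmoid) and compares the product of their local slopes with a numeric estimate. The particular chain is invented for illustration and is not a real network layer.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def forward(x):
    """A three-step chain: x -> u -> v -> out."""
    u = 3.0 * x + 1.0
    v = u * u
    out = sigmoid(v)
    return u, v, out

def backward(x):
    """Multiply the local slopes along the path, last step first."""
    u, v, out = forward(x)
    dout_dv = out * (1.0 - out)   # slope of sigmoid at v
    dv_du = 2.0 * u               # slope of u**2 at u
    du_dx = 3.0                   # slope of 3x + 1
    return dout_dv * dv_du * du_dx

x = 0.2
eps = 1e-6
numeric = (forward(x + eps)[2] - forward(x - eps)[2]) / (2.0 * eps)
print(backward(x), numeric)       # the two values should closely agree
```

Backprop does the same multiplication, but reuses the intermediate values from the forward pass so that every parameter's slope comes out of one backward sweep.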

05 / 06
Using the gradient to learn

Gradient descent: roll downhill.

The gradient points uphill (toward more loss). So step the opposite way. Click anywhere in the bowl to drop a ball and watch it descend. Then push the learning rate too high and watch it overshoot — exactly how real training diverges to NaN.

FIG.05 — LOSS BOWL · CLICK TO DROP A BALL
θₜ₊₁ = θₜ − η·∇L(θₜ)
Readouts: step · position θ · loss L(θ) · gradient ∇L · status
η slider: 0.18

η is the master knob. Too small → painfully slow. Too large → it overshoots the minimum and the loss diverges. Drag η past ~0.9 and the ball climbs out of the bowl — that's a real network blowing up to NaN, drawn.
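The same behaviour shows up on the simplest possible bowl. This sketch assumes L(θ) = θ², which is not necessarily the surface drawn in FIG.05, and runs the update rule with a calm and an excessive learning rate; the starting point and step count are arbitrary.

```python
def loss(theta):
    return theta * theta          # assumed bowl, not the one in FIG.05

def grad(theta):
    return 2.0 * theta            # dL/dtheta for the assumed bowl

def descend(theta, eta, steps):
    """Apply theta <- theta - eta * grad(theta) and record the path."""
    path = [theta]
    for _ in range(steps):
        theta = theta - eta * grad(theta)
        path.append(theta)
    return path

calm = descend(theta=2.0, eta=0.18, steps=20)
wild = descend(theta=2.0, eta=1.10, steps=20)

print(loss(calm[-1]))   # shrinks toward 0
print(loss(wild[-1]))   # grows: each step overshoots further than the last
```

On this bowl each step multiplies θ by (1 − 2η), so the iteration blows up once η exceeds 1. The exact threshold depends on the curvature of the loss, which is why it differs from the ~0.9 quoted for the figure.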

IN AN LLM

This loop — forward → loss → gradient → step downhill — is the entire training algorithm. Real models use SGD (estimate the gradient from a random mini-batch, far cheaper) and its descendant Adam (per-parameter adaptive steps + momentum), but the core is still θ ← θ − η·(scaled gradient).
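A toy version of the mini-batch idea is sketched below: plain SGD (not Adam) fitting a single weight w in y = w·x, estimating the gradient from 8 random examples per step. The dataset, `TRUE_W`, the batch size, and the learning rate are all hypothetical values chosen so the example converges.

```python
import random

random.seed(0)
TRUE_W = 3.0   # hypothetical target slope for the toy data
data = [(k / 10.0, TRUE_W * k / 10.0) for k in range(-20, 21)]

def batch_grad(w, batch):
    """Gradient of the mean squared error (w*x - y)**2 over one mini-batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

w = 0.0
eta = 0.1
for step in range(200):
    batch = random.sample(data, 8)        # random mini-batch, far cheaper than all data
    w = w - eta * batch_grad(w, batch)    # theta <- theta - eta * (estimated gradient)

print(w)   # ends close to TRUE_W
```

Each step sees only a noisy estimate of the full gradient, yet the estimates point the right way on average, so the weight still settles near the target.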

06 / 06
Scores → probabilities

Softmax turns logits into confidence.

Raw scores ("logits") aren't probabilities — some are negative, they don't sum to 1. Softmax exponentiates and normalizes: every output lands in (0,1) and they sum to exactly 1. Drag the logit sliders and watch the distribution react.

softmax(z)ⱼ = e^(zⱼ) / Σₖ e^(zₖ)
// stability — subtract the max first
softmax(z)ⱼ = e^(zⱼ−m) / Σₖ e^(zₖ−m),  m = maxₖ zₖ
Logit sliders: 2.1 · 0.5 · -1.0 · -0.5
Temperature slider: 1.00

Temperature rescales logits (z/τ) before softmax: low τ sharpens the distribution toward the greedy argmax, high τ flattens it toward uniform randomness. It's the dial behind "creative vs. deterministic."
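The formula, the max-subtraction trick, and the temperature dial fit in a few lines. This is a minimal sketch in plain Python using the four logit values shown on this page; the `softmax` and `entropy_bits` helpers are written here for illustration rather than taken from a library.

```python
import math

def softmax(logits, temperature=1.0):
    """Stable softmax of logits / temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    m = max(scaled)                           # subtract the max first
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_bits(p):
    """Shannon entropy in bits; zero-probability terms contribute 0."""
    return -sum(q * math.log2(q) for q in p if q > 0.0)

logits = [2.1, 0.5, -1.0, -0.5]
for tau in (0.5, 1.0, 2.0):
    p = softmax(logits, tau)
    # Lower tau concentrates mass on the top logit; higher tau spreads it out.
    print(tau, [round(q, 3) for q in p], round(entropy_bits(p), 3))
```

Subtracting the max leaves the result unchanged but keeps every exponent at or below zero, so very large logits such as 1000 no longer overflow.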

FIG.06 — P(next token) · Σ = 1.000

Readouts: Σ probabilities = 1.000 · argmax (sampled greedily) = z₁ · entropy (bits)
IN AN LLM

The final layer applies softmax over the entire vocabulary to produce P(next token | context). Attention also uses softmax — over alignment scores — to weight how much each previous token contributes. Sampling a reply means drawing from this exact distribution; temperature is the slider you just dragged.
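The attention use of softmax can be sketched for a single query. The toy vectors below are hypothetical, and the division by the square root of the dimension is the usual scaled dot-product convention, assumed here because this page does not spell it out.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(query, keys, values):
    """One query over earlier tokens: dot products -> softmax -> weighted sum."""
    scale = math.sqrt(len(query))                     # assumed 1/sqrt(d) scaling
    scores = [dot(query, k) / scale for k in keys]    # alignment scores
    weights = softmax(scores)                         # non-negative, sums to 1
    dim = len(values[0])
    mixed = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, mixed

# Hypothetical toy vectors: two dimensions, three previous tokens.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]

weights, mixed = attend(query, keys, values)
print(weights)   # the best-aligned key gets the largest weight
print(mixed)     # a blend of the value vectors, dominated by the first
```

The key pointing the same way as the query receives the most weight, and the opposing key receives the least.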

THE PUNCHLINE · WHY THIS PAIRING IS EVERYWHERE

It all collapses
to p − y.

Combine softmax and cross-entropy, differentiate the loss w.r.t. the logits, and the messy Jacobian and the log's derivative cancel almost everything — collapsing to predicted minus true. Over-predicted a class? Push its logit down by the over-prediction. That clean, cheap gradient is the reason softmax+cross-entropy is the default head on every classifier and every LLM.

// cross-entropy: −log of the prob you gave the truth
L = −log( p_yᵢ )
// differentiate w.r.t. the logits → everything cancels
∂L / ∂z = p − y   // (predicted) − (one-hot truth)
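The cancellation follows from the softmax derivative ∂pₖ/∂zⱼ = pₖ·(δₖⱼ − pⱼ): applied to L = −log(p_c) for the true class c, the 1/p_c from the log cancels the leading p_c and leaves pⱼ − δ_cⱼ. The sketch below checks that result numerically; the logits and the choice of true class are hypothetical example values.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(z, true_index):
    """L = -log(probability assigned to the true class)."""
    return -math.log(softmax(z)[true_index])

def analytic_grad(z, true_index):
    """dL/dz = p - y, with y the one-hot truth."""
    p = softmax(z)
    return [p_k - (1.0 if k == true_index else 0.0) for k, p_k in enumerate(p)]

def numeric_grad(z, true_index, eps=1e-6):
    """Central differences, nudging one logit at a time."""
    grads = []
    for k in range(len(z)):
        up = list(z)
        up[k] += eps
        down = list(z)
        down[k] -= eps
        diff = cross_entropy(up, true_index) - cross_entropy(down, true_index)
        grads.append(diff / (2.0 * eps))
    return grads

z = [2.1, 0.5, -1.0, -0.5]   # example logits
true_index = 1               # hypothetical correct class

print(analytic_grad(z, true_index))
print(numeric_grad(z, true_index))
```

The two printed lists should agree closely. The entry for the true class is negative (raise that logit) and every other entry is positive (lower those logits).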

Linear algebra moved the numbers. Calculus told us which way is downhill. Probability scored them. You now hold the literacy to read essentially every other article in the library — attention, embeddings, quantization, LoRA — they're all variations on these three.

01 Neural network from scratch · a neuron, a layer, a loss, backprop · done
02 Math for ML · vectors, gradients, softmax — only the parts you use · ✓ you are here
03 microGPT · a whole tiny LLM · same autograd, + attention · next
04 LLM from scratch · a 10M-param GPT trained on a laptop · locked
NEXT · RUNG 03 · 00 · 03

microGPT — build an LLM from scratch

Same autograd you just learned, plus attention: query·key dot products → softmax → a weighted sum of values. Every piece on this page, now wired into a working transformer.
