An LLM is three pieces of high-school-plus math in a trench coat: linear algebra to move numbers in bulk, calculus to find which way is "less wrong," and probability to turn scores into confidence. Understand y = Wx, the chain rule, and softmax — and you understand the load-bearing 80% of what happens inside GPT.
A tree that branches into smaller copies of itself. The chain rule has the same shape — a long product of local slopes, each gradient a scaled copy of the one after it, flowing back from the root to every leaf. All of training is this one idea, applied at scale.
Each branch of math answers one question a neural network constantly asks. That's the whole map. Everything after this is the rigorous version of these three sentences.
A vector is a list of numbers — a point or arrow in space. A matrix transforms vectors. The most-run operation on Earth right now is y = W·x.
The derivative is slope = sensitivity. The gradient is every derivative at once: a giant arrow pointing toward more error. So we step the opposite way. That's training.
Raw scores ("logits") aren't probabilities. Softmax squashes them into a distribution that sums to 1. Cross-entropy then measures how surprised the model was by the truth.
A transformer layer is these three, wired: a matmul (linear algebra) → a softmax over attention scores (probability) → another matmul → a nonlinearity. Training it is one long application of the chain rule (calculus). Remove any branch and there is no LLM.
Two vectors, same length. Multiply matching entries, add them up — one scalar. Big positive = same direction. Zero = perpendicular. Negative = opposing. Drag the two arrows below and watch it move.
Attention is literally a pile of dot products: query · key asks "how aligned is what I'm looking for with what this token offers?" Semantic search (Atlas's vector index) ranks by cosine similarity — a dot product of length-normalized vectors. The alignment intuition you just dragged is why vector search works.
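As a minimal sketch of the alignment idea, the snippet below computes a dot product and a cosine similarity in plain Python. The vectors and the names `query`, `same`, `perp`, and `opp` are made up for illustration; real embeddings have hundreds or thousands of dimensions.

```python
import math

def dot(a, b):
    """Multiply matching entries and add them up: one scalar out."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Dot product of the two vectors after scaling each to length 1."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Hypothetical 2-D "embeddings", chosen so the geometry is easy to see.
query = [1.0, 2.0]
same = [2.0, 4.0]    # same direction as query, twice as long
perp = [-2.0, 1.0]   # perpendicular to query
opp = [-1.0, -2.0]   # opposite direction

print(dot(query, same), dot(query, perp), dot(query, opp))
print(cosine_similarity(query, same), cosine_similarity(query, opp))
```

Cosine similarity ignores length, so `same` scores as a perfect match even though it is twice as long as `query`.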
Each output entry is one dot product — row i of A with column j of B. A matrix multiply computes a whole table of them at once. Hover a cell to see which row meets which column.
The inner dimensions must match: A is m×n, B is n×p, and the output is m×p. "Shapes must line up" is the #1 beginner error. GPUs are so effective for deep learning largely because they do this one operation blisteringly fast.
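The shape rule can be sketched as a deliberately slow, loop-based matrix multiply over nested lists. This is for reading only, and the function name `matmul` is an illustrative choice; real code would call an optimized library routine instead.

```python
def matmul(A, B):
    """Multiply an m x n matrix by an n x p matrix (both as nested lists)."""
    m, n = len(A), len(A[0])
    n_b, p = len(B), len(B[0])
    if n != n_b:
        # The classic beginner error: inner dimensions do not line up.
        raise ValueError(f"shape mismatch: {m}x{n} times {n_b}x{p}")
    # Entry (i, j) is the dot product of row i of A with column j of B.
    return [
        [sum(A[i][k] * B[k][j] for k in range(n)) for j in range(p)]
        for i in range(m)
    ]

A = [[1, 2, 3],
     [4, 5, 6]]        # 2 x 3
B = [[7, 8],
     [9, 10],
     [11, 12]]         # 3 x 2

C = matmul(A, B)       # 2 x 2
print(C)               # [[58, 64], [139, 154]]
```

Calling `matmul(A, A)` raises the shape error, because a 2×3 matrix cannot multiply another 2×3 matrix.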
Every projection (the Q, K, V projections, the feed-forward blocks, the output head over the whole vocabulary) is a matmul. In a large model these add up to billions of multiply-adds per token. "Think in matrices, not loops" is the single biggest speed lever linear algebra buys you (often 10–100× on a GPU).
A derivative answers one question: "if I push x by a tiny ε, how much does f change?" Drag the point on the curve — the tangent line is the slope, and the slope is the gradient at that spot.
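The "push x by a tiny ε" question can be asked numerically. The sketch below uses a central difference; the test function `f(x) = x²` and the step size `eps` are arbitrary choices for illustration.

```python
def numerical_derivative(f, x, eps=1e-6):
    """Estimate the slope of f at x by nudging x a little in both directions."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

def f(x):
    return x ** 2

# The exact derivative of x^2 is 2x, so the slope at x = 3 should be close to 6.
slope = numerical_derivative(f, 3.0)
print(slope)
```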
If A affects B and B affects C, then A's effect on C is (A→B) × (B→C). Multiply the local slopes along the path. A network is a deep composition, so its derivative is a long product of local slopes — that product, computed efficiently, is backprop.
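A small worked instance of multiplying local slopes along a path, assuming a two-step composition chosen for illustration (`inner` and `outer` are hypothetical names). The result is checked against a finite-difference estimate of the whole composition.

```python
def inner(x):
    return 3.0 * x + 1.0   # A -> B, local slope is 3

def outer(u):
    return u ** 2          # B -> C, local slope is 2u

x = 2.0
u = inner(x)               # forward pass: 7.0
y = outer(u)               # forward pass: 49.0

# Backward pass: multiply the local slopes along the path.
d_u_d_x = 3.0
d_y_d_u = 2.0 * u          # 14.0
d_y_d_x = d_y_d_u * d_u_d_x  # 42.0

# Sanity check: nudge x directly and measure the change in y.
eps = 1e-6
numeric = (outer(inner(x + eps)) - outer(inner(x - eps))) / (2 * eps)
print(d_y_d_x, numeric)
```

Backprop does the same thing for millions of parameters, reusing each local slope instead of recomputing it for every path.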
The gradient points uphill (toward more loss). So step the opposite way. Click anywhere in the bowl to drop a ball and watch it descend. Then push the learning rate too high and watch it overshoot — exactly how real training diverges to NaN.
η is the master knob. Too small → painfully slow. Too large → it overshoots the minimum and the loss diverges. Drag η past ~0.9 and the ball climbs out of the bowl — that's a real network blowing up to NaN, drawn.
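Here is a one-dimensional sketch of the same behavior, assuming the simplest possible bowl, f(x) = x². For this particular function each step multiplies x by (1 − 2η), so it diverges once η exceeds 1. That threshold belongs to this toy function, not to the interactive bowl above.

```python
def gradient_descent(lr, x0=5.0, steps=50):
    """Minimize f(x) = x^2 by repeatedly stepping against the gradient 2x."""
    x = x0
    for _ in range(steps):
        grad = 2.0 * x      # the derivative of x^2 points uphill
        x = x - lr * grad   # so step the opposite way
    return x

settled = gradient_descent(lr=0.1)    # shrinks toward the minimum at 0
blown_up = gradient_descent(lr=1.1)   # each step overshoots further than the last
print(settled, blown_up)
```

In a real network the runaway values eventually overflow floating point, which is where the NaN comes from.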
This loop — forward → loss → gradient → step downhill — is the entire training algorithm. Real models use SGD (estimate the gradient from a random mini-batch, far cheaper) and its descendant Adam (per-parameter adaptive steps + momentum), but the core is still θ ← θ − η·(scaled gradient).
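The loop can be sketched end to end on a toy problem: fitting a single weight `w` so that `w * x` matches `y = 2x`. Everything here (the data, the batch size of 4, the learning rate, the seed) is an illustrative assumption, and this is plain mini-batch SGD rather than Adam.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

true_w = 2.0
data = [(i / 10, true_w * i / 10) for i in range(-10, 11)]  # noise-free (x, y) pairs

w = 0.0     # the single parameter, theta
lr = 0.1    # learning rate, eta
losses = []

for step in range(200):
    batch = random.sample(data, 4)                 # random mini-batch
    preds = [w * x for x, _ in batch]              # forward
    loss = sum((p - y) ** 2 for p, (_, y) in zip(preds, batch)) / len(batch)  # loss
    grad = sum(2 * (p - y) * x for p, (x, y) in zip(preds, batch)) / len(batch)  # gradient
    w -= lr * grad                                 # step downhill
    losses.append(loss)

print(w, losses[0], losses[-1])
```

The mini-batch gradient is only an estimate of the full-data gradient, but it is much cheaper to compute and still drives `w` toward 2.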
Raw scores ("logits") aren't probabilities — some are negative, they don't sum to 1. Softmax exponentiates and normalizes: every output lands in (0,1) and they sum to exactly 1. Drag the logit sliders and watch the distribution react.
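A minimal softmax in plain Python, as a sketch. Subtracting the largest logit before exponentiating is a common numerical-stability trick that is not mentioned above; it leaves the result unchanged because the shift cancels in the ratio. The example logits are made up.

```python
import math

def softmax(logits):
    """Exponentiate, then normalize so the outputs sum to 1."""
    m = max(logits)                            # shift for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, -1.0]   # raw scores: one negative, and they do not sum to 1
probs = softmax(logits)
print(probs, sum(probs))
```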
Temperature rescales logits (z/τ) before softmax: low τ sharpens the distribution (approaching greedy, always picking the top token), high τ flattens it (approaching uniform random). It's the dial behind "creative vs. deterministic."
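The effect of τ can be seen by running the same logits through softmax at three temperatures. This is a sketch with made-up logits; the function name `softmax_with_temperature` is illustrative.

```python
import math

def softmax_with_temperature(logits, tau):
    """Divide logits by tau, then apply a numerically stable softmax."""
    if tau <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / tau for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]
sharp = softmax_with_temperature(logits, 0.1)   # low tau: almost all mass on the top score
base = softmax_with_temperature(logits, 1.0)    # tau = 1: ordinary softmax
flat = softmax_with_temperature(logits, 10.0)   # high tau: close to uniform
print(sharp, base, flat)
```

The ranking of the tokens never changes with τ; only how much probability the top choices take from the rest.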
The final layer applies softmax over the entire vocabulary to produce P(next token | context). Attention also uses softmax — over alignment scores — to weight how much each previous token contributes. Sampling a reply means drawing from this exact distribution; temperature is the slider you just dragged.
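The attention use of softmax can be sketched for a single query: score each key with a dot product, softmax the scores, and take the weighted sum of the values. The division by √d follows the standard scaled dot-product formulation, which is an addition not stated above, and the tiny vectors are invented for illustration.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Single-query attention: dot products -> softmax -> weighted sum of values."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    output = [
        sum(w * value[j] for w, value in zip(weights, values))
        for j in range(len(values[0]))
    ]
    return weights, output

query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]        # aligned, perpendicular, opposed
values = [[10.0, 0.0], [0.0, 10.0], [-10.0, 0.0]]

weights, output = attend(query, keys, values)
print(weights, output)
```

The token whose key aligns best with the query gets the largest weight, so its value dominates the output.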
Combine softmax and cross-entropy, differentiate the loss w.r.t. the logits, and most of the softmax Jacobian cancels against the derivative of the log. What remains is predicted minus true: for logit i, the gradient is p_i − y_i, where p is the softmax output and y is the one-hot target. Over-predicted a class? Push its logit down by the over-prediction. That clean, cheap gradient is a big reason softmax+cross-entropy is the default head on nearly every classifier and LLM.
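The "predicted minus true" result can be checked numerically. The sketch below compares the closed-form gradient against a finite-difference estimate; the logits and the target index are arbitrary illustrative values.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Surprise at the truth: negative log of the probability given to the true class."""
    return -math.log(softmax(logits)[target])

logits = [2.0, 0.5, -1.0]
target = 1   # index of the true class

# Closed form: predicted probabilities minus the one-hot truth.
probs = softmax(logits)
analytic = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]

# Finite differences: nudge each logit and measure the change in loss.
eps = 1e-6
numeric = []
for i in range(len(logits)):
    up = list(logits)
    down = list(logits)
    up[i] += eps
    down[i] -= eps
    numeric.append((cross_entropy(up, target) - cross_entropy(down, target)) / (2 * eps))

print(analytic)
print(numeric)
```

The entry for the true class is negative (raise that logit) and the others are positive (lower them), and the entries sum to zero.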
Linear algebra moved the numbers. Calculus told us which way is downhill. Probability scored them. You now hold the literacy to read essentially every other article in the library — attention, embeddings, quantization, LoRA — they're all variations on these three.
Same autograd you just learned, plus attention: query·key dot products → softmax → a weighted sum of values. Every piece on this page, now wired into a working transformer.