You don't write the rules. You show it examples — pairs of (input, correct answer) — and it tunes its own internal numbers until it's right. That tuning is learning. Finish this page and you will understand backpropagation.
loading…
A tree that branches into smaller copies of itself. Backpropagation has the same shape — the chain rule applied recursively, each gradient a scaled copy of the one after it, flowing back from the root to every leaf.
A neuron does exactly three things to its inputs. Drag the weights below and watch the output move in real time — this is the whole forward mechanic.
Weights rotate the boundary. The bias shifts it. Without a bias, the dividing line is pinned through the origin (0,0) — feed in x = 0 and "no matter how you multiply zeros, they stay zero." Add a bias and the line floats free, exactly like y = m·x + b needs both a slope and an intercept.
One neuron draws one line. Stack them into layers and the boundary can bend. The hidden layer is the entire reason deep networks exist.
Each layer takes the numbers from the layer before, and every neuron computes its own weighted sum. Written as a matrix, the whole layer collapses into one multiply plus one add — which is why GPUs matter: a matrix multiply is the one thing hardware does blisteringly fast.
Feed numbers in on the left, apply the neuron rule layer by layer, read the answer off the right. Press Step → and watch the signal propagate.
To improve, the network needs to measure its error as a single number. Loss = 0 is perfect; bigger loss = more wrong. The whole goal of training is to make it small.
The forward pass gives a guess. Maybe right, maybe wildly off.
Subtract the correct answer. Square it so over- and under-shoot both count.
Add across all outputs. Big mistakes hurt more than small ones.
Gradient descent needs the slope ∂C/∂w for every weight. Backprop gets all of them in one backward pass — using the chain rule.
If A affects B and B affects C, then A's effect on C is (A→B) × (B→C). Multiply the local slopes along the path.
Start at the output where you know the error. Walk backward. At each step multiply by the local slope, handing each neuron its share of the blame. Forward computes the answer; backward computes who's responsible, and by how much.
The gradient is negative, so the step raises w
from 3 toward 5 — the value that makes p exactly right.
That is a neural network learning, in full, on one weight.
Gradients use +=, never =. If a value feeds two places, its gradient is the sum of both paths. That's the multivariate chain rule — forget it and your network silently learns the wrong thing.
Forward → measure → backward → nudge. Over and over. That's the whole algorithm.
This is a genuine 2→8→1 network running in your browser — no fakes. Click the board to place dots (left-click = blue, right-click = red), then press Train and watch the decision boundary form as the loss falls.
Try an XOR pattern (dots in opposite corners share a colour). Watch the boundary bend — that's the hidden layer earning its keep. A single neuron could only draw one straight line.
A neuron. A layer. A loss. Backprop. The same loop that taught one weight to climb 3 → 5 is — at enormous scale — what lets an LLM predict the next word. You now hold the universal primitive.