openalicelabs / academy
COURSE NN-101 · LESSON 01 / 08 · EST. READ ~9 MIN · v0.1
OPENALICE LABORATORIES · EDUCATION PATH · RUNG 01

A network
is a function
that learns.

You don't write the rules. You show it examples — pairs of (input, correct answer) — and it tunes its own internal numbers until it's right. That tuning is learning. Finish this page and you will understand backpropagation.

FIG.00 — RECURSIVE MOTIF
FIG.0A — RECURSIVE BINARY TREE · SELF-SIMILAR GRADIENT FLOW · THE CHAIN RULE, DRAWN

A tree that branches into smaller copies of itself. Backpropagation has the same shape — the chain rule applied recursively, each gradient a scaled copy of the one after it, flowing back from the root to every leaf.

FIG.01 — A FEED-FORWARD NETWORK · SIGNAL → SOLID ACCENT · ERROR ← DASHED INK
INPUT: x = (x₁, x₂), a dot's coordinates
OUTPUT: ŷ ∈ (0,1), red or blue?
LEARNABLE: weights W · biases b
ENGINE: backpropagation + gradient descent
01 / 08
The atom · the perceptron

One neuron: multiply, add, bend.

A neuron does exactly three things to its inputs. Drag the weights below and watch the output move in real time — this is the whole forward mechanic.

FIG.02 — LIVE SINGLE NEURON
output = σ( w₁·x₁ + w₂·x₂ + b )

σ(z) = 1 / (1 + e⁻ᶻ)

Weights rotate the boundary. The bias shifts it. Without a bias, the dividing line is pinned through the origin (0,0): feed in x = 0 and the weighted sum is 0 whatever the weights, because "no matter how you multiply zeros, they stay zero," so the output is stuck at σ(0) = 0.5. Add a bias and the line floats free, exactly like y = m·x + b needs both a slope and an intercept.
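Multiply, add, bend fits in a few lines of Python. This is a minimal sketch, not the page's own code: the names `sigmoid` and `neuron` and the sample weights are assumptions chosen for illustration.

```python
import math

def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(x, w, b):
    """Multiply each input by its weight, add the bias, bend with sigmoid."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(z)

# Hypothetical weights, picked only for illustration.
w = [3.0, -7.0]

# At the origin the weights have nothing to multiply, so only the bias matters.
at_origin_no_bias = neuron([0.0, 0.0], w, b=0.0)    # sigmoid(0) = 0.5
at_origin_with_bias = neuron([0.0, 0.0], w, b=2.0)  # sigmoid(2), above 0.5
print(at_origin_no_bias, at_origin_with_bias)
```

Whatever weights you pick, the first call returns 0.5. Only the bias can move the output at the origin.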

02 / 08
Stacking neurons

A layer is just many neurons at once.

One neuron draws one line. Stack them into layers and the boundary can bend. The hidden layer is the entire reason deep networks exist.

Each layer takes the numbers from the layer before, and every neuron computes its own weighted sum. Written as a matrix, the whole layer collapses into one multiply plus one add — which is why GPUs matter: a matrix multiply is the one thing hardware does blisteringly fast.

z = W·x + b
a = σ(z)  ← element-wise
INPUT · 2 → HIDDEN · 5 → OUTPUT · 1
FIG.03 — LAYERED TOPOLOGY · EDGE THICKNESS = |WEIGHT|
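The two lines z = W·x + b and a = σ(z) translate almost directly into code. The sketch below uses plain Python lists so it runs without any library; real code would normally hand the multiply to a matrix library. The `layer` helper and every number in `W_hidden` and `b_hidden` are made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(W, x, b):
    """Compute a = sigmoid(W·x + b) for one layer.

    W has one row per neuron, x is the input vector, b has one bias per neuron.
    """
    z = [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
         for row, b_i in zip(W, b)]
    return [sigmoid(z_i) for z_i in z]   # element-wise

# A 2-input, 5-neuron hidden layer; all numbers are illustrative assumptions.
W_hidden = [[ 0.5, -0.2],
            [ 0.1,  0.4],
            [-0.3,  0.8],
            [ 0.7,  0.7],
            [-0.6,  0.2]]
b_hidden = [0.0, 0.1, -0.1, 0.2, 0.0]

a_hidden = layer(W_hidden, [1.0, 2.0], b_hidden)
print(len(a_hidden))   # one activation per neuron
```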
03 / 08
Compute the answer

The forward pass, step by step.

Feed numbers in on the left, apply the neuron rule layer by layer, read the answer off the right. Press Step → and watch the signal propagate.

FIG.04 — FORWARD PROPAGATION
for layer L:  aᴸ = σ( Wᴸ·aᴸ⁻¹ + bᴸ )
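The rule above is just a loop. Here is a hedged sketch that records the activations at every stage, much as the stepper does. The `forward` function and the all-zero parameters are assumptions, chosen so the result can be checked by hand.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(layers, x):
    """Apply a = sigmoid(W·a_prev + b) layer by layer; return every stage."""
    activations = [x]                      # stage 0: the raw input
    for W, b in layers:
        a_prev = activations[-1]
        a = [sigmoid(sum(w * v for w, v in zip(row, a_prev)) + b_i)
             for row, b_i in zip(W, b)]
        activations.append(a)
    return activations

# Hypothetical 2 -> 5 -> 1 network with all-zero parameters, so every
# neuron outputs sigmoid(0) = 0.5 and the answer is easy to verify.
layers = [
    ([[0.0, 0.0]] * 5, [0.0] * 5),   # hidden layer: 5 neurons, 2 inputs each
    ([[0.0] * 5],      [0.0]),       # output layer: 1 neuron, 5 inputs
]
stages = forward(layers, [0.3, -1.2])
y_hat = stages[-1][0]
print(y_hat)   # 0.5
```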
04 / 08
What "wrong" means as a number

The loss: one number for how wrong we are.

To improve, the network needs to measure its error as a single number. Loss = 0 is perfect; bigger loss = more wrong. The whole goal of training is to make it small.

PREDICT

ŷ = network(x)

The forward pass gives a guess. Maybe right, maybe wildly off.

COMPARE

error = ŷ − y

Subtract the correct answer. Square it so over- and under-shoot both count.

SUM

C = ½ Σ (ŷ − y)²

Add across all outputs. Big mistakes hurt more than small ones.
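Predict, compare, sum is a one-line function. A minimal sketch, with `cost` as an assumed name and the sample predictions invented for the example:

```python
def cost(y_hat, y):
    """Half the sum of squared errors across all outputs."""
    return 0.5 * sum((p - t) ** 2 for p, t in zip(y_hat, y))

perfect = cost([1.0, 0.0], [1.0, 0.0])   # 0.0: nothing to fix
small   = cost([0.5], [1.0])             # 0.5 * 0.25 = 0.125
large   = cost([0.0], [1.0])             # 0.5 * 1.00 = 0.5
print(perfect, small, large)
```

Doubling the error from 0.5 to 1.0 quadruples the loss. That is the squaring at work: big mistakes hurt more than small ones.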

FIG.05 — TRAINING CURVES · ERROR ↓ (DASHED INK) · ACCURACY ↑ (SOLID ACCENT)
05 / 08
The heart · the engine

Backpropagation: blame, flowing backward.

Gradient descent needs the slope of the cost with respect to every learnable parameter: ∂C/∂w for each weight and ∂C/∂b for each bias. Backprop gets all of them in one backward pass, using the chain rule.

THE CHAIN RULE, IN ONE SENTENCE

If A affects B and B affects C, then A's effect on C is (A→B) × (B→C). Multiply the local slopes along the path.

Start at the output where you know the error. Walk backward. At each step multiply by the local slope, handing each neuron its share of the blame. Forward computes the answer; backward computes who's responsible, and by how much.

FIG.06 — A WORKED EXAMPLE · ONE WEIGHT
// x=2, w=3, target y=10, loss = e²
p = w·x = 3·2 = 6            prediction
e = p − y = 6 − 10 = −4      error
L = e² = 16                  loss

// backward: multiply local slopes
∂L/∂e = 2·e = −8
∂e/∂p = 1
∂p/∂w = x = 2
∂L/∂w = −8·1·2 = −16

// step downhill (η = 0.01)
w ← w − η·(∂L/∂w)
w ← 3 − 0.01·(−16) = 3.16 → … → 5

The gradient is negative, so the step raises w from 3 toward 5 — the value that makes p exactly right.
That is a neural network learning, in full, on one weight.
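The worked example can be replayed in code. This sketch multiplies the same three local slopes, takes the first step, then keeps stepping. The function name `gradient` and the choice of 300 further steps are assumptions; the numbers x = 2, y = 10, w = 3 and η = 0.01 come from FIG.06.

```python
def gradient(w, x, y):
    """dL/dw for L = (w*x - y)**2, built from the three local slopes."""
    p = w * x              # prediction
    e = p - y              # error
    dL_de = 2 * e          # slope of e**2
    de_dp = 1              # slope of (p - y) with respect to p
    dp_dw = x              # slope of w*x with respect to w
    return dL_de * de_dp * dp_dw

x, y, eta = 2.0, 10.0, 0.01
w = 3.0

first_grad = gradient(w, x, y)        # -16.0, matching FIG.06
w_after_one = w - eta * first_grad    # about 3.16

# Keep stepping; w creeps toward 5, where w*x equals the target.
w = w_after_one
for _ in range(300):
    w -= eta * gradient(w, x, y)
print(round(w, 4))   # 5.0
```

Each step shrinks the gap between w and 5 by the same factor (0.92 with these numbers), which is why the approach is fast at first and slow near the end.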

MICROGRAD · THE LOCAL DERIVATIVES MADE LITERAL
# × : route each input's gradient, scaled by the OTHER
self.grad  += other.data * out.grad
other.grad += self.data  * out.grad

# + : copy the incoming gradient straight through
self.grad  += out.grad
other.grad += out.grad
THE #1 BACKPROP BUG

Gradients use +=, never =. If a value feeds two places, its gradient is the sum of both paths. That's the multivariate chain rule — forget it and your network silently learns the wrong thing.
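To see why += matters, here is a stripped-down scalar autograd in the spirit of micrograd. It is a sketch, not micrograd's actual source: the attribute names `_parents` and `_backward` and the traversal helper are assumptions, and only + and × are supported.

```python
class Value:
    """A number that remembers how it was made, so gradients can flow back."""

    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents
        self._backward = lambda: None   # leaves have nothing to propagate

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad       # + copies the gradient through
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad   # scaled by the OTHER input
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # Order the graph so each node is handled after everything it feeds.
        order, seen = [], set()
        def visit(v):
            if v not in seen:
                seen.add(v)
                for p in v._parents:
                    visit(p)
                order.append(v)
        visit(self)
        self.grad = 1.0
        for v in reversed(order):
            v._backward()

a = Value(3.0)
y = a * a + a        # a feeds three places: both sides of × and one side of +
y.backward()
print(a.grad)        # 7.0, since dy/da = 2a + 1 at a = 3
```

Swap any += for = and a later path overwrites an earlier one, so `a.grad` comes out smaller than 7.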

06 / 08
Putting it together

The entire training loop.

Forward → measure → backward → nudge. Over and over. That's the whole algorithm.

FIG.07 — THE LOOP · ONE PASS OVER THE DATA = ONE "EPOCH"
initialize all weights & biases to small random numbers
repeat many times (each pass = an "epoch"):
    for each mini-batch of examples:
        ŷ    = network.forward(inputs)          §03
        loss = cost(ŷ, correct_answer)          §04
        grad = backprop(loss)                   §05: ∂loss/∂each param
        for each param p:
            p = p − η · grad[p]                 step downhill
    log → loss should trend DOWN, accuracy UP
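Here is the loop as runnable Python, shrunk to a single sigmoid neuron so the backward pass fits in three lines. Treat it as a sketch under stated assumptions: the toy dataset, the fixed seed, batch size 2, averaging gradients over the batch, and 2000 epochs are all choices made for this example, not the page's settings.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(params, x):
    w1, w2, b = params
    return sigmoid(w1 * x[0] + w2 * x[1] + b)

def total_loss(params, data):
    return 0.5 * sum((predict(params, x) - y) ** 2 for x, y in data)

def batch_gradient(params, batch):
    """Average d(loss)/d(param) over a mini-batch via the chain rule."""
    grad = [0.0, 0.0, 0.0]
    for x, y in batch:
        y_hat = predict(params, x)
        dz = (y_hat - y) * y_hat * (1.0 - y_hat)   # dC/dy_hat times dy_hat/dz
        grad[0] += dz * x[0]                       # dz/dw1 = x1
        grad[1] += dz * x[1]                       # dz/dw2 = x2
        grad[2] += dz                              # dz/db  = 1
    return [g / len(batch) for g in grad]

# Toy data (an assumption): the label simply equals the first coordinate.
data = [((0.0, 0.0), 0.0), ((0.0, 1.0), 0.0),
        ((1.0, 0.0), 1.0), ((1.0, 1.0), 1.0)]

rng = random.Random(0)
params = [rng.uniform(-0.1, 0.1) for _ in range(3)]   # small random start
eta, batch_size = 0.5, 2
initial_loss = total_loss(params, data)

for epoch in range(2000):                              # each pass = one epoch
    for i in range(0, len(data), batch_size):          # each mini-batch
        grad = batch_gradient(params, data[i:i + batch_size])
        params = [p - eta * g for p, g in zip(params, grad)]   # step downhill

final_loss = total_loss(params, data)
print(initial_loss, final_loss)   # the loss should have trended down
```

A real multi-layer network swaps `batch_gradient` for a full backward pass, but the outer structure stays the same.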
07 / 08
Build one yourself · live

Train a real classifier. Right here.

This is a genuine 2→8→1 network running in your browser — no fakes. Click the board to place dots (left-click = blue, right-click = red), then press Train and watch the decision boundary form as the loss falls.

FIG.08 — LIVE DOT-CLASSIFIER · 2→8→1 · SIGMOID · SGD epoch 0
READOUTS · EPOCH · LOSS · ACCURACY · DOTS · LEARNING RATE η (0.50)
LEGEND · class A · class B · boundary

Try an XOR pattern (dots in opposite corners share a colour). Watch the boundary bend — that's the hidden layer earning its keep. A single neuron could only draw one straight line.
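You can check the XOR claim without training anything. The sketch below wires up a tiny 2 → 2 → 1 network by hand. It is smaller than the 2→8→1 network on this page, and its weights are hand-picked assumptions rather than learned values. One hidden neuron acts like OR, the other like NAND, and the output fires only when both do.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def xor_net(x1, x2):
    """A 2 -> 2 -> 1 network with hand-set weights (not trained)."""
    h_or   = sigmoid(20 * x1 + 20 * x2 - 10)      # fires if either input is on
    h_nand = sigmoid(-20 * x1 - 20 * x2 + 30)     # fires unless both are on
    return sigmoid(20 * h_or + 20 * h_nand - 30)  # fires only if both hidden fire

corners = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [round(xor_net(x1, x2)) for x1, x2 in corners]
print(labels)   # [0, 1, 1, 0]
```

Each hidden neuron still draws one straight line. The output neuron combines the two lines into a band, and that band is the bend a single neuron cannot make.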

08 / 08 — you made it

You just built
a brain.

A neuron. A layer. A loss. Backprop. The same loop that taught one weight to climb 3 → 5 is — at enormous scale — what lets an LLM predict the next word. You now hold the universal primitive.

01 · Neural network from scratch · a neuron, a layer, a loss, backprop · ✓ complete
02 · microGPT · a whole LLM in ~200 lines · same autograd, + attention · next
03 · LLM from scratch · a 10M-param GPT trained on a laptop · locked
04 · Transformers & attention · the architecture that made it all work · locked