OpenAlice Academy — 01 / Neural Network from Scratch

01 / 08

The atom · the perceptron

One neuron: multiply, add, bend.

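A neuron does exactly three things to its inputs: it multiplies each one by a weight, adds the results together with a bias, and bends the sum through an activation function. Drag the weights below and watch the output move in real time; this is the whole forward mechanic.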

FIG.02 — LIVE SINGLE NEURON

output = σ( w₁·x₁ + w₂·x₂ + b )

x₁ = 0.80, w₁ = 1.00, x₂ = 0.60, w₂ = −2.00, bias b = 0.50

σ(z) = 1 / (1 + e⁻ᶻ)

Weights rotate the boundary. The bias shifts it. Without a bias, the dividing line is pinned through the origin (0,0) — feed in x = 0 and "no matter how you multiply zeros, they stay zero." Add a bias and the line floats free, exactly like y = m·x + b needs both a slope and an intercept.
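The same multiply, add, bend can be sketched in a few lines of Python. This is a minimal illustration, not the code behind the live figure; the function names `sigmoid` and `neuron` are chosen here for the example, and the inputs are the slider values shown in FIG.02.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(x1, x2, w1, w2, b):
    """Multiply, add, bend: the three steps of one neuron."""
    z = w1 * x1 + w2 * x2 + b   # multiply and add
    return sigmoid(z)           # bend

# Slider values from FIG.02: z = 0.8*1.0 + 0.6*(-2.0) + 0.5 = 0.1
out = neuron(0.80, 0.60, 1.00, -2.00, 0.50)
print(round(out, 3))  # 0.525

# Without a bias, an all-zero input gives z = 0 whatever the weights are,
# so the output is stuck at sigmoid(0) = 0.5.
print(neuron(0.0, 0.0, 7.0, -3.0, 0.0))  # 0.5
```

The second call is the "zeros stay zero" case: with b = 0 the weights 7 and −3 have no say at the origin.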

02 / 08

Stacking neurons

A layer is just many neurons at once.

One neuron draws one line. Stack them into layers and the boundary can bend. The hidden layer is the entire reason deep networks exist.

Each layer takes the numbers from the layer before, and every neuron computes its own weighted sum. Written as a matrix, the whole layer collapses into one multiply plus one add, which is why GPUs matter: a matrix multiply is exactly the kind of operation they are built to run in parallel, very fast.

z = W·x + b
a = σ(z) ← element-wise
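As a rough sketch, the two lines above could look like this in plain Python, with W stored as a list of rows (one row of weights per neuron). The weights, the layer size of 3, and the name `layer_forward` are made up for illustration; a real implementation would more likely hand the multiply to a matrix library.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(W, x, b):
    """Compute a = sigmoid(W·x + b) for one layer.

    W is a list of rows, one row of weights per neuron.
    """
    z = [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
         for row, b_i in zip(W, b)]
    return [sigmoid(z_i) for z_i in z]   # element-wise

# Hypothetical weights for a layer of 3 neurons reading 2 inputs.
W = [[1.0, -2.0],
     [0.5,  0.5],
     [0.0,  0.0]]
b = [0.5, 0.0, 0.0]
x = [0.8, 0.6]

a = layer_forward(W, x, b)
print(len(a))   # 3: one activation per neuron
```

Each row of W is one neuron from the previous section; the layer just runs them all on the same input.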

INPUT · 2 → HIDDEN · 5 → OUTPUT · 1

FIG.03 — LAYERED TOPOLOGY · EDGE THICKNESS = |WEIGHT|

04 / 08

What "wrong" means as a number

The loss: one number for how wrong we are.

To improve, the network needs to measure its error as a single number. Loss = 0 is perfect; bigger loss = more wrong. The whole goal of training is to make it small.

PREDICT

ŷ = network(x)

The forward pass gives a guess. Maybe right, maybe wildly off.

COMPARE

error = ŷ − y

Subtract the correct answer. Square it so over- and under-shoot both count.

SUM

C = ½ Σ (ŷ − y)²

Add across all outputs. Big mistakes hurt more than small ones.
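The three steps fit in one small function. This is a sketch of the formula C = ½ Σ (ŷ − y)² as written above; the name `cost` and the sample numbers are illustrative only.

```python
def cost(y_hat, y):
    """Half the sum of squared errors: C = 1/2 * sum((y_hat - y)^2)."""
    return 0.5 * sum((p - t) ** 2 for p, t in zip(y_hat, y))

print(cost([1.0, 0.0], [1.0, 0.0]))   # 0.0, a perfect guess
print(cost([3.0], [1.0]))             # 2.0, overshoot by 2
print(cost([-1.0], [1.0]))            # 2.0, undershoot by 2 costs the same
print(cost([5.0], [1.0]))             # 8.0, twice the error, four times the loss
```

The last two lines show both effects of squaring: the sign of the error disappears, and doubling the error quadruples the loss.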

FIG.05 — TRAINING CURVES · ERROR ↓ (DASHED INK) · ACCURACY ↑ (SOLID ACCENT)

05 / 08

The heart · the engine

Backpropagation: blame, flowing backward.

Gradient descent needs the slope ∂C/∂w for every weight. Backprop gets all of them in one backward pass — using the chain rule.

THE CHAIN RULE, IN ONE SENTENCE

If A affects B and B affects C, then A's effect on C is (A→B) × (B→C). Multiply the local slopes along the path.

Start at the output where you know the error. Walk backward. At each step multiply by the local slope, handing each neuron its share of the blame. Forward computes the answer; backward computes who's responsible, and by how much.


FIG.06 — A WORKED EXAMPLE · ONE WEIGHT

// x = 2, w = 3, target y = 10, loss = e²
p = w·x = 3·2 = 6          prediction
e = p − y = 6 − 10 = −4    error
L = e² = 16                loss

// backward: multiply local slopes
∂L/∂e = 2·e = −8
∂e/∂p = 1
∂p/∂w = x = 2
∂L/∂w = −8·1·2 = −16

// step downhill (η = 0.01)
w ← w − η·(∂L/∂w)
w ← 3 − 0.01·(−16) = 3.16 → … → 5

The gradient is negative, so the step raises w from 3 toward 5 — the value that makes p exactly right.
That is a neural network learning, in full, on one weight.
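The worked example can be replayed in code to check the arithmetic and to watch the "→ … → 5" part happen. This is a sketch using the same x, w, y and η; the helper name `grad_w` and the choice of 200 steps are assumptions for the demo.

```python
x, y = 2.0, 10.0     # input and target from the worked example
w = 3.0              # starting weight
eta = 0.01           # learning rate

def grad_w(w):
    """dL/dw for L = (w*x - y)^2, built from the three local slopes."""
    p = w * x            # prediction
    e = p - y            # error
    dL_de = 2.0 * e      # slope of e^2
    de_dp = 1.0          # slope of (p - y) with respect to p
    dp_dw = x            # slope of w*x with respect to w
    return dL_de * de_dp * dp_dw

first_grad = grad_w(w)                 # -16.0, as in the worked example
w_after_one = w - eta * first_grad     # 3.16

for _ in range(200):                   # 200 steps is an arbitrary choice
    w = w - eta * grad_w(w)

print(first_grad, round(w_after_one, 2), round(w, 4))  # -16.0 3.16 5.0
```

Each step shrinks the gap to 5 by the same factor, so the weight closes in quickly at first and then creeps the rest of the way.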

MICROGRAD · THE LOCAL DERIVATIVES MADE LITERAL

# × : route each input's gradient, scaled by the OTHER
self.grad += other.data * out.grad
other.grad += self.data * out.grad

# + : copy the incoming gradient straight through
self.grad += out.grad
other.grad += out.grad

THE #1 BACKPROP BUG

Gradients use +=, never =. If a value feeds two places, its gradient is the sum of both paths. That's the multivariate chain rule — forget it and your network silently learns the wrong thing.
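To see why += matters, here is a stripped-down scalar autograd sketch in the same spirit as the snippet above. It is not micrograd's actual source: the class layout and the names `_parents` and `_backward` are choices made for this example, and only + and × are supported.

```python
class Value:
    """A scalar that remembers how it was computed (illustrative sketch)."""

    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents
        self._backward = lambda: None

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad            # copy straight through
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad   # scaled by the other input
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # Order the nodes so each one runs after everything that depends on it.
        order, seen = [], set()
        def visit(v):
            if v not in seen:
                seen.add(v)
                for p in v._parents:
                    visit(p)
                order.append(v)
        visit(self)
        self.grad = 1.0
        for v in reversed(order):
            v._backward()

a = Value(3.0)
b = Value(4.0)
c = a * b + a      # a feeds two places: the product and the sum
c.backward()
print(a.grad, b.grad)   # 5.0 3.0
```

Here ∂c/∂a = b + 1 = 5 because a reaches c along two paths. Swap the += lines for = and one path overwrites the other, so a.grad no longer comes out as 5.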

06 / 08

Putting it together

The entire training loop.

Forward → measure → backward → nudge. Over and over. That's the whole algorithm.

FIG.07 — THE LOOP · ONE PASS OVER THE DATA = ONE "EPOCH"

initialize all weights & biases to small random numbers
repeat many times (each pass = an "epoch"):
    for each mini-batch of examples:
        ŷ = network.forward(inputs)          §03
        loss = cost(ŷ, correct_answer)       §04
        grad = backprop(loss)                §05: ∂loss/∂each param
        for each param p:
            p = p − η · grad[p]              step downhill
    log → loss should trend DOWN, accuracy UP
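Here is one way the loop could look as runnable Python, shrunk to a single sigmoid neuron so the backward step fits on one line. Everything specific is an assumption for the sketch: the toy data (logical OR), the batch size of 2, η = 0.5, and 2000 epochs.

```python
import math
import random

random.seed(0)  # fixed seed so the run is repeatable

# Toy data (an assumption for this sketch): the logical OR of two inputs.
data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0),
        ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w, b, x):
    return sigmoid(w[0] * x[0] + w[1] * x[1] + b)

def total_loss(w, b):
    return sum(0.5 * (forward(w, b, x) - y) ** 2 for x, y in data)

# initialize all weights & biases to small random numbers
w = [random.uniform(-0.1, 0.1), random.uniform(-0.1, 0.1)]
b = random.uniform(-0.1, 0.1)
eta, batch_size = 0.5, 2

initial_loss = total_loss(w, b)
for epoch in range(2000):                      # each pass = one epoch
    random.shuffle(data)
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        grad_w, grad_b = [0.0, 0.0], 0.0
        for x, y in batch:
            y_hat = forward(w, b, x)                   # forward
            # backward: chain rule through the loss and the sigmoid
            dz = (y_hat - y) * y_hat * (1.0 - y_hat)
            grad_w[0] += dz * x[0]
            grad_w[1] += dz * x[1]
            grad_b += dz
        # step downhill
        w[0] -= eta * grad_w[0]
        w[1] -= eta * grad_w[1]
        b -= eta * grad_b
final_loss = total_loss(w, b)
print(initial_loss, final_loss)   # the second number should be much smaller
```

OR is separable by one straight line, so a single neuron is enough here; the shape of the loop stays the same when the model grows.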

07 / 08

Build one yourself · live

Train a real classifier. Right here.

This is a genuine 2→8→1 network running in your browser — no fakes. Click the board to place dots (left-click = blue, right-click = red), then press Train and watch the decision boundary form as the loss falls.

FIG.08 — LIVE DOT-CLASSIFIER · 2→8→1 · SIGMOID · SGD epoch 0

Readouts: epoch, loss, accuracy, dots · learning rate η = 0.50 · legend: class A, class B, boundary

Try an XOR pattern (dots in opposite corners share a colour). Watch the boundary bend — that's the hidden layer earning its keep. A single neuron could only draw one straight line.
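For readers who want the same experiment outside the browser, below is a minimal 2→8→1 sigmoid network trained with per-example SGD on the four XOR corners. It is a sketch under stated assumptions, not the code behind FIG.08: the weight range, the seed, η = 0.5, and 5000 epochs are all picked for the demo, and a different seed will give a different run.

```python
import math
import random

random.seed(1)  # fixed seed; a different seed gives a different run

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# XOR: opposite corners share a class.
data = [([0.0, 0.0], 0.0), ([1.0, 1.0], 0.0),
        ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0)]

H = 8  # hidden neurons
W1 = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(H)]
b1 = [random.uniform(-1, 1) for _ in range(H)]
W2 = [random.uniform(-1, 1) for _ in range(H)]
b2 = random.uniform(-1, 1)
eta = 0.5

def forward(x):
    """Return the hidden activations and the network output for one point."""
    h = [sigmoid(W1[j][0] * x[0] + W1[j][1] * x[1] + b1[j]) for j in range(H)]
    y_hat = sigmoid(sum(W2[j] * h[j] for j in range(H)) + b2)
    return h, y_hat

def total_loss():
    return sum(0.5 * (forward(x)[1] - y) ** 2 for x, y in data)

initial_loss = total_loss()
for epoch in range(5000):
    for x, y in data:
        h, y_hat = forward(x)
        d_out = (y_hat - y) * y_hat * (1.0 - y_hat)       # blame at the output
        for j in range(H):
            # blame handed back to hidden neuron j (uses W2[j] before its update)
            d_hid = d_out * W2[j] * h[j] * (1.0 - h[j])
            W2[j] -= eta * d_out * h[j]
            W1[j][0] -= eta * d_hid * x[0]
            W1[j][1] -= eta * d_hid * x[1]
            b1[j] -= eta * d_hid
        b2 -= eta * d_out
final_loss = total_loss()
print(initial_loss, final_loss)
```

Setting H to 1 is a quick way to feel the difference: with so little hidden capacity the loss tends to stall, because one straight line cannot split the XOR corners.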

08 / 08 — you made it

You just built a brain.

A neuron. A layer. A loss. Backprop. The same loop that taught one weight to climb 3 → 5 is — at enormous scale — what lets an LLM predict the next word. You now hold the universal primitive.

01 Neural network from scratch · a neuron, a layer, a loss, backprop · ✓ complete

02 microGPT · a whole LLM in ~200 lines · same autograd, + attention · next

03 LLM from scratch · a 10M-param GPT trained on a laptop · locked

04 Transformers & attention · the architecture that made it all work · locked


