openalicelabs / academy
COURSE ARCH-01 · LESSON 01 · 06 · TOPIC STATE-SPACE MODELS · EST. READ ~13 MIN
OPENALICE LABORATORIES · EDUCATION PATH · ARCHITECTURE 01 · 06

A model that reads in one pass.

Attention re-reads the entire past for every new token — flawless memory, but a bill that grows quadratically. A state-space model keeps one running summary instead and updates it token by token: constant memory, linear time. Mamba made that running summary selective — able to choose what to remember and what to forget — and suddenly recurrence was competitive again.

FIG.00 — ONE RUNNING STATE
FIG.0A — THE RECURRENCE · hₜ = Ā·hₜ₋₁ + B̄·xₜ · yₜ = C·hₜ · one fixed-size state carries the whole past

Tokens arrive one at a time. Each one nudges a single hidden state — a fixed-size vector — then is thrown away. The state alone carries everything the model remembers of the past. No growing cache, no looking back.

INPUT: a sequence of token vectors x₁ … x_L
CARRIES THE PAST IN: one fixed-size hidden state h
COMPUTE: O(L), linear in sequence length
INFERENCE MEMORY: constant, no KV-cache that grows
MAMBA'S TRICK: make Ā, B̄, C depend on the input (selection)
01 / 08
Why bother · attention's quadratic bill

Attention re-reads everything, every step.

To generate token L, attention compares it against all L−1 tokens before it, and it stores every one in a KV-cache that grows without bound. Summed over a whole sequence, that is O(L²) compute and O(L) memory. Drag the context length and watch the two curves diverge.

FIG.01 — COST vs SEQUENCE LENGTH · ATTENTION O(L²) vs SSM O(L)
Interactive plot: context length (L) from short to 128k on the x-axis; attention work ∝ L² and SSM work ∝ L on the y-axis; readout of how many times slower attention is.

attention(L) : compute ∝ L² · KV-cache ∝ L
ssm(L) : compute ∝ L · state size = const

At a few thousand tokens the gap is harmless. At 128k–1M tokens the KV-cache swamps GPU memory and the quadratic term dominates latency — this is exactly where long-context serving gets expensive.
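
To make the scaling concrete, here is a minimal sketch of the two cost curves in unit-less terms. The function names are hypothetical, and all constant factors (head count, model width, state size) are deliberately dropped, so only the growth rates are meaningful.

```python
def attention_cost(L):
    """Relative cost of processing L tokens with attention (constants dropped)."""
    compute = L * L      # every token is compared with its predecessors: ~L^2 in total
    cache = L            # one KV entry is kept per token
    return compute, cache


def ssm_cost(L, state_size=1):
    """Relative cost of processing L tokens with an SSM (constants dropped)."""
    compute = L          # one fixed-cost state update per token
    memory = state_size  # does not depend on L
    return compute, memory


for L in (1_000, 128_000, 1_000_000):
    attn_work, _ = attention_cost(L)
    ssm_work, _ = ssm_cost(L)
    print(f"L={L:>9,}  attention/ssm work ratio ~ {attn_work // ssm_work:,}x")
```

Under these assumptions the ratio is simply L itself, which is why the gap is negligible at short contexts and dominant at long ones.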

The whole pitch of state-space models is to replace that L² with a plain L, and to replace the growing cache with one fixed-size state. The question is what you give up; we'll come to that.

02 / 08
The core idea · a running summary

One index card instead of the whole pile.

An SSM keeps a single hidden state h. For each new token it does exactly three things: fade the old state a little, add in the new token, then read an output. That is the whole recurrence: three small matrices, repeated.

Ā · FORGET / CARRY

how much past survives

Multiplies the old state before the new token lands. Close to 1 keeps long memory; close to 0 forgets fast. This is the decay.

B̄ · WRITE

how the token enters

Projects the incoming token xₜ into the state. hₜ = Ā·hₜ₋₁ + B̄·xₜ — fade, then add. One step, every token.

C · READ

how the output comes out

Reads the current state into the layer's output: yₜ = C·hₜ. The state is private; C decides what's exposed.

FIG.02 — THE TWO FACES OF AN SSM · same math, two shapes

The same operation has two equivalent shapes. As a recurrence (left) it runs one step at a time — perfect for generation, constant memory. As a convolution (right) the whole sequence is processed in parallel — perfect for training on a GPU.
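
The equivalence of the two shapes can be checked numerically. The sketch below assumes a small fixed (non-selective) SSM with random matrices; the sizes and the names `run_recurrent` and `run_convolution` are illustrative choices, not anything from a particular library. The convolution kernel is K[k] = C·Āᵏ·B̄, which is what you get by unrolling the recurrence.

```python
import numpy as np

rng = np.random.default_rng(0)
N, L = 4, 16                                  # state size and sequence length (arbitrary)
A_bar = np.diag(rng.uniform(0.5, 0.95, N))    # fixed decay, same for every token
B_bar = rng.normal(size=(N, 1))               # fixed write projection
C = rng.normal(size=(1, N))                   # fixed read projection
x = rng.normal(size=L)                        # a scalar input sequence


def run_recurrent(x):
    """One step at a time: constant memory, sequential."""
    h = np.zeros((N, 1))
    ys = []
    for x_t in x:
        h = A_bar @ h + B_bar * x_t           # fade, then add
        ys.append((C @ h).item())             # read out
    return np.array(ys)


def run_convolution(x):
    """Whole sequence at once: build the kernel, then convolve."""
    K = np.array([(C @ np.linalg.matrix_power(A_bar, k) @ B_bar).item()
                  for k in range(len(x))])
    return np.convolve(x, K)[:len(x)]         # causal part of the full convolution


y_rec = run_recurrent(x)
y_conv = run_convolution(x)
print(np.allclose(y_rec, y_conv))
```

The kernel can only be precomputed because Ā, B̄ and C are the same at every step, which is exactly the property that selection gives up.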

Classical SSMs (S4) exploited this duality, but their matrices were fixed — the same fade and write for every token, regardless of content. That made them fast but a bit dumb: they couldn't choose to pay attention to one token over another. Mamba fixes precisely that.

// the recurrence, in full
hₜ = Ā·hₜ₋₁ + B̄·xₜ   // fade old, add new
yₜ = C·hₜ            // read out
03 / 08
Watch it run · the selective scan

Walk a sequence through the state.

Here is a real 8-dimensional SSM state. Press Step → to feed the next token: the bars are the hidden state, fading by Ā and gaining B̄·xₜ. Flip the selective gate on to give Mamba its superpower — content-dependent forgetting.

FIG.03 — LIVE RECURRENT SCAN · 8-D HIDDEN STATE
HIDDEN STATE h · 8 channels · height = magnitude
// one scan step, per token xₜ
1. fade   h ← Ā ⊙ h        // decay old memory
2. write  h ← h + B̄·xₜ     // add the new token
3. read   yₜ = C·h         // emit output
selective: Ā, B̄ depend on xₜ

With the gate off, every token fades the state by the same amount — a plain S4-style SSM. With it on, an [important] token writes hard and barely decays, while filler tokens fade fast. That input-dependent choice is selection — the heart of Mamba.
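
The fade / write / read loop can be sketched in a few lines. This is a toy under stated assumptions, not the figure's actual code: the per-channel decay rates, the all-ones write vector, the averaging read vector, and the step sizes are all hypothetical. The gate is modelled with a per-token step size, where a constant step size gives the gate-off behaviour and a varying one gives selection. Feeding a single key token followed by zeros isolates how much of that token's trace survives.

```python
import numpy as np


def scan(tokens, deltas):
    """Fade / write / read over scalar tokens on an 8-channel state.

    tokens: scalar inputs x_t
    deltas: per-token step sizes (constant = gate off, varying = selective)
    """
    n = 8                                   # state channels
    a = -np.linspace(0.5, 2.0, n)           # hypothetical negative decay rates
    B = np.ones(n)                          # hypothetical write vector
    C = np.ones(n) / n                      # hypothetical read vector (channel mean)
    h = np.zeros(n)
    ys = []
    for x_t, d in zip(tokens, deltas):
        A_bar = np.exp(d * a)               # decay per step, between 0 and 1
        B_bar = d * B                       # write strength grows with the step size
        h = A_bar * h                       # 1. fade
        h = h + B_bar * x_t                 # 2. write
        ys.append(C @ h)                    # 3. read
    return np.array(ys), h


tokens = np.array([1.0, 0.0, 0.0, 0.0, 0.0, 0.0])        # key token, then empty filler
gate_off = np.full(6, 0.5)                                # same step size for every token
gate_on = np.array([1.0, 0.05, 0.05, 0.05, 0.05, 0.05])   # big step on the key token only

_, h_off = scan(tokens, gate_off)
_, h_on = scan(tokens, gate_on)
print("key-token trace, gate off:", h_off.round(3))
print("key-token trace, gate on: ", h_on.round(3))
```

With the gate on, the key token is written more strongly and the filler steps barely decay it, so its trace in the final state is larger in every channel.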

04 / 08
Mamba's breakthrough · making the matrices look

Let the input steer the memory.

Classical SSMs use the same Ā, B̄, C for every token. Mamba makes them functions of the current token — so the model can decide, per token, how much to write and how long to keep it. Drag the "importance" of a token and watch its trace in the state.

WHY SELECTION MATTERS

A fixed SSM treats "the" and a person's name identically — it has no way to filter. Selection lets the gate Ā snap toward 1 (hold this) or toward 0 (drop this) based on content, which is what makes Mamba good at language. The cost: the convolution view is lost, so Mamba uses a hardware-aware parallel scan to stay fast on a GPU.

The decay Ā is derived from an input-dependent step size Δ: a big Δ means "focus here, overwrite the state"; a tiny Δ means "ignore, let the past coast through." Selection is just Δ, B, C becoming data-dependent.
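
A short worked example shows how the step size sets both the decay and the write strength. It assumes a single scalar channel, a common discretization (Ā = exp(Δ·a) with a < 0, and B̄ = Δ·b), and hypothetical values a = −1 and b = 1. The half-life is the number of tokens for the old state to halve if Δ stayed at that value.

```python
import math


def discretize(delta, a=-1.0, b=1.0):
    """Scalar channel with continuous decay rate a < 0 and write weight b (both assumed)."""
    a_bar = math.exp(delta * a)               # decay applied to the old state per step
    b_bar = delta * b                         # strength with which the token is written
    half_life = math.log(2) / (delta * -a)    # solves a_bar ** n == 0.5 for n
    return a_bar, b_bar, half_life


for delta in (0.01, 0.1, 1.0, 4.0):
    a_bar, b_bar, half_life = discretize(delta)
    print(f"delta={delta:<5} decay={a_bar:.3f}  write={b_bar:.2f}  "
          f"half-life~{half_life:.1f} tokens")
```

A tiny Δ leaves the decay near 1 and the write near 0, so the past coasts through; a large Δ pushes the decay toward 0 and writes the current token hard.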

FIG.04 — ONE TOKEN'S MEMORY TRACE · slide importance
Slider: token importance, from FILLER to KEY TOKEN. Readouts: write strength (Δ), decay per step (Ā), half-life (tokens), verdict.
05 / 08
The honest catch · a fixed state can't hold everything

The card is only so big.

A fixed-size state is the whole advantage — and the whole problem. You cannot losslessly compress an unbounded past into bounded memory. Ask an SSM "what was the exact phone number 70 tokens ago?" and it may have already overwritten it. Push the sequence longer and watch recall fall off the card.

FIG.05 — NEEDLE-IN-A-HAYSTACK · FIXED STATE vs ATTENTION
SSM · ONE FIXED STATE (slots = capacity)

The needle was written early. As newer tokens stream in they overwrite old slots — past capacity, the needle is lost.

ATTENTION · KV-CACHE (every token kept)

Every token is kept verbatim, so the needle is always retrievable — at the cost of a cache that grows forever.

Slider: sequence length, from 8 to 40 tokens.
STATE CAPACITY (slots): 8
SSM · NEEDLE RECALLED?: live readout
ATTENTION · NEEDLE RECALLED?: always ✓

This is a hard limit, not a training artifact. No amount of data fixes the physics: bounded state means lossy memory. Pure attention pays in compute to keep perfect recall; pure SSM saves the compute but blurs the far past. The next section is how 2026 production models get both.
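
The needle experiment can be mimicked with a deliberately crude cartoon: model the SSM state as a fixed number of slots where the oldest entry is evicted, and the KV-cache as a list that keeps everything. This is an assumption made for illustration only. A real SSM state blurs old information gradually rather than evicting whole tokens, and the function names here are hypothetical.

```python
from collections import deque


def ssm_recall(seq_len, capacity=8):
    """Fixed-size state: once full, each new token pushes out the oldest one."""
    state = deque(maxlen=capacity)
    for t in range(seq_len):
        state.append("needle" if t == 0 else f"tok{t}")
    return "needle" in state


def attention_recall(seq_len):
    """Growing cache: every token is kept, so memory scales with seq_len."""
    cache = ["needle" if t == 0 else f"tok{t}" for t in range(seq_len)]
    return "needle" in cache


for seq_len in (8, 9, 40):
    print(seq_len, ssm_recall(seq_len), attention_recall(seq_len))
```

With 8 slots the needle survives a sequence of 8 tokens and is gone at 9, while the cache version always finds it and pays for that with memory proportional to the sequence length.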

06 / 08
What actually ships · interleave, don't choose

A few attention checkpoints in a sea of Mamba.

The 2024–26 answer is neither/both: build a stack of mostly cheap Mamba layers and sprinkle in a handful of full-attention layers (~7–10%) as occasional perfect-recall checkpoints. Drag the attention budget and watch the schedule — and the tradeoff — change.

FIG.06 — HYBRID LAYER SCHEDULE · A = attention · M = mamba
Slider: attention fraction, from PURE MAMBA to PURE ATTN. Readouts: KV-cache (relative to pure attention), exact recall, zone.

Model · year | Cheap mixer | Attention | Attn fraction | Context | Speedup claim
Jamba · 2024 | Mamba-1 | 1 per 8 layers | ~12.5% | 256k | ~3× vs Mixtral 8×7B (long ctx)
Mamba-2-Hybrid · 2024 | Mamba-2 | 4 of 28 mixers | ~8% | 32k | up to ~8× decode
Nemotron Nano 2 · 2025 | Mamba-2 | 6 of 62 layers | ~9.7% | 128k | ~6.3× vs Qwen3-8B
Qwen3-Next · 2026 | Gated DeltaNet | 1 per 4 blocks | ~25% | 256k+ | large long-ctx cache speedup
Nemotron 3 Super · 2026 | Mamba-2 + MoE | dispersed | small | 1M | 2.2–7.5× vs GPT-OSS-120B

DISPERSE, DON'T CLUSTER

spread the A layers

Evenly spacing the few attention layers beats stacking them. Nemotron Nano 2 explicitly disperses its 6 attention layers across 62.

NOT FIRST, NOT LAST

middle layers help most

Early and late attention layers contribute less than middle ones — keep a recall checkpoint reachable from any query position.

STACKS WITH MoE

cheap on two axes

Hybrid sequence-mixing + sparse mixture-of-experts FFNs compose: linear-ish sequence cost and sublinear parameter cost at once.
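
The "disperse, not first, not last" guidance can be sketched as a tiny schedule generator. This is one possible even-spacing rule, assumed for illustration; it is not the published layout of any model in the table, and `hybrid_schedule` is a hypothetical name. The relative KV-cache figure assumes every attention layer stores the same amount per token.

```python
def hybrid_schedule(n_layers, n_attn):
    """Place n_attn attention layers ('A') evenly among Mamba layers ('M').

    Each attention layer sits at the centre of one of n_attn equal segments,
    which keeps it away from the first and last positions when n_attn is small.
    """
    layers = ["M"] * n_layers
    for i in range(n_attn):
        pos = int((i + 0.5) * n_layers / n_attn)   # centre of the i-th segment
        layers[pos] = "A"
    return "".join(layers)


schedule = hybrid_schedule(62, 6)
attn_fraction = schedule.count("A") / len(schedule)
print(schedule)
print(f"attention fraction: {attn_fraction:.1%}")
print(f"KV-cache relative to pure attention: {attn_fraction:.1%}")
```

With 6 attention layers out of 62, the fraction works out to roughly 9.7%, and the KV-cache shrinks by the same factor under the equal-size assumption.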

07 / 08
The cheap-mixer family · and the honest caveats

Mamba has cousins — and limits.

The "cheap, constant-state" slot has competing fillers. They differ in how the running state is updated — and the choice changes the hybrid's character.

Mixer | Lineage | State update | Strength | Used by
Mamba-2 | selective SSM | scalar gated decay + write | simple, tensor-core friendly | Nemotron-H / Nano 2
S4 | structured SSM | fixed Ā, B̄, C (no selection) | fast convolution, long-range | early SSMs, audio
Gated DeltaNet | linear attention | gated decay + delta-rule rewrite | surgical memory updates | Qwen3-Next / 3.5

GATED DELTANET · GATE + DELTA

Mamba-2's gate can only fade memory uniformly. The delta rule adds a second move: surgically overwrite one key's value. The gate decides how much of the old state to forget; the delta term decides what specifically to rewrite — which is exactly what associative recall needs.

// gated delta-rule update of state S
Sₜ = αₜ·Sₜ₋₁ − βₜ·(Sₜ₋₁kₜ − vₜ)kₜᵀ
// αₜ·Sₜ₋₁ : gated decay
// βₜ·(Sₜ₋₁kₜ − vₜ)kₜᵀ : delta correction
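
A small numerical check makes the "surgical overwrite" concrete. The sketch below implements the update with the delta correction subtracted from the decayed state, using random matrices and hypothetical dimensions. With no decay (α = 1), a full-strength correction (β = 1) and a unit-length key, reading the updated state with that key returns exactly the new value, while a read with an orthogonal key is unchanged.

```python
import numpy as np


def gated_delta_step(S, k, v, alpha, beta):
    """S_t = alpha * S_{t-1} - beta * (S_{t-1} k - v) k^T"""
    error = S @ k - v                        # what the state currently returns for k, minus the target
    return alpha * S - beta * np.outer(error, k)


rng = np.random.default_rng(0)
d_k, d_v = 4, 3                              # key and value sizes (arbitrary)
S = rng.normal(size=(d_v, d_k))              # existing associative state
k = rng.normal(size=d_k)
k /= np.linalg.norm(k)                       # unit-length key
v = rng.normal(size=d_v)                     # value to store under k

S_new = gated_delta_step(S, k, v, alpha=1.0, beta=1.0)

# Build a second key orthogonal to k to check that other associations survive.
k_other = rng.normal(size=d_k)
k_other -= (k_other @ k) * k
k_other /= np.linalg.norm(k_other)

print(np.allclose(S_new @ k, v))                       # the key now maps to v
print(np.allclose(S_new @ k_other, S @ k_other))       # the orthogonal key is untouched
```

Setting α below 1 adds the uniform fade on top, so the gate and the delta term remain two separate controls.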
THE HONEST CAVEATS

The optimal attention ratio and placement are unsolved — "~7–8%, dispersed" is a robust heuristic, not a derived law (Qwen3-Next sits near 25%). The recall ceiling is reduced, not removed. And serving is harder: two cache regimes (a growing KV-cache and a recurrent state) complicate batching, quantization, and speculative decoding.

Vendor numbers, read with care: the 3–8× speedups come from the labs shipping the models, on settings they chose. They're corroborated across NVIDIA, AI21, and Alibaba — so directionally trustworthy — but treat exact multipliers as optimistic.

01 · 06 — you made it

You ran a state-space model.

Attention's quadratic bill. The recurrence — fade, write, read. The selective gate that lets the input steer memory. The fixed-state recall limit you can't train away. And the hybrid answer that ships: a few attention checkpoints over a sea of cheap Mamba. You now hold the model's other engine.
