OpenAlice Academy — 01 · 06 / State-Space Models (Mamba)

01 / 08

Why bother · attention's quadratic bill

Attention re-reads everything, every step.

To generate token L, attention compares it against all L−1 tokens before it, and it stores every one in a KV-cache that grows without bound. Each step therefore costs O(L), and generating the whole sequence costs O(L²) compute with O(L) memory. Drag the context length and watch the two curves diverge.

FIG.01 — COST vs SEQUENCE LENGTH · ATTENTION O(L²) vs SSM O(L)

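[Interactive figure: a context-length slider running from short to 128k tokens, with readouts for context length (L), attention work (∝ L²), SSM work (∝ L), and how many times slower attention is.]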

attention(L) : compute ∝ L² · KV-cache ∝ L
ssm(L)       : compute ∝ L  · state size = const

At a few thousand tokens the gap is harmless. At 128k–1M tokens the KV-cache swamps GPU memory and the quadratic term dominates latency — this is exactly where long-context serving gets expensive.
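To put rough numbers on those two curves, here is a minimal Python sketch. It counts comparisons and state updates only and ignores constant factors. The model shape used for the KV-cache (32 layers, 8 KV heads, head size 128, 16-bit values) is an assumption chosen for illustration, not a figure from any model in this lesson.

```python
def attention_comparisons(seq_len: int) -> int:
    """Token t looks back at its t-1 predecessors, so the total is L(L-1)/2."""
    return seq_len * (seq_len - 1) // 2


def ssm_updates(seq_len: int) -> int:
    """One fixed-size state update per token."""
    return seq_len


def kv_cache_bytes(seq_len: int, n_layers: int = 32, n_kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_value: int = 2) -> int:
    """Keys and values (the factor 2) for every layer, KV head and cached token.

    The default shape is hypothetical.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * seq_len


for seq_len in (1_024, 131_072):
    ratio = attention_comparisons(seq_len) / ssm_updates(seq_len)
    gib = kv_cache_bytes(seq_len) / 2**30
    print(f"L={seq_len:>7}: attention/SSM work = {ratio:,.1f}x, KV-cache = {gib:.3f} GiB")
```

Under these assumed dimensions the cache costs 128 KiB per token, which is 16 GiB at 128k tokens for a single sequence, while the SSM state stays the same size at every length.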

The whole pitch of state-space models is to replace that L² with a plain L, and replace the growing cache with one fixed-size state. The question is what you give up — we'll come to that.

02 / 08

The core idea · a running summary

One index card instead of the whole pile.

An SSM keeps a single hidden state h. For each new token it does exactly three things: fade the old state a little, add in the new token, then read an output. That is the whole recurrence: three small matrices, repeated.

Ā · FORGET / CARRY

how much past survives

Multiplies the old state before the new token lands. Close to 1 keeps long memory; close to 0 forgets fast. This is the decay.

B̄ · WRITE

how the token enters

Projects the incoming token xₜ into the state. hₜ = Ā·hₜ₋₁ + B̄·xₜ — fade, then add. One step, every token.

C · READ

how the output comes out

Reads the current state into the layer's output: yₜ = C·hₜ. The state is private; C decides what's exposed.

FIG.02 — THE TWO FACES OF AN SSM · same math, two shapes

The same operation has two equivalent shapes. As a recurrence (left) it runs one step at a time — perfect for generation, constant memory. As a convolution (right) the whole sequence is processed in parallel — perfect for training on a GPU.

Classical SSMs (S4) exploited this duality, but their matrices were fixed — the same fade and write for every token, regardless of content. That made them fast but a bit dumb: they couldn't choose to pay attention to one token over another. Mamba fixes precisely that.

// the recurrence, in full
hₜ = Ā·hₜ₋₁ + B̄·xₜ   // fade old, add new
yₜ = C·hₜ            // read out
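The two faces can be checked against each other in a few lines. The sketch below assumes a diagonal Ā (one decay per state channel) and a scalar input stream; the parameter values are made up. Unrolling the recurrence gives the kernel K[j] = Σᵢ Cᵢ·Āᵢʲ·B̄ᵢ, the response to a token that arrived j steps ago.

```python
def ssm_recurrent(xs, a_bar, b_bar, c):
    """Step-by-step form: h = a_bar*h + b_bar*x, then y = c . h."""
    h = [0.0] * len(a_bar)
    ys = []
    for x in xs:
        h = [a * hi + b * x for a, hi, b in zip(a_bar, h, b_bar)]
        ys.append(sum(ci * hi for ci, hi in zip(c, h)))
    return ys


def ssm_kernel(length, a_bar, b_bar, c):
    """K[j] = sum_i c_i * a_bar_i**j * b_bar_i."""
    return [sum(ci * (a ** j) * b for ci, a, b in zip(c, a_bar, b_bar))
            for j in range(length)]


def ssm_convolutional(xs, a_bar, b_bar, c):
    """Whole-sequence form: y_t = sum_j K[j] * x_{t-j} (a causal convolution)."""
    kernel = ssm_kernel(len(xs), a_bar, b_bar, c)
    return [sum(kernel[j] * xs[t - j] for j in range(t + 1))
            for t in range(len(xs))]


# Hypothetical 2-channel parameters and a short input sequence.
a_bar, b_bar, c = [0.9, 0.5], [1.0, 0.5], [1.0, 2.0]
xs = [1.0, 0.0, 2.0, -1.0]

rec = ssm_recurrent(xs, a_bar, b_bar, c)
conv = ssm_convolutional(xs, a_bar, b_bar, c)
assert all(abs(r - v) < 1e-9 for r, v in zip(rec, conv))
```

The convolution here is written as a plain double loop for readability, so it is not itself fast. The point is only that both forms produce the same outputs when Ā, B̄ and C are the same for every token.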

03 / 08

Watch it run · the selective scan

Walk a sequence through the state.

Here is a real 8-dimensional SSM state. Press Step → to feed the next token: the bars are the hidden state, fading by Ā and gaining B̄·xₜ. Flip the selective gate on to give Mamba its superpower — content-dependent forgetting.

FIG.03 — LIVE RECURRENT SCAN · 8-D HIDDEN STATE

[Interactive figure: bars for the 8 channels of the hidden state h (height = magnitude), a Step control, and a selective-gate toggle.]

// one scan step, per token xₜ
1. fade   h ← Ā ⊙ h        // decay old memory
2. write  h ← h + B̄·xₜ     // add the new token
3. read   yₜ = C·h         // emit output
selective: Ā, B̄ depend on xₜ

With the gate off, every token fades the state by the same amount: a plain S4-style SSM. With it on, an [important] token writes hard and its trace barely decays, while filler tokens write weakly and leave little behind. That input-dependent choice is selection, the heart of Mamba.
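The fade, write, read loop is short enough to run by hand. In this sketch the `is_key` flag and the gate values are hypothetical stand-ins: a real Mamba layer computes its gate from the token's content with learned projections. Filler tokens carry the value 0.0 so that the final state shows only what is left of the key token.

```python
N = 8  # state channels, matching the 8-D state described above


def gate_off(is_key):
    """Fixed SSM: the same (a_bar, b_bar) for every token."""
    return [0.7] * N, [0.3] * N


def gate_on(is_key):
    """Toy selective rule with made-up values, not learned parameters."""
    if is_key:
        return [0.1] * N, [0.9] * N    # clear room and write hard
    return [0.98] * N, [0.02] * N      # write almost nothing, keep the state


def scan(tokens, gate, c):
    """Run fade, write, read over a sequence of (value, is_key) tokens."""
    h = [0.0] * N
    outputs = []
    for value, is_key in tokens:
        a_bar, b_bar = gate(is_key)
        h = [a * hi for a, hi in zip(a_bar, h)]               # 1. fade
        h = [hi + b * value for hi, b in zip(h, b_bar)]       # 2. write
        outputs.append(sum(ci * hi for ci, hi in zip(c, h)))  # 3. read
    return outputs, h


tokens = [(1.0, True)] + [(0.0, False)] * 5   # one key token, then five fillers
c = [1.0] * N
_, h_off = scan(tokens, gate_off, c)
_, h_on = scan(tokens, gate_on, c)
print(f"key trace after 5 fillers: gate off {h_off[0]:.3f}, gate on {h_on[0]:.3f}")
```

With these toy numbers the key token's trace is about 0.05 per channel with the gate off and about 0.81 with it on. The scan is identical in both runs; only the per-token choice of Ā and B̄ differs.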

04 / 08

Mamba's breakthrough · making the matrices look

Let the input steer the memory.

Classical SSMs use the same Ā, B̄, C for every token. Mamba makes them functions of the current token — so the model can decide, per token, how much to write and how long to keep it. Drag the "importance" of a token and watch its trace in the state.

WHY SELECTION MATTERS

A fixed SSM treats "the" and a person's name identically: it has no way to filter. Selection lets the decay Ā move toward 1 (keep what the state already holds) or toward 0 (clear it and make room for the current token) based on content, which is what makes Mamba good at language. The cost: once the matrices vary per token, the convolution view is lost, so Mamba uses a hardware-aware parallel scan to stay fast on a GPU.

The decay Ā is derived from an input-dependent step size Δ: a big Δ means "focus here, overwrite the state"; a tiny Δ means "ignore, let the past coast through." Selection is just Δ, B, C becoming data-dependent.
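Here is a small sketch of how one step size Δ can drive both the decay and the write. It assumes a scalar continuous parameter A = −1, the discretization Ā = exp(Δ·A), and the simplified write B̄ = Δ·B. The `score` fed to softplus is a hypothetical stand-in for the learned, input-dependent projection.

```python
import math


def softplus(z: float) -> float:
    """Maps any real score to a positive step size."""
    return math.log1p(math.exp(z))


def discretize(delta: float, a: float = -1.0, b: float = 1.0):
    """Return (a_bar, b_bar) for step size delta, assuming a < 0."""
    return math.exp(delta * a), delta * b


def half_life(a_bar: float) -> float:
    """Tokens until the existing state halves, if a_bar stayed constant."""
    return math.log(0.5) / math.log(a_bar)


for label, score in (("filler", -4.0), ("neutral", 0.0), ("key token", 3.0)):
    delta = softplus(score)
    a_bar, b_bar = discretize(delta)
    print(f"{label:>9}: delta={delta:.3f}  a_bar={a_bar:.3f}  "
          f"b_bar={b_bar:.3f}  half-life={half_life(a_bar):.1f} tokens")
```

A tiny Δ gives Ā close to 1 and a near-zero write, so the past coasts through. A big Δ gives Ā close to 0 and a strong write, so the state is largely overwritten by the current token. Read the half-life as how fast the existing state fades while tokens with that Δ keep arriving.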

FIG.04 — ONE TOKEN'S MEMORY TRACE · slide importance

[Interactive figure: an importance slider from filler to key token, with readouts for write strength (Δ), decay per step (Ā), half-life in tokens, and a verdict.]

05 / 08

The honest catch · a fixed state can't hold everything

The card is only so big.

A fixed-size state is the whole advantage — and the whole problem. You cannot losslessly compress an unbounded past into bounded memory. Ask an SSM "what was the exact phone number 70 tokens ago?" and it may have already overwritten it. Push the sequence longer and watch recall fall off the card.

FIG.05 — NEEDLE-IN-A-HAYSTACK · FIXED STATE vs ATTENTION

SSM · ONE FIXED STATE (slots = capacity)

The ■ needle was written early. As newer tokens stream in they overwrite old slots — past capacity, the needle is lost.

ATTENTION · KV-CACHE (every token kept)

Every token is kept verbatim, so the needle is always retrievable — at the cost of a cache that grows forever.

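[Interactive figure: a sequence-length slider from 8 to 40 tokens against a fixed state capacity of 8 slots. Readouts: sequence length, whether the SSM recalls the needle, and whether attention recalls it (always ✓).]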
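The slot picture can be mimicked with a bounded queue. This is a cartoon under a loud assumption: a real SSM state is a lossy blend of everything it has seen, not a set of discrete slots, so actual recall degrades gradually rather than at a hard edge. The function names are hypothetical.

```python
from collections import deque


def ssm_recalls_needle(seq_len: int, capacity: int = 8) -> bool:
    """Toy fixed state: `capacity` slots, where each new token evicts the oldest."""
    state = deque(maxlen=capacity)
    state.append("needle")                  # the needle is written first
    for i in range(seq_len - 1):
        state.append(f"filler-{i}")         # later tokens stream in
    return "needle" in state


def attention_recalls_needle(seq_len: int) -> bool:
    """Toy KV-cache: every token is kept, so memory grows with seq_len."""
    cache = ["needle"] + [f"filler-{i}" for i in range(seq_len - 1)]
    return "needle" in cache


for seq_len in (8, 9, 40):
    print(seq_len, ssm_recalls_needle(seq_len), attention_recalls_needle(seq_len))
```

With 8 slots the needle survives a sequence of 8 tokens and is gone at 9, while the cache version always finds it and pays for that with a list that keeps growing.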

This is a hard limit, not a training artifact. No amount of data fixes the physics: bounded state means lossy memory. Pure attention pays in compute to keep perfect recall; pure SSM saves the compute but blurs the far past. The next section is how 2026 production models get both.

06 / 08

What actually ships · interleave, don't choose

A few attention checkpoints in a sea of Mamba.

The 2024–26 answer is neither/both: build a stack of mostly cheap Mamba layers and sprinkle in a handful of full-attention layers (~7–10%) as occasional perfect-recall checkpoints. Drag the attention budget and watch the schedule — and the tradeoff — change.

FIG.06 — HYBRID LAYER SCHEDULE · A = attention · M = mamba

PURE MAMBA PURE ATTN

ATTENTION FRACTION—

KV-CACHE (relative to pure attn)—

EXACT RECALL—

ZONE—

Model · year	Cheap mixer	Attention	Attn fraction	Context	Speedup claim
Jamba · 2024	Mamba-1	1 per 8 layers	~12.5%	256k	~3× vs Mixtral 8×7B (long ctx)
Mamba-2-Hybrid · 2024	Mamba-2	4 of 28 mixers	~14%	32k	up to ~8× decode
Nemotron Nano 2 · 2025	Mamba-2	6 of 62 layers	~9.7%	128k	~6.3× vs Qwen3-8B
Qwen3-Next · 2025	Gated DeltaNet	1 per 4 blocks	~25%	256k+	large long-ctx cache speedup
Nemotron 3 Super · 2026	Mamba-2 + MoE	dispersed	small	1M	2.2–7.5× vs GPT-OSS-120B

DISPERSE, DON'T CLUSTER

spread the A layers

Evenly spacing the few attention layers beats stacking them. Nemotron Nano 2 explicitly disperses its 6 attention layers across 62.

NOT FIRST, NOT LAST

middle layers help most

Early and late attention layers contribute less than middle ones — keep a recall checkpoint reachable from any query position.

STACKS WITH MoE

cheap on two axes

Hybrid sequence-mixing and sparse mixture-of-experts FFNs compose: sequence cost that stays close to linear, and per-token compute that grows sublinearly with total parameter count, at once.
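The first two rules of thumb above can be turned into a tiny schedule generator. The placement rule here (one attention layer at the centre of each equal-sized segment) is an assumption made for illustration; it is not the published layout of Nemotron, Jamba, or any other model in the table. The cache estimate assumes every attention layer stores a KV-cache of the same shape and Mamba layers store none.

```python
def hybrid_schedule(n_layers: int, n_attention: int) -> str:
    """Return a string of 'A' and 'M', with attention layers evenly dispersed."""
    layers = ["M"] * n_layers
    for i in range(n_attention):
        # Centre of the i-th of n_attention equal segments.
        position = (2 * i + 1) * n_layers // (2 * n_attention)
        layers[position] = "A"
    return "".join(layers)


def attention_fraction(schedule: str) -> float:
    """Share of layers that are attention; also the KV-cache relative to pure attention."""
    return schedule.count("A") / len(schedule)


schedule = hybrid_schedule(n_layers=62, n_attention=6)
print(schedule)
print(f"attention fraction = {attention_fraction(schedule):.1%}")
```

For 6 attention layers in 62 this gives a fraction of about 9.7%, with no attention layer in the first or last position. Setting `n_attention` to 0 or to `n_layers` reproduces the pure-Mamba and pure-attention ends of the range.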

07 / 08

The cheap-mixer family · and the honest caveats

Mamba has cousins — and limits.

The "cheap, constant-state" slot has competing fillers. They differ in how the running state is updated — and the choice changes the hybrid's character.

Mixer	Lineage	State update	Strength	Used by
Mamba-2	selective SSM	scalar gated decay + write	simple, tensor-core friendly	Nemotron-H / Nano 2
S4	structured SSM	fixed Ā, B̄, C (no selection)	fast convolution, long-range	early SSMs, audio
Gated DeltaNet	linear attention	gated decay + delta-rule rewrite	surgical memory updates	Qwen3-Next / 3.5

GATED DELTANET · GATE + DELTA

Mamba-2's gate can only fade memory uniformly. The delta rule adds a second move: surgically overwrite one key's value. The gate decides how much of the old state to forget; the delta term decides what specifically to rewrite — which is exactly what associative recall needs.

// gated delta-rule update of state S
Sₜ = αₜ·Sₜ₋₁ − βₜ·(αₜ·Sₜ₋₁kₜ − vₜ)kₜᵀ
//   αₜ·Sₜ₋₁                  : gated decay
//   βₜ·(αₜ·Sₜ₋₁kₜ − vₜ)kₜᵀ   : delta correction, applied to the decayed state
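A worked example makes the "surgical overwrite" concrete. This sketch decays the state by α first and then applies the delta correction to the decayed state, so with β = 1 and a unit-length key the read-back for that key is exactly the new value. The state is a small value-by-key matrix held as nested lists, and the keys and values are made up.

```python
def read(S, k):
    """What the memory currently returns for key k: the product S k."""
    return [sum(s * kj for s, kj in zip(row, k)) for row in S]


def gated_delta_step(S, k, v, alpha, beta):
    """Decay the state by alpha, then move its answer for key k toward v."""
    S = [[alpha * s for s in row] for row in S]            # gated decay
    error = [p - vi for p, vi in zip(read(S, k), v)]       # current answer minus target
    return [[s - beta * e * kj for s, kj in zip(row, k)]   # rank-one correction
            for row, e in zip(S, error)]


S = [[0.0, 0.0], [0.0, 0.0]]
k1, k2 = [1.0, 0.0], [0.0, 1.0]                            # orthogonal unit keys

S = gated_delta_step(S, k1, [3.0, 4.0], alpha=1.0, beta=1.0)   # store k1 -> (3, 4)
S = gated_delta_step(S, k2, [5.0, 6.0], alpha=0.5, beta=1.0)   # store k2, fading k1
S = gated_delta_step(S, k1, [7.0, 8.0], alpha=1.0, beta=1.0)   # rewrite k1 only

print(read(S, k1), read(S, k2))
```

The last step replaces what is stored under k1 without touching k2. A decay-only gate cannot do that, because it scales every entry of the state by the same factor. The exact overwrite relies on the keys being orthogonal and unit length; with overlapping keys the correction also leaks into neighbouring entries.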

THE HONEST CAVEATS

The optimal attention ratio and placement are unsolved: "~7–10%, dispersed" is a robust heuristic, not a derived law (Qwen3-Next sits near 25%). The recall limit is relaxed, not removed. And serving is harder: two cache regimes (a growing KV-cache and a recurrent state) complicate batching, quantization, and speculative decoding.

Vendor numbers, read with care: the roughly 2–8× speedups come from the labs shipping the models, on settings they chose. They're corroborated across NVIDIA, AI21, and Alibaba, so directionally trustworthy, but treat exact multipliers as optimistic.

01 · 06 — you made it

You ran a state-space model.

Attention's quadratic bill. The recurrence — fade, write, read. The selective gate that lets the input steer memory. The fixed-state recall limit you can't train away. And the hybrid answer that ships: a few attention checkpoints over a sea of cheap Mamba. You now hold the model's other engine.
