Attention re-reads the entire past for every new token — flawless memory, but a bill that grows quadratically. A state-space model keeps one running summary instead and updates it token by token: constant memory, linear time. Mamba made that running summary selective — able to choose what to remember and what to forget — and suddenly recurrence was competitive again.
Tokens arrive one at a time. Each one nudges a single hidden state — a fixed-size vector — then is thrown away. The state alone carries everything the model remembers of the past. No growing cache, no looking back.
To generate token L, attention compares it against all L−1 tokens before it, and it stores every one of them in a KV-cache that grows with the context. Each new token costs O(L) work, so the whole sequence adds up to O(L²) compute and O(L) memory. Drag the context length and watch the two curves diverge.
At a few thousand tokens the gap is harmless. At 128k–1M tokens the KV-cache swamps GPU memory and the quadratic term dominates latency — this is exactly where long-context serving gets expensive.
The whole pitch of state-space models is to replace that L² with a plain L, and replace the growing cache with one fixed-size state. The question is what you give up — we'll come to that.
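To put rough numbers on that gap, here is a minimal sketch that counts work in abstract units. The helper names and the `state_size=16` default are illustrative assumptions, not measurements of any real model.

```python
def attention_cost(L):
    """Cost of generating L tokens with full attention (abstract units)."""
    compute = L * (L - 1) // 2  # token t compares against the t-1 tokens before it
    cache_entries = L           # the KV-cache keeps one entry per token
    return compute, cache_entries


def ssm_cost(L, state_size=16):
    """Cost of generating L tokens with a fixed-size state (abstract units)."""
    compute = L * state_size    # one fixed-size state update per token
    memory = state_size         # the state never grows
    return compute, memory


for L in (4_000, 128_000, 1_000_000):
    attn_compute, attn_cache = attention_cost(L)
    ssm_compute, ssm_memory = ssm_cost(L)
    print(f"L={L:>9,}  attention: {attn_compute:.2e} ops, cache {attn_cache:,}"
          f"  |  SSM: {ssm_compute:.2e} ops, state {ssm_memory}")
```

Doubling the context roughly quadruples the attention count, while the SSM count only doubles and its memory stays put.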
An SSM keeps a single hidden state h. For each new token it does exactly three things: fade the old state a little, add in the new token, then read an output. That is the whole recurrence: three small matrices, repeated.
Ā (fade): multiplies the old state before the new token lands. Close to 1 keeps long memory; close to 0 forgets fast. This is the decay.
B̄ (write): projects the incoming token xₜ into the state. hₜ = Ā·hₜ₋₁ + B̄·xₜ, that is, fade, then add. One step, every token.
C (read): reads the current state into the layer's output: yₜ = C·hₜ. The state is private; C decides what's exposed.
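The fade, write, read loop can be sketched in a few lines. This version assumes a diagonal Ā (one decay per state dimension) and a scalar input per token, which keeps it in plain Python; the function names are illustrative.

```python
def ssm_step(h, x, A_bar, B_bar, C):
    """One token: fade the old state, add the new input, read an output."""
    h_new = [a * h_i + b * x for a, h_i, b in zip(A_bar, h, B_bar)]  # h_t = A_bar*h_{t-1} + B_bar*x_t
    y = sum(c * h_i for c, h_i in zip(C, h_new))                     # y_t = C*h_t
    return h_new, y


def run_ssm(xs, A_bar, B_bar, C):
    """Feed tokens one at a time; only the fixed-size state h is carried forward."""
    h = [0.0] * len(A_bar)
    ys = []
    for x in xs:
        h, y = ssm_step(h, x, A_bar, B_bar, C)
        ys.append(y)
    return ys, h


# Two state dimensions: one forgets fast (0.5), one holds on (0.9).
ys, h = run_ssm([1.0, 0.0, 0.0], A_bar=[0.5, 0.9], B_bar=[1.0, 1.0], C=[1.0, 1.0])
print(ys, h)
```

After the single non-zero token, the fast dimension halves on every step while the slow one loses only 10%, which is the "close to 1 keeps long memory" behaviour in miniature.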
The same operation has two equivalent shapes. As a recurrence (left) it runs one step at a time — perfect for generation, constant memory. As a convolution (right) the whole sequence is processed in parallel — perfect for training on a GPU.
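A quick way to convince yourself the two shapes agree is to compute both on the same input. The sketch below assumes a diagonal Ā and fixed parameters (the equivalence only holds when Ā, B̄, C do not change from token to token); the function names are illustrative.

```python
def recurrent_outputs(xs, A_bar, B_bar, C):
    """Step-by-step view: carry the state h through the sequence."""
    h = [0.0] * len(A_bar)
    ys = []
    for x in xs:
        h = [a * h_i + b * x for a, h_i, b in zip(A_bar, h, B_bar)]
        ys.append(sum(c * h_i for c, h_i in zip(C, h)))
    return ys


def ssm_kernel(length, A_bar, B_bar, C):
    """K[k] = C * A_bar^k * B_bar: how much an input from k steps ago still counts."""
    return [sum(c * (a ** k) * b for a, b, c in zip(A_bar, B_bar, C))
            for k in range(length)]


def convolutional_outputs(xs, A_bar, B_bar, C):
    """All-at-once view: y_t = sum_k K[k] * x_{t-k}; each y_t is independent."""
    K = ssm_kernel(len(xs), A_bar, B_bar, C)
    return [sum(K[k] * xs[t - k] for k in range(t + 1)) for t in range(len(xs))]


xs = [1.0, -2.0, 0.5, 3.0]
params = dict(A_bar=[0.5, 0.9], B_bar=[1.0, 0.3], C=[0.7, -1.0])
print(recurrent_outputs(xs, **params))
print(convolutional_outputs(xs, **params))
```

Both lines print the same numbers up to floating-point rounding. In the convolutional form no output depends on another output, which is what lets a GPU compute them all at once.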
Classical SSMs (S4) exploited this duality, but their matrices were fixed — the same fade and write for every token, regardless of content. That made them fast but a bit dumb: they couldn't choose to pay attention to one token over another. Mamba fixes precisely that.
Here is a real 8-dimensional SSM state. Press Step → to feed the next token: the bars are the hidden state, fading by Ā and gaining B̄·xₜ. Flip the selective gate on to give Mamba its superpower — content-dependent forgetting.
With the gate off, every token fades the state by the same amount — a plain S4-style SSM. With it on, an [important] token writes hard and barely decays, while filler tokens fade fast. That input-dependent choice is selection — the heart of Mamba.
Classical SSMs use the same Ā, B̄, C for every token. Mamba makes them functions of the current token — so the model can decide, per token, how much to write and how long to keep it. Drag the "importance" of a token and watch its trace in the state.
A fixed SSM treats "the" and a person's name identically — it has no way to filter. Selection lets the gate Ā snap toward 1 (hold this) or toward 0 (drop this) based on content, which is what makes Mamba good at language. The cost: the convolution view is lost, so Mamba uses a hardware-aware parallel scan to stay fast on a GPU.
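Why a scan still parallelizes when the decay changes per token: each step hₜ = aₜ·hₜ₋₁ + bₜ is a small affine map, and composing two such maps is associative. The sketch below shows only that idea on a scalar state, using a simple doubling scan; it is not Mamba's actual GPU kernel, and the function names are illustrative.

```python
def combine(earlier, later):
    """Compose h -> a1*h + b1 followed by h -> a2*h + b2 into one affine map."""
    a1, b1 = earlier
    a2, b2 = later
    return (a1 * a2, a2 * b1 + b2)


def sequential_states(a, b):
    """Reference: the plain left-to-right recurrence, starting from h = 0."""
    h, out = 0.0, []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return out


def scan_states(a, b):
    """Doubling scan: about log2(L) rounds, each round's updates are independent."""
    elems = list(zip(a, b))
    step = 1
    while step < len(elems):
        elems = [combine(elems[i - step], elems[i]) if i >= step else elems[i]
                 for i in range(len(elems))]
        step *= 2
    return [b_acc for _, b_acc in elems]  # with h0 = 0, the state is the offset term


a = [0.9, 0.1, 0.5, 0.99, 0.3]   # per-token decays (these would come from the input)
b = [1.0, 2.0, -1.0, 0.5, 0.0]   # per-token writes
print(sequential_states(a, b))
print(scan_states(a, b))
```

Inside each round, every position can be updated at the same time, so the sequential dependency shrinks from L steps to roughly log₂(L) rounds.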
The decay Ā is derived from an input-dependent step size Δ: a big Δ means "focus here, overwrite the state"; a tiny Δ means "ignore, let the past coast through." Selection is just Δ, B, C becoming data-dependent.
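Here is a scalar sketch of how a step size can steer both the decay and the write. It assumes Ā = exp(Δ·A) with a negative A and the simplified B̄ = Δ·B; the `score`, `w`, and `bias` values are hypothetical stand-ins for a learned projection of the token.

```python
import math


def softplus(z):
    """Smooth map to a positive number, so the step size is always > 0."""
    return math.log1p(math.exp(z))


def discretize(delta, A=-1.0, B=1.0):
    """Turn a step size into this token's decay and write strength."""
    A_bar = math.exp(delta * A)  # A < 0, so A_bar falls from 1 toward 0 as delta grows
    B_bar = delta * B            # simplified (Euler-style) write strength
    return A_bar, B_bar


def selective_step(h, x, score, w=4.0, bias=-2.0):
    """One scalar selective update; score, w and bias are hypothetical."""
    delta = softplus(w * score + bias)
    A_bar, B_bar = discretize(delta)
    return A_bar * h + B_bar * x, delta, A_bar


for score in (0.0, 1.0):
    h_new, delta, A_bar = selective_step(h=1.0, x=1.0, score=score)
    print(f"score={score}: delta={delta:.3f}  A_bar={A_bar:.3f}  h_new={h_new:.3f}")
```

A low score gives a small Δ, so Ā stays near 1 and the old state coasts through. A high score gives a large Δ, so the old state is mostly replaced by the new token.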
A fixed-size state is the whole advantage — and the whole problem. You cannot losslessly compress an unbounded past into bounded memory. Ask an SSM "what was the exact phone number 70 tokens ago?" and it may have already overwritten it. Push the sequence longer and watch recall fall off the card.
SSM: the ■ needle was written early. As newer tokens stream in they overwrite old slots, and past capacity the needle is lost.
Attention: every token is kept verbatim, so the needle is always retrievable, at the cost of a cache that grows forever.
This is a hard limit, not a training artifact. No amount of data fixes the physics: bounded state means lossy memory. Pure attention pays in compute to keep perfect recall; pure SSM saves the compute but blurs the far past. The next section is how 2026 production models get both.
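One toy way to see the fade: track how much of a single written value survives in one state channel with a fixed decay. This is an assumption-heavy simplification (one channel, constant Ā, no selection) and the function names are illustrative. A selective model can hold a needle longer by keeping Ā near 1, but the state it has to share among all needles is still bounded.

```python
def needle_weight(a_bar, tokens_since):
    """Fraction of the original write still present after this many tokens."""
    return a_bar ** tokens_since


def tokens_until_lost(a_bar, threshold=0.01):
    """How many tokens until the needle's weight drops below the threshold."""
    n, weight = 0, 1.0
    while weight >= threshold:
        weight *= a_bar
        n += 1
    return n


for a_bar in (0.5, 0.9, 0.99):
    print(f"A_bar={a_bar}: weight after 70 tokens = {needle_weight(a_bar, 70):.2e}, "
          f"below 1% after {tokens_until_lost(a_bar)} tokens")
```

With a decay of 0.9, the needle is under 1% of its original strength after 44 tokens, so the phone number from 70 tokens ago is already buried under whatever arrived since.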
The 2024–26 answer is neither/both: build a stack of mostly cheap Mamba layers and sprinkle in a handful of full-attention layers (~7–10%) as occasional perfect-recall checkpoints. Drag the attention budget and watch the schedule — and the tradeoff — change.
| Model · year | Cheap mixer | Attention | Attn fraction | Context | Speedup claim |
|---|---|---|---|---|---|
| Jamba · 2024 | Mamba-1 | 1 per 8 layers | ~12.5% | 256k | ~3× vs Mixtral 8×7B (long ctx) |
| Mamba-2-Hybrid · 2024 | Mamba-2 | 4 of 28 mixers | ~14% of mixers (~7% of all 56 layers) | 32k | up to ~8× decode |
| Nemotron Nano 2 · 2025 | Mamba-2 | 6 of 62 layers | ~9.7% | 128k | ~6.3× vs Qwen3-8B |
| Qwen3-Next · 2025 | Gated DeltaNet | 1 per 4 blocks | ~25% | 256k+ | large long-ctx cache speedup |
| Nemotron 3 Super · 2026 | Mamba-2 + MoE | dispersed | small | 1M | 2.2–7.5× vs GPT-OSS-120B |
Evenly spacing the few attention layers beats stacking them. Nemotron Nano 2 explicitly disperses its 6 attention layers across 62.
Early and late attention layers contribute less than middle ones — keep a recall checkpoint reachable from any query position.
Hybrid sequence-mixing + sparse mixture-of-experts FFNs compose: linear-ish sequence cost and sublinear parameter cost at once.
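The "disperse, don't stack" idea can be sketched as a schedule builder. The even-spacing formula below is an illustrative assumption, not the placement any listed model actually uses; it centers each attention layer in its own slice of the stack, so none sits at the very first or last position.

```python
def hybrid_schedule(n_layers, n_attention):
    """Label each layer 'attn' or 'mamba', spreading attention layers evenly."""
    if not 0 <= n_attention <= n_layers:
        raise ValueError("n_attention must be between 0 and n_layers")
    # Centre of the i-th of n_attention equal slices of the stack.
    attn_positions = {int((i + 0.5) * n_layers / n_attention)
                      for i in range(n_attention)}
    return ["attn" if i in attn_positions else "mamba" for i in range(n_layers)]


schedule = hybrid_schedule(n_layers=62, n_attention=6)
positions = [i for i, kind in enumerate(schedule) if kind == "attn"]
fraction = schedule.count("attn") / len(schedule)
print(positions, f"attention fraction = {fraction:.1%}")
```

With 6 attention layers in 62, the fraction comes out just under 10%, matching the Nemotron Nano 2 row in the table. Only those few layers need a growing KV-cache; every other layer carries a fixed-size state.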
The "cheap, constant-state" slot has competing fillers. They differ in how the running state is updated — and the choice changes the hybrid's character.
| Mixer | Lineage | State update | Strength | Used by |
|---|---|---|---|---|
| Mamba-2 | selective SSM | scalar gated decay + write | simple, tensor-core friendly | Nemotron-H / Nano 2 |
| S4 | structured SSM | fixed Ā, B̄, C (no selection) | fast convolution, long-range | early SSMs, audio |
| Gated DeltaNet | linear attention | gated decay + delta-rule rewrite | surgical memory updates | Qwen3-Next / 3.5 |
Mamba-2's gate can only fade memory uniformly. The delta rule adds a second move: surgically overwrite one key's value. The gate decides how much of the old state to forget; the delta term decides what specifically to rewrite — which is exactly what associative recall needs.
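The difference shows up as soon as the same key is written twice. Below is a toy sketch on a 2×2 matrix state, assuming a unit-length key and the update S ← α·(S − β·(S·k)·kᵀ) + β·v·kᵀ for the delta rule; `additive_write` stands in for a plain "decay, then add" update. The function names and values are illustrative.

```python
def read(S, k):
    """Look up the value stored under key k: S @ k."""
    return [sum(s * k_j for s, k_j in zip(row, k)) for row in S]


def additive_write(S, k, v, alpha=1.0):
    """Decay everything by alpha, then add the new key/value outer product."""
    return [[alpha * s + v_i * k_j for s, k_j in zip(row, k)]
            for row, v_i in zip(S, v)]


def delta_write(S, k, v, alpha=1.0, beta=1.0):
    """Decay by alpha, erase what key k currently returns, then write v."""
    old_v = read(S, k)
    return [[alpha * (s - beta * o_i * k_j) + beta * v_i * k_j
             for s, k_j in zip(row, k)]
            for row, v_i, o_i in zip(S, v, old_v)]


zeros = [[0.0, 0.0], [0.0, 0.0]]
key = [1.0, 0.0]                      # unit-length key
old_value, new_value = [1.0, 2.0], [5.0, 5.0]

S_add = additive_write(additive_write(zeros, key, old_value), key, new_value)
S_delta = delta_write(delta_write(zeros, key, old_value), key, new_value)
print("additive read:", read(S_add, key))
print("delta read:   ", read(S_delta, key))
```

The additive state returns the sum of both values (6, 7), a blur of old and new. The delta-rule state returns exactly the new value (5, 5), because the old association was erased before the new one was written.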
The optimal attention ratio and placement are unsolved — "~7–8%, dispersed" is a robust heuristic, not a derived law (Qwen3-Next sits near 25%). The recall ceiling is reduced, not removed. And serving is harder: two cache regimes (a growing KV-cache and a recurrent state) complicate batching, quantization, and speculative decoding.
Vendor numbers, read with care: the 3–8× speedups come from the labs shipping the models, on settings they chose. They're corroborated across NVIDIA, AI21, and Alibaba — so directionally trustworthy — but treat exact multipliers as optimistic.
Attention's quadratic bill. The recurrence — fade, write, read. The selective gate that lets the input steer memory. The fixed-state recall limit you can't train away. And the hybrid answer that ships: a few attention checkpoints over a sea of cheap Mamba. You now hold the model's other engine.
The O(L²) machine Mamba is built to replace. Revisit the KV-cache and softmax attention now that you know what the cheap alternative gives up.