What it is
Mamba is a sequence modeling architecture built on structured state space models (SSMs), a class of recurrent models that process tokens one at a time and maintain a fixed-size hidden state rather than attending over the full history. The key innovation is selective state spaces: the parameters that govern the state update are input-dependent, letting the model decide what to remember and what to discard at each step. The result is inference whose total time grows linearly with sequence length and whose memory per generated token stays constant, a structural contrast to the O(n²) cost of standard Transformer attention.
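To make the constant-memory claim concrete, the following sketch compares the inference-time memory of an attention key/value cache with that of a recurrent state. Every dimension here (`n_layers`, `d_model`, `d_inner`, `d_state`, two bytes per element) is a hypothetical value chosen for illustration and does not describe any particular released model. The attention formula assumes plain multi-head attention with no key/value sharing, and the SSM formula ignores small auxiliary buffers.

```python
def kv_cache_bytes(seq_len, n_layers, d_model, bytes_per_elem=2):
    # Attention keeps one key and one value vector per token, per layer,
    # so the cache grows linearly with the number of tokens seen so far.
    return 2 * seq_len * n_layers * d_model * bytes_per_elem


def ssm_state_bytes(n_layers, d_inner, d_state, bytes_per_elem=2):
    # A state-space layer keeps a single (d_inner x d_state) state.
    # Note that seq_len does not appear: the size is fixed.
    return n_layers * d_inner * d_state * bytes_per_elem


if __name__ == "__main__":
    # Hypothetical model shape, for illustration only.
    n_layers, d_model, d_inner, d_state = 32, 4096, 8192, 16
    state = ssm_state_bytes(n_layers, d_inner, d_state)
    for seq_len in (1_000, 32_000, 256_000):
        cache = kv_cache_bytes(seq_len, n_layers, d_model)
        print(f"{seq_len:>7} tokens: KV cache {cache:>15,} B, SSM state {state:,} B")
```

Under these assumptions the cache size scales with the token count while the recurrent state does not, which is the property the paragraph above describes.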
How it works
At each layer, Mamba replaces the attention sublayer with a discretized state-space recurrence. For a sequence of length T, the hidden state h is updated as:
`` h_t = A · h_{t-1} + B · x_t ``
`` y_t = C · h_t ``
where A and B are the discretized state-transition and input matrices and C is the output projection. In Mamba, B, C, and the step size used for discretization are computed from the input, so the effective A, B, and C vary per token rather than being fixed; this is what gives the model its selectivity. During training the recurrence can be parallelized via a scan algorithm; at inference it runs as a true RNN, one step at a time with constant memory. Mamba-2 refined the parameterization for better hardware utilization. Subsequent work (dynamic short convolutions, Gated DeltaNet-2) reports that augmenting or replacing this core with additional gating mechanisms can push quality further.
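The recurrence and its scan-based parallel form can be sketched in a few lines. This is a minimal single-channel illustration, not the reference implementation. It assumes a diagonal state matrix `A` with negative entries, a per-token step size `delta` (a name not used in the text above), and the common simplification in which A is discretized as exp(delta · A) and B as delta · B. In a real model `delta`, `B`, and `C` would come from learned projections of the input; here they are passed in as plain lists.

```python
import math


def discretize(delta_t, A, B_t):
    # Per-token discretization: exponential decay for A, Euler step for B.
    a_bar = [math.exp(delta_t * a) for a in A]
    b_bar = [delta_t * b for b in B_t]
    return a_bar, b_bar


def recurrent(x, delta, A, B, C):
    # Inference-style loop: one token at a time, state h has fixed size len(A).
    h = [0.0] * len(A)
    ys = []
    for t, x_t in enumerate(x):
        a_bar, b_bar = discretize(delta[t], A, B[t])
        h = [a * h_n + b * x_t for a, h_n, b in zip(a_bar, h, b_bar)]
        ys.append(sum(c * h_n for c, h_n in zip(C[t], h)))
    return ys


def combine(left, right):
    # Compose two affine maps h -> a*h + b, applying `left` first.
    a1, b1 = left
    a2, b2 = right
    return (a1 * a2, a2 * b1 + b2)


def inclusive_scan(elems):
    # Hillis-Steele scan: about log2(T) rounds; within a round every
    # position could be computed in parallel.
    out = list(elems)
    step = 1
    while step < len(out):
        out = [out[i] if i < step else combine(out[i - step], out[i])
               for i in range(len(out))]
        step *= 2
    return out


def scanned(x, delta, A, B, C):
    # Training-style evaluation: same outputs as `recurrent`, via a scan.
    T, N = len(x), len(A)
    ys = [0.0] * T
    for n in range(N):  # state dimensions are independent because A is diagonal
        elems = []
        for t in range(T):
            a_bar, b_bar = discretize(delta[t], A, B[t])
            elems.append((a_bar[n], b_bar[n] * x[t]))
        hs = inclusive_scan(elems)
        for t in range(T):
            ys[t] += C[t][n] * hs[t][1]  # the offset term equals h_t when h_0 = 0
    return ys
```

The scan works because each step is an affine map of the previous state, and composing affine maps is associative. Calling `recurrent` and `scanned` on the same inputs should give the same outputs up to floating-point rounding.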
Why it matters
The practical case for Mamba rests on three properties:
1. Inference efficiency at long context. Codestral Mamba (7.3B, Mistral AI, July 2024) was tested on in-context retrieval at up to 256k tokens with linear-time inference, a workload that is far more expensive for a standard Transformer with full attention. TTT-E2E, a 2026 meta-learning long-context method from Astera/NVIDIA/Stanford/Berkeley/UCSD, explicitly benchmarks itself against Mamba-2 inference speed and matches it, which suggests that Mamba-2 serves as a practical efficiency reference point for the field.
2. Viability at scale. Falcon Mamba (7B, Technology Innovation Institute, August 2024) was the first attention-free model at that parameter count to match or exceed transformer-based models on standard benchmarks — a milestone that moved SSMs from "interesting at small scale" to "competitive at deployment scale."
3. Composability. Mamba layers mix cleanly with Transformer blocks. NVIDIA's Nemotron 3 Nano 4B ships as a hybrid Mamba-Transformer for on-device inference, and dynamic short convolutions (a 2026 technique) improve Mamba-2 perplexity with a measured 1.33–1.60× compute advantage when applied to its linear RNN layers.
Variants and the competitive landscape
The subquadratic sequence modeling space has become crowded, with Mamba as the common baseline:
- Mamba-3 (1.5B, CMU + Together.AI, March 2026) improves accuracy over Mamba-2 while staying within the SSM family.
- Gated DeltaNet-2 (NVIDIA Labs, May 2026) decouples the erase and write operations in the delta-rule update into independent channel-wise gates, outperforming Mamba-2, Mamba-3, Gated DeltaNet, and Kimi Delta Attention on language modeling, commonsense reasoning, and long-context RULER needle-in-a-haystack retrieval at 1.3B parameters trained on 100B FineWeb-Edu tokens.
- xLSTM is reported to outperform Mamba-2 and Gated DeltaNet consistently on code pre-training, LLM distillation, and time-series foundation model pre-training, with its authors attributing the gap to more flexible memory correction via its gating scheme.
- Dynamic short convolutions (June 2026) improve Mamba-2 perplexity as a drop-in augmentation, with scaling-law fits showing a 1.33× compute advantage.
The pattern is clear: Mamba established the template and the benchmark, and the field is now iterating on its gating and memory dynamics.
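Because the delta-rule update mentioned in the list above may be unfamiliar, here is a minimal sketch of the classic form with a single scalar decay gate. It is not the Gated DeltaNet-2 update, whose channel-wise erase and write gates are not specified in this section. The function names and the `alpha` and `beta` parameters are illustrative assumptions.

```python
def matvec(S, k):
    # Read from the memory: returns S @ k.
    return [sum(s_ij * k_j for s_ij, k_j in zip(row, k)) for row in S]


def delta_rule_step(S, k, v, beta=1.0, alpha=1.0):
    # One update of the memory matrix S (d_v x d_k) for a key/value pair.
    #   erase: subtract what the memory currently returns for k (scaled by beta)
    #   write: add the new value v under key k (scaled by beta)
    #   alpha: scalar decay applied to the erased memory (assumed gate)
    old = matvec(S, k)
    return [
        [alpha * (S[i][j] - beta * old[i] * k[j]) + beta * v[i] * k[j]
         for j in range(len(k))]
        for i in range(len(v))
    ]


def additive_step(S, k, v):
    # Purely additive update for comparison: writes without erasing.
    return [[S[i][j] + v[i] * k[j] for j in range(len(k))]
            for i in range(len(v))]


def zeros(d_v, d_k):
    return [[0.0] * d_k for _ in range(d_v)]
```

With a unit-norm key and `beta=1`, writing two different values under the same key leaves only the second one retrievable under the delta rule, whereas the additive update returns their sum. That overwrite behavior is the "memory correction" these variants aim to make more flexible.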
Beyond language: edge and domain applications
MambaGaze (May 2026) applies bidirectional Mamba-2 to eye-gaze time-series for cognitive load assessment, achieving 76.8% and 73.1% accuracy on the CLARE and CL-Drive datasets — outperforming CNN, Transformer, ResNet, and VGG baselines by 4–12 percentage points — while running at 43–68 FPS under 7.5W on NVIDIA Jetson hardware. This illustrates Mamba's fit for edge and safety-critical deployments where both latency and power budgets are constrained.
Tradeoffs and when not to use it
Mamba's recurrent formulation is its strength and its limitation. The fixed-size hidden state means the model must compress all prior context into a bounded representation, so it cannot, in principle, perform exact retrieval over arbitrary history the way attention can. TTT-E2E, a method that shares the bounded-state philosophy, fails on Needle-in-a-Haystack retrieval beyond 8,000 tokens, which illustrates the class-level risk. For tasks requiring precise recall of specific tokens deep in a long context, hybrid architectures or attention-augmented designs may be preferable. For throughput-sensitive, long-sequence workloads where approximate recall suffices (streaming inference, on-device deployment, time-series), Mamba and its descendants remain a strong choice.
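The bounded-state limitation can be illustrated with a toy associative memory. This is a deliberately simplified, purely additive matrix memory, not Mamba's actual state update, and all names are hypothetical. With keys of dimension d, at most d keys can be mutually orthogonal, so storing more than d key/value pairs forces interference between them.

```python
def store(keys, values):
    # Build a (d_v x d_k) memory as the sum of outer products v k^T.
    d_v, d_k = len(values[0]), len(keys[0])
    S = [[0.0] * d_k for _ in range(d_v)]
    for k, v in zip(keys, values):
        for i in range(d_v):
            for j in range(d_k):
                S[i][j] += v[i] * k[j]
    return S


def read(S, k):
    # Retrieve the value associated with key k.
    return [sum(row[j] * k[j] for j in range(len(k))) for row in S]


def recall_error(keys, values):
    # Worst-case absolute error when reading every stored key back.
    S = store(keys, values)
    return max(abs(r - t)
               for k, v in zip(keys, values)
               for r, t in zip(read(S, k), v))


def basis(d):
    # d mutually orthogonal unit keys: the most a d-dimensional key space allows.
    return [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]


if __name__ == "__main__":
    d = 4
    values = [[float(i + 1)] for i in range(d)]
    print("d items:  ", recall_error(basis(d), values))
    # A fifth unit-norm key cannot be orthogonal to all four basis keys.
    keys5 = basis(d) + [[0.5] * d]
    print("d+1 items:", recall_error(keys5, values + [[5.0]]))
```

Recall is exact while the number of stored items fits the key dimension and degrades once it does not. Attention avoids this by keeping every key and value, at the cost of memory that grows with context length.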
Where it's heading
Mamba is no longer the frontier architecture — Gated DeltaNet-2 and xLSTM have moved ahead on benchmarks — but it remains the reference point the field measures against, and the Mamba lineage (Mamba-3, hybrid Mamba-Transformer) continues to ship in production. The trajectory suggests the SSM design space will keep fragmenting into specialized variants optimized for particular hardware, context lengths, and task types, with Mamba's core ideas — selective gating, linear recurrence, hardware-aware parallelism — persisting as foundational primitives even as the specific architecture evolves.




