Mamba: State Space Models as a Practical Alternative to Transformers

TL;DR: Mamba is a state space model (SSM) architecture that replaces the quadratic-cost attention mechanism of Transformers with a recurrent, linear-time computation, which makes inference faster and keeps memory use constant regardless of sequence length. It has matured from a research curiosity into a deployed architecture, spawning production models and a competitive lineage of variants, while also becoming a benchmark target that newer linear-attention designs race to beat.

Key takeaways

Mamba processes sequences in linear time and constant memory, versus the O(n²) cost of standard attention — a structural advantage at long context.
Falcon Mamba (7B, TII, Aug 2024) was the first attention-free model at that scale to match transformer baselines; Codestral Mamba (7.3B, Mistral, Jul 2024) demonstrated linear-time code inference tested up to 256k tokens.
Mamba-3 (1.5B, CMU + Together.AI) improved accuracy over Mamba-2; Nemotron 3 Nano 4B (NVIDIA) ships a hybrid Mamba-Transformer for on-device use.
Gated DeltaNet-2 (NVIDIA Labs, May 2026) outperforms Mamba-2 and Mamba-3 on language modeling, commonsense reasoning, and long-context RULER retrieval at 1.3B parameters.
Dynamic short convolutions improve Mamba-2 perplexity with a measured compute advantage; xLSTM outperforms Mamba-2 on complex sequence tasks — the subquadratic space is actively contested.
TTT-E2E (a meta-learning long-context method) matches Mamba-2 inference speed, which positions Mamba-2 as a practical efficiency baseline for long-context research.

What it is

Mamba is a sequence modeling architecture built on structured state space models (SSMs), a class of recurrent models that process tokens one at a time, maintaining a fixed-size hidden state rather than attending over the full history. The key innovation is selective state spaces: the model's gating is input-dependent, letting it decide what to remember and what to discard at each step. The result is linear-time, constant-memory inference regardless of sequence length, a structural contrast to the O(n²) cost of standard Transformer attention.

How it works

At each layer, Mamba replaces the attention sublayer with a discretized state-space recurrence. For a sequence of length T, the hidden state h is updated as:

`h_t = A · h_{t-1} + B · x_t`
`y_t = C · h_t`

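where A and B are the discretized state-transition and input terms and C is the readout. In Mamba, B, C, and the discretization step size Δ are computed from the input, so the effective A and B vary per token even though the underlying A matrix is a learned parameter; this is what gives the model its selectivity. During training the recurrence can be parallelized via a scan algorithm; at inference it runs as a true RNN, one step at a time with constant memory. Mamba-2 refined the parameterization for better hardware utilization. Subsequent work (dynamic short convolutions, Gated DeltaNet-2) has shown that augmenting or replacing this core with additional gating mechanisms can push quality further.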
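The recurrence above can be sketched in a few lines of NumPy. This is a minimal illustration, not Mamba's actual kernel: it assumes a diagonal state transition with fixed negative decay rates `A`, and hypothetical linear projections `W_delta`, `W_B`, and `W_C` that derive the step size, `B`, and `C` from the input. It runs the same coefficients through a step-by-step loop (the inference view) and through a doubling scan over affine maps, one simple way to see why the training-time computation can be parallelized.

```python
import numpy as np

def softplus(z):
    # Numerically stable log(1 + exp(z)); keeps the step size non-negative.
    return np.logaddexp(0.0, z)

def selective_coeffs(x, A, W_delta, W_B, W_C):
    """Map inputs x of shape (T, D) to per-step recurrence coefficients.

    Hypothetical parameterization for illustration: A is a fixed (D, N)
    matrix of negative decay rates; delta, B and C are computed from x.
    """
    delta = softplus(x @ W_delta)                           # (T, D)
    B = x @ W_B                                             # (T, N)
    C = x @ W_C                                             # (T, N)
    a = np.exp(delta[:, :, None] * A[None, :, :])           # (T, D, N), in (0, 1]
    b = delta[:, :, None] * B[:, None, :] * x[:, :, None]   # (T, D, N)
    return a, b, C

def recurrent_scan(a, b):
    """h_t = a_t * h_{t-1} + b_t with a zero initial state, one step at a time."""
    h = np.zeros_like(b[0])          # fixed-size state, independent of T
    states = np.empty_like(b)
    for t in range(b.shape[0]):
        h = a[t] * h + b[t]
        states[t] = h
    return states

def doubling_scan(a, b):
    """Same recurrence, computed in O(log T) passes by composing affine maps."""
    a, b = a.copy(), b.copy()
    shift = 1
    while shift < b.shape[0]:
        new_a, new_b = a.copy(), b.copy()
        # Compose each map with the one `shift` steps earlier:
        # (a2, b2) after (a1, b1) is (a2 * a1, a2 * b1 + b2).
        new_b[shift:] = a[shift:] * b[:-shift] + b[shift:]
        new_a[shift:] = a[shift:] * a[:-shift]
        a, b = new_a, new_b
        shift *= 2
    return b

def readout(states, C):
    # y_t = C_t . h_t, summed over the state dimension N.
    return np.einsum("tdn,tn->td", states, C)

rng = np.random.default_rng(0)
T, D, N = 12, 4, 3
x = rng.normal(size=(T, D))
A = -np.exp(rng.normal(size=(D, N)))
W_delta = 0.5 * rng.normal(size=(D, D))
W_B = 0.5 * rng.normal(size=(D, N))
W_C = 0.5 * rng.normal(size=(D, N))

a, b, C = selective_coeffs(x, A, W_delta, W_B, W_C)
y_recurrent = readout(recurrent_scan(a, b), C)
y_parallel = readout(doubling_scan(a, b), C)
print(np.allclose(y_recurrent, y_parallel))
```

The loop only ever holds one `(D, N)` state, which is the constant-memory property; the doubling scan trades that for parallelism across the sequence.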

Why it matters

The practical case for Mamba rests on three properties:

1. Inference efficiency at long context. Codestral Mamba (7.3B, Mistral AI, July 2024) demonstrated in-context retrieval tested up to 256k tokens with linear-time inference — a workload that would be prohibitively expensive for a standard Transformer. TTT-E2E, a 2026 meta-learning long-context method from Astera/NVIDIA/Stanford/Berkeley/UCSD, explicitly benchmarks itself against Mamba-2 inference speed and matches it, confirming Mamba-2 as the practical efficiency reference point for the field.

2. Viability at scale. Falcon Mamba (7B, Technology Innovation Institute, August 2024) was the first attention-free model at that parameter count to match or exceed transformer-based models on standard benchmarks — a milestone that moved SSMs from "interesting at small scale" to "competitive at deployment scale."

3. Composability. Mamba layers mix cleanly with Transformer blocks. NVIDIA's Nemotron 3 Nano 4B ships as a hybrid Mamba-Transformer for on-device inference, and dynamic short convolutions (a 2026 technique) improve Mamba-2 perplexity with a measured 1.33–1.60× compute advantage when applied to its linear RNN layers.
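The first property above can be made concrete with back-of-the-envelope arithmetic. The sketch below compares the memory an attention model needs for its key/value cache with the memory a recurrent state needs, as context grows. All dimensions are hypothetical round numbers, not the configuration of any model named here; the attention side assumes full multi-head attention with no key/value sharing, and the recurrent side ignores small extras such as convolution buffers.

```python
def kv_cache_bytes(num_tokens, num_layers=32, d_model=4096, bytes_per_value=2):
    """Keys and values cached for every past token in every layer."""
    return 2 * num_layers * num_tokens * d_model * bytes_per_value

def ssm_state_bytes(num_layers=32, d_inner=8192, d_state=16, bytes_per_value=2):
    """One fixed-size recurrent state per layer, independent of sequence length."""
    return num_layers * d_inner * d_state * bytes_per_value

for num_tokens in (4_096, 65_536, 262_144):
    kv_gib = kv_cache_bytes(num_tokens) / 2**30
    ssm_mib = ssm_state_bytes() / 2**20
    print(f"{num_tokens:>7} tokens: KV cache {kv_gib:6.1f} GiB, SSM state {ssm_mib:.1f} MiB")
```

With these made-up dimensions the cache reaches 128 GiB at 262,144 tokens while the recurrent state stays at 8 MiB, which is the shape of the gap the text describes, not a measurement of any real system.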

Variants and the competitive landscape

The subquadratic sequence modeling space has become crowded, with Mamba as the common baseline:

Mamba-3 (1.5B, CMU + Together.AI, March 2026) improves accuracy over Mamba-2 while staying within the SSM family.
Gated DeltaNet-2 (NVIDIA Labs, May 2026) decouples the erase and write operations in the delta-rule update into independent channel-wise gates, outperforming Mamba-2, Mamba-3, Gated DeltaNet, and Kimi Delta Attention on language modeling, commonsense reasoning, and long-context RULER needle-in-a-haystack retrieval at 1.3B parameters trained on 100B FineWeb-Edu tokens.
xLSTM consistently outperforms Mamba-2 and Gated DeltaNet on code pre-training, LLM distillation, and time-series foundation model pre-training, with authors attributing the gap to more flexible memory correction via its gating scheme.
Dynamic short convolutions (June 2026) improve Mamba-2 perplexity as a drop-in augmentation, with scaling-law fits showing a 1.33× compute advantage.
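For readers unfamiliar with the delta-rule update mentioned in the list above, here is a minimal sketch of the classic rule next to a purely additive memory. It is the textbook form, with a single scalar `beta` controlling both how much of the old value is erased and how much of the new value is written. It does not attempt to reproduce the decoupled channel-wise gates attributed to Gated DeltaNet-2; the function names are illustrative.

```python
import numpy as np

def additive_step(S, k, v):
    """Additive linear-attention write: pile the new association on top."""
    return S + np.outer(v, k)

def delta_rule_step(S, k, v, beta):
    """Delta-rule write: correct the value currently stored under key k."""
    prediction = S @ k                       # what the memory returns for k now
    return S + beta * np.outer(v - prediction, k)

k = np.array([1.0, 0.0, 0.0])                # unit-norm key
v_old = np.array([1.0, 2.0])
v_new = np.array([-3.0, 0.5])
empty = np.zeros((2, 3))                     # (value_dim, key_dim) memory

S_add = additive_step(additive_step(empty, k, v_old), k, v_new)
S_delta = delta_rule_step(delta_rule_step(empty, k, v_old, 1.0), k, v_new, 1.0)

read_add = S_add @ k                         # old and new values blended together
read_delta = S_delta @ k                     # old value replaced by the new one
print(read_add, read_delta)
```

With `beta = 1` and a unit-norm key, the second delta-rule write fully overwrites the first, whereas the additive memory returns the sum of both values. This ability to correct stored associations is the kind of memory dynamic the variants above compete on.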

The pattern is clear: Mamba established the template and the benchmark, and the field is now iterating on its gating and memory dynamics.

Beyond language: edge and domain applications

MambaGaze (May 2026) applies bidirectional Mamba-2 to eye-gaze time-series for cognitive load assessment, achieving 76.8% and 73.1% accuracy on the CLARE and CL-Drive datasets — outperforming CNN, Transformer, ResNet, and VGG baselines by 4–12 percentage points — while running at 43–68 FPS under 7.5W on NVIDIA Jetson hardware. This illustrates Mamba's fit for edge and safety-critical deployments where both latency and power budgets are constrained.
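The latency and power figures above translate into a per-frame budget with simple arithmetic. The sketch below assumes the reported ranges hold jointly (every configuration runs at 43 FPS or faster while drawing under 7.5 W), which gives an upper bound on energy per frame; the helper names are illustrative.

```python
def frame_latency_ms(fps):
    """Average time available per frame at a given throughput."""
    return 1000.0 / fps

def energy_per_frame_joules(power_watts, fps):
    """Energy spent per frame when drawing power_watts at fps frames per second."""
    return power_watts / fps

POWER_CEILING_W = 7.5   # reported upper limit on power draw

for fps in (43, 68):
    latency = frame_latency_ms(fps)
    energy_bound = energy_per_frame_joules(POWER_CEILING_W, fps)
    print(f"{fps} FPS: {latency:.1f} ms per frame, under {energy_bound:.3f} J per frame")
```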

Tradeoffs and when not to use it

Mamba's recurrent formulation is its strength and its limitation. The fixed-size hidden state means the model must compress all prior context into a bounded representation, so it cannot, in principle, perform exact retrieval over arbitrary history the way attention can. TTT-E2E, a method that shares the bounded-state philosophy, fails on Needle-in-a-Haystack retrieval beyond 8,000 tokens, which illustrates the class-level risk. For tasks requiring precise recall of specific tokens deep in a long context, hybrid architectures or attention-augmented designs may be preferable. For throughput-sensitive, long-sequence workloads where approximate recall suffices (streaming inference, on-device deployment, time-series), Mamba and its descendants remain a strong choice.
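The bounded-state limitation can be illustrated with a toy experiment. The sketch below is a generic linear associative memory, not Mamba's actual state update: it stores key/value pairs as a sum of outer products in a fixed-size matrix and measures how far recalled values drift from the originals as more pairs are written. The dimensions and the use of random keys are assumptions chosen for illustration.

```python
import numpy as np

def recall_error(num_pairs, key_dim=16, value_dim=8, seed=0):
    """Mean recall error after storing num_pairs associations in a fixed-size state."""
    rng = np.random.default_rng(seed)
    keys = rng.normal(size=(num_pairs, key_dim))
    keys /= np.linalg.norm(keys, axis=1, keepdims=True)   # unit-norm keys
    values = rng.normal(size=(num_pairs, value_dim))

    # The whole history is compressed into one (value_dim, key_dim) matrix:
    # the sum over pairs of outer(value, key).
    state = values.T @ keys

    recalled = keys @ state.T          # row i is state @ keys[i]
    return float(np.mean(np.linalg.norm(recalled - values, axis=1)))

for num_pairs in (1, 4, 16, 64, 256):
    print(f"{num_pairs:>3} pairs stored: mean recall error {recall_error(num_pairs):.3f}")
```

A single stored pair is recalled exactly, but the state has the same size no matter how many pairs are written, so interference between keys grows with the number of stored items. Attention avoids this by keeping every past token available, at the cost discussed earlier.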

Where it's heading

Mamba is no longer the frontier architecture — Gated DeltaNet-2 and xLSTM have moved ahead on benchmarks — but it remains the reference point the field measures against, and the Mamba lineage (Mamba-3, hybrid Mamba-Transformer) continues to ship in production. The trajectory suggests the SSM design space will keep fragmenting into specialized variants optimized for particular hardware, context lengths, and task types, with Mamba's core ideas — selective gating, linear recurrence, hardware-aware parallelism — persisting as foundational primitives even as the specific architecture evolves.

Mamba lineage and the subquadratic architecture landscape

Subquadratic sequence architectures: Mamba and its competitors

Architecture	Key mechanism	Notable result	Status
Mamba-2	Structured SSM, selective gating	Inference speed baseline matched by TTT-E2E; improved by dynamic short convolutions	Deployed; superseded at frontier
Mamba-3	SSM (CMU + Together.AI, 1.5B)	Improved accuracy over Mamba-2	Research release
Gated DeltaNet	Delta-rule linear attention	Outperformed by Gated DeltaNet-2	Superseded
Gated DeltaNet-2	Decoupled erase/write gates, chunkwise WY	Beats Mamba-2, Mamba-3, KDA at 1.3B on RULER & commonsense	Current SOTA (linear attention)
xLSTM	Extended LSTM gating	Outperforms Mamba-2 and Gated DeltaNet on code, distillation, time-series	Research frontier
Kimi Delta Attention (KDA)	Delta-rule variant	Generalized by Gated DeltaNet-2	Superseded

All results from the provided events bundle; unknown cells render —.

FAQ

Why does Mamba's linear-time recurrence matter for long sequences?

Standard attention compares every token with every other token, so compute scales quadratically with sequence length (as does memory when the full attention matrix is materialized), and doubling context roughly quadruples cost. Mamba's recurrent formulation keeps compute and memory constant per token, making very long sequences tractable without architectural tricks like sliding windows.
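The "doubling quadruples" claim is just counting. A minimal sketch, using token-pair interactions as a stand-in for attention cost and state updates as a stand-in for recurrent cost (both are simplifications that ignore constant factors and hidden dimensions):

```python
def attention_pair_count(num_tokens):
    """Token-to-token comparisons in full self-attention.

    Causal masking roughly halves this, but the growth stays quadratic.
    """
    return num_tokens * num_tokens

def recurrent_step_count(num_tokens):
    """State updates in a recurrent model: one per token."""
    return num_tokens

for num_tokens in (8_000, 16_000, 32_000):
    attn = attention_pair_count(num_tokens)
    rec = recurrent_step_count(num_tokens)
    print(f"{num_tokens:>6} tokens: {attn:>13,} attention pairs, {rec:>6,} recurrent steps")
```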

Has Mamba been deployed in real products?

Yes — Falcon Mamba (7B) from TII and Codestral Mamba (7.3B) from Mistral are publicly released models; NVIDIA's Nemotron 3 Nano 4B ships a hybrid Mamba-Transformer for on-device inference.

Is Mamba still state-of-the-art among subquadratic architectures?

No longer at the research frontier — Gated DeltaNet-2 and xLSTM both outperform Mamba-2 on recent benchmarks, though Mamba variants remain strong practical baselines and continue to improve (Mamba-3, dynamic short convolutions).

What is the difference between Mamba-2 and Mamba-3?

Mamba-3 (1.5B, from CMU and Together.AI) improves accuracy over Mamba-2; the sources summarized here do not detail its internal architectural changes beyond that framing.

Can Mamba be used outside language modeling?

Yes. MambaGaze applies bidirectional Mamba-2 to eye-gaze time-series for cognitive load assessment, running at 43–68 FPS on NVIDIA Jetson edge hardware. Mamba-2 is also improved by dynamic short convolutions in mixture-of-experts settings.

More on Mamba (6)

Mistral AI News · 19d ago

Codestral Mamba: Mistral AI Releases Apache 2.0 Mamba-Architecture Code Model

Mistral AI has released Codestral Mamba, a 7.3B-parameter code-focused language model built on the Mamba state-space architecture rather than the Transformer architecture. The model offers linear-time inference and theoretically infinite sequence length, tested up to 256k tokens in-context retrieval. Developed with Mamba co-creators Albert Gu and Tri Dao, it is released under Apache 2.0 and available via HuggingFace, mistral-inference SDK, TensorRT-LLM, and Mistral's la Plateforme API. Mistral positions it as a local code assistant that performs on par with state-of-the-art transformer-based code models.

Hugging Face Blog · 1mo ago

Falcon Mamba: First Strong Attention-Free 7B Model

Technology Innovation Institute (TII) releases Falcon Mamba, a 7B parameter state space model (SSM) based on the Mamba architecture, announced as the first attention-free model at this scale to match or exceed transformer-based models on standard benchmarks. The model is hosted on Hugging Face and represents a significant milestone for SSM-based architectures competing with transformers. This release advances the case for pure SSM models as viable alternatives to attention-based LLMs at the 7B scale.

arXiv · cs.LG · 9d ago

Comparative study finds xLSTM outperforms Mamba-2 and Gated DeltaNet on complex sequence tasks

A new arXiv paper compares three subquadratic sequence modeling architectures — xLSTM, Mamba-2, and Gated DeltaNet — across code model pre-training, LLM distillation, and time-series foundation model pre-training. xLSTM consistently delivers the strongest performance, which the authors attribute to more flexible and stable memory correction via its gating scheme. The paper provides a unified formulation and analysis of state tracking and memory dynamics across the three architectures, with corroborating results on synthetic length-generalization tasks.

arXiv · cs.AI · 29d ago

MambaGaze: Bidirectional Mamba with Explicit Missing Data Modeling for Cognitive Load Assessment from Eye-Gaze Tracking

MambaGaze is a framework for real-time cognitive load assessment from eye-tracking data, combining XMD encoding (observation masks and time-deltas for missing data) with bidirectional Mamba-2 for efficient long-range temporal modeling. Evaluated on CLARE and CL-Drive datasets under leave-one-subject-out protocol, it achieves 76.8% and 73.1% accuracy, outperforming CNN, Transformer, ResNet, and VGG baselines by 4-12 percentage points. Edge deployment on NVIDIA Jetson platforms achieves 43-68 FPS at under 7.5W, demonstrating feasibility for wearable and safety-critical applications such as driver vigilance monitoring.

arXiv · cs.AI · 23d ago

CaMBRAIN: Real-time, Continuous EEG Inference with Causal State Space Models

CaMBRAIN is a Mamba-based causal state space model designed for real-time, continuous inference on variable-length EEG signals, addressing quadratic scaling limitations of attention-based models. It introduces a multi-stage self-supervised training pipeline for long-range memory retention and achieves state-of-the-art results across three EEG datasets with over 10x throughput improvement.

arXiv · cs.AI · 29d ago

Gated DeltaNet-2: Decoupling Erase and Write Gates in Linear Attention

Gated DeltaNet-2 is a new linear attention architecture from NVIDIA Labs that separates the erase and write operations in the delta-rule update into independent channel-wise gates, generalizing both Gated DeltaNet and Kimi Delta Attention (KDA). The model introduces a chunkwise WY algorithm with channel-wise decay and a gate-aware backward pass for efficient parallel training. At 1.3B parameters trained on 100B FineWeb-Edu tokens, it outperforms Mamba-2, Gated DeltaNet, KDA, and Mamba-3 variants on language modeling, commonsense reasoning, and long-context RULER needle-in-a-haystack retrieval benchmarks. Code is publicly released via NVlabs on GitHub.

At a glance

Used in: Language modeling, code generation, time-series, eye-tracking / edge inference
Category: State space model (SSM) / subquadratic sequence architecture
Key idea: Selective structured state spaces: linear-time recurrence with input-dependent gating, no attention
Maturity: Production-deployed (Falcon Mamba, Codestral Mamba, Nemotron 3 Nano); actively superseded at research frontier
Alternatives: Transformers (full attention), Gated DeltaNet-2, xLSTM, Gated DeltaNet, Kimi Delta Attention (KDA)