Almanac
Concept guide · Beginner

Mamba: The Attention-Free Architecture That Scales Without Slowing Down

Mamba · Beginner · active · v1 · live · generated 6d ago

TL;DR: Mamba is a new kind of AI architecture that processes text and other sequences without the "attention" mechanism that powers most modern AI models. Because it doesn't need to compare every word to every other word, it stays fast even on very long inputs, and a growing ecosystem of real models and research variants shows it has moved well beyond a lab curiosity into practical use.

Key takeaways

  • Mamba is a state space model (SSM) — it compresses what it has read into a compact memory rather than re-examining the whole input every step.
  • Mistral's Codestral Mamba (7.3B parameters) runs linear-time inference and was tested on sequences up to 256,000 tokens long.
  • TII's Falcon Mamba was announced as the first attention-free 7B model to match transformer-based models on standard benchmarks.
  • Mamba-3, released by CMU and Together.AI, improved accuracy over Mamba-2 at 1.5B parameters.
  • Newer architectures like Gated DeltaNet-2 have since outperformed Mamba-3 on language and retrieval benchmarks, showing the SSM space is actively evolving.
  • Mamba has been applied beyond text — MambaGaze uses it for real-time eye-tracking analysis on low-power edge hardware.

What Mamba is — and why it matters

Most AI language models today are built on a design called the Transformer, which works by letting every word in a sentence "pay attention" to every other word. That's powerful, but it has a cost: the more text you feed in, the slower and more expensive it gets — roughly quadratically. Feed in twice as many words and the work roughly quadruples.

Mamba takes a different approach. It belongs to a family called state space models (SSMs) — think of it as an AI that reads text like a person skimming a book, keeping a compact mental summary as it goes rather than flipping back to re-read every page. Because it only maintains a fixed-size "state" at each step, processing time grows linearly with length: twice the text, roughly twice the work. That's a big deal for long documents, long conversations, or any task where context keeps accumulating.
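The scaling difference can be checked with back-of-the-envelope counting. The sketch below is an illustration only: it counts abstract units of work (one per token pair for attention, one per token for a fixed-state model) and ignores constants, model width, and hardware effects. The function names are hypothetical.

```python
def attention_work(n_tokens: int) -> int:
    """Pairwise comparisons when every token is compared with every token."""
    return n_tokens * n_tokens


def fixed_state_work(n_tokens: int) -> int:
    """One state update per token for a model with a fixed-size state."""
    return n_tokens


# Doubling the input quadruples the first count and doubles the second.
for n in (1_000, 2_000, 4_000):
    print(n, attention_work(n), fixed_state_work(n))
```

Real systems have large constant factors on both sides, so the crossover point where the linear model wins depends on the hardware and the model size.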

How it works (the plain version)

Imagine a conveyor belt of words passing by. A Transformer stops the belt and compares every item to every other item before deciding what to do next. Mamba instead keeps a small notepad — the "state" — and updates it as each word passes. It never looks back at the full belt; it just updates the notepad and moves on. The trick is making that notepad smart enough to remember what matters and forget what doesn't, which is what the "selective" part of Mamba's design handles.
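The notepad idea can be sketched as a small recurrence. This is a toy under stated assumptions, not Mamba's actual implementation: inputs are single numbers, only the step size depends on the input (Mamba also makes its write and read weights input-dependent), and the parameters `w_delta` and `b_delta` are hypothetical. It shows the two properties that matter here: the state has a fixed size, and how much is forgotten or written at each step depends on the current input.

```python
import math


def softplus(z: float) -> float:
    """Smooth, always-positive function used here to produce a step size."""
    return math.log1p(math.exp(z))


def selective_scan(xs, a, b, c, w_delta=1.0, b_delta=0.0):
    """Run a toy selective state update over a sequence of numbers.

    a: negative decay rates, one per state slot (the size of the notepad)
    b, c: fixed write and read weights, one per state slot
    w_delta, b_delta: hypothetical parameters mapping an input to a step size
    """
    h = [0.0] * len(a)  # fixed-size state: does not grow with len(xs)
    ys = []
    for x in xs:
        delta = softplus(w_delta * x + b_delta)  # input-dependent step size
        for i in range(len(h)):
            keep = math.exp(delta * a[i])  # between 0 and 1: share of old state kept
            h[i] = keep * h[i] + delta * b[i] * x  # forget a little, write a little
        ys.append(sum(ci * hi for ci, hi in zip(c, h)))  # read out from the state
    return ys, h


ys, h = selective_scan([0.5, -1.0, 2.0, 0.0], a=[-1.0, -0.1], b=[1.0, 1.0], c=[1.0, 0.5])
print(len(ys), len(h))
```

However long the input list is, `h` keeps the same two slots, which is why the per-token cost stays constant.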

From research to real models

Mamba has moved well past the whiteboard. A few milestones from the sources this guide draws on:

  • Codestral Mamba (Mistral AI, July 2024) is a 7.3-billion-parameter code assistant built entirely on the Mamba architecture. It runs with linear-time inference and was tested on sequences up to 256,000 tokens — far longer than most transformer-based models handle comfortably. It was developed with Mamba's co-creators and released under the permissive Apache 2.0 license.
  • Falcon Mamba (Technology Innovation Institute, August 2024) was announced as the first attention-free model at the 7-billion-parameter scale to match or beat transformer-based models on standard benchmarks, a milestone suggesting that SSMs can compete head-to-head in practice, not just in theory.
  • Mamba-3 (CMU and Together.AI, March 2026) pushed the architecture further, improving accuracy over Mamba-2 at 1.5 billion parameters.
  • Nemotron 3 Nano 4B (NVIDIA, March 2026) takes a hybrid approach — mixing Mamba and Transformer layers — for an on-device model designed to run efficiently on local hardware.

Where the competition stands

The field of subquadratic architectures (SSMs and their close relatives) is crowded and fast-moving. As of mid-2026, Gated DeltaNet-2 from NVIDIA Labs, a linear attention architecture that separates the "erase" and "write" steps of its memory update into independent gates, outperforms Mamba-2 and Mamba-3 variants at 1.3B parameters on language modeling, commonsense reasoning, and long-context retrieval benchmarks. A separate comparative study found that xLSTM consistently beats Mamba-2 and Gated DeltaNet across code-model pre-training, LLM distillation, and time-series foundation-model pre-training.
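To make the "erase" and "write" idea concrete, here is a toy associative memory updated with a delta-rule-style step. This is a simplified illustration, not the Gated DeltaNet-2 formulation: it uses one scalar erase strength and one scalar write strength, whereas the source describes channel-wise gates. The function names are hypothetical.

```python
def read(memory, key):
    """Return memory @ key: the value currently stored under this key."""
    return [sum(m * k for m, k in zip(row, key)) for row in memory]


def update(memory, key, value, erase, write):
    """Delta-rule-style update with separate erase and write strengths.

    memory: list of rows (value_dim x key_dim); key is assumed unit-length.
    erase:  share of the old content under `key` to remove (0 to 1)
    write:  share of the new `value` to store under `key` (0 to 1)
    With erase == write this reduces to the classic delta rule.
    """
    old = read(memory, key)
    return [
        [m - erase * old[i] * k + write * value[i] * k for m, k in zip(row, key)]
        for i, row in enumerate(memory)
    ]


empty = [[0.0, 0.0], [0.0, 0.0]]
key = [1.0, 0.0]
stored = update(empty, key, [3.0, 4.0], erase=1.0, write=1.0)

# Full erase plus full write replaces the old value cleanly.
replaced = update(stored, key, [5.0, 6.0], erase=1.0, write=1.0)
# No erase plus full write leaves the old value mixed in with the new one.
blended = update(stored, key, [5.0, 6.0], erase=0.0, write=1.0)
print(read(replaced, key), read(blended, key))
```

Tying the two strengths together forces the model to erase exactly as much as it writes. Letting them vary independently is the extra flexibility the paragraph above refers to.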

This isn't bad news for Mamba so much as a sign that the ideas it pioneered are being actively refined across the whole field. Mamba-style thinking — compact state, linear scaling — is now a standard ingredient that other architectures borrow from and build on.

Beyond language: edge AI and wearables

Mamba's efficiency makes it attractive wherever compute is tight. MambaGaze, a research framework for assessing cognitive load from eye-tracking data, uses a bidirectional Mamba-2 core and runs at 43–68 frames per second on NVIDIA Jetson edge hardware at under 7.5 watts — the kind of power budget relevant for wearable devices and driver-monitoring systems.
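The frame rates and power figure translate into per-frame budgets with simple arithmetic. The sketch below assumes the 7.5 W ceiling applies across the whole 43 to 68 FPS range, which the summary does not state explicitly, so the energy numbers are upper bounds and not measurements. The function names are hypothetical.

```python
def frame_latency_ms(fps: float) -> float:
    """Average time available per frame, in milliseconds."""
    return 1000.0 / fps


def max_energy_per_frame_j(power_w: float, fps: float) -> float:
    """Upper bound on joules per frame at a given power ceiling and frame rate."""
    return power_w / fps


for fps in (43, 68):
    print(fps, round(frame_latency_ms(fps), 1), round(max_energy_per_frame_j(7.5, fps), 3))
```

That works out to roughly 15 to 23 milliseconds per frame, and less than about 0.17 joules per frame at the slowest reported rate.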

The honest tradeoff

Mamba is fast and memory-efficient, but it isn't a free lunch. Research on long-context retrieval — tasks like finding a specific sentence buried in a 100,000-word document — shows SSMs can struggle compared to attention-based models. The compact state that makes Mamba efficient also means it can lose track of details it didn't judge important when it first read them. Hybrid architectures (mixing Mamba and attention layers) are one active response to this limitation.
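The tradeoff can be shown with a deliberately crude analogy. In the sketch below, a fixed-capacity notepad keeps only the facts it scored as most important when it read them, while a full-context store keeps everything. Real SSMs hold a blended numeric state, not a list of discrete facts, and the importance scores here are invented for illustration.

```python
import heapq


def fixed_notepad(facts, capacity):
    """Keep only the `capacity` facts judged most important at read time.

    facts: iterable of (key, value, importance) tuples, seen in order.
    Returns a dict acting as the compact memory.
    """
    kept = []  # min-heap of (importance, order, key, value)
    for order, (key, value, importance) in enumerate(facts):
        item = (importance, order, key, value)
        if len(kept) < capacity:
            heapq.heappush(kept, item)
        else:
            heapq.heappushpop(kept, item)  # push, then drop the least important
    return {key: value for _, _, key, value in kept}


def full_context(facts):
    """Keep every fact, like attention keeping the whole input available."""
    return {key: value for key, value, _ in facts}


facts = [
    ("topic", "state space models", 0.9),
    ("door_code", "4821", 0.1),  # looked unimportant when it was read
    ("deadline", "Friday", 0.8),
]
notepad = fixed_notepad(facts, capacity=2)
everything = full_context(facts)
print(notepad.get("door_code"), everything.get("door_code"))
```

The notepad's size never grows, but a later question about the door code cannot be answered from it. The full store answers it at the cost of memory that grows with the input.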

The bottom line

Mamba represents a genuine alternative to the Transformer for sequence modeling — one that trades some retrieval precision for significant speed and efficiency gains. It has real deployments, an active research community, and a growing family of descendants. Whether it or one of its successors eventually displaces attention-based models at the frontier is an open question, but the architecture has already changed what practitioners reach for when inference cost or context length is the binding constraint.

[Figure: Transformer vs. Mamba: how they handle a growing sequence]

Mamba and its SSM-family rivals at a glance

| Architecture | Key idea | Notable result | Status |
|---|---|---|---|
| Mamba-2 | Selective state space, structured matrices | Baseline for SSM comparisons | Widely used reference |
| Mamba-3 (CMU / Together.AI) | Improved SSM at 1.5B params | Better accuracy than Mamba-2 | Released Mar 2026 |
| Gated DeltaNet-2 (NVIDIA) | Separate erase/write gates in delta-rule | Outperforms Mamba-3 on language & retrieval | Released May 2026 |
| xLSTM | Flexible gating / memory correction | Beats Mamba-2 & Gated DeltaNet on complex tasks | Active research |
| Hybrid Mamba-Transformer | Mamba layers + attention layers | Nemotron 3 Nano 4B on-device model | In production |

All results are as reported by the sources this guide draws on.

Timeline

  1. July 2024: Codestral Mamba released, the first Mamba-based code model, with linear-time inference tested up to 256K tokens

  2. August 2024: Falcon Mamba announced as the first attention-free 7B model to match transformers on standard benchmarks

  3. March 2026: Mamba-3 released by CMU and Together.AI with improved accuracy over Mamba-2

  4. May 2026: Gated DeltaNet-2 outperforms Mamba-3 on language modeling and long-context retrieval

Related topics

Gated DeltaNet-2 · NVIDIA · Falcon Mamba · state space model · Kimi Delta Attention · FineWeb-Edu

FAQ

What's wrong with regular transformers that Mamba tries to fix?

Transformers use "attention," which compares every word to every other word in the input, so the cost grows quadratically as the text gets longer. Mamba sidesteps this by keeping a compact running summary of what it has read, so processing cost grows linearly with length instead.

Is Mamba actually used in real products, or just research?

Real products exist: Mistral's Codestral Mamba is a deployable code assistant, TII's Falcon Mamba is a 7B model on Hugging Face, and NVIDIA's Nemotron 3 Nano 4B is a hybrid Mamba-Transformer designed for on-device use.

Does Mamba handle very long documents well?

It is fast on long inputs, but retrieval benchmarks (like "needle in a haystack") reveal a weakness: finding a specific fact buried deep in a long context is harder for SSMs than for attention-based models. This remains an active research problem.

Is Mamba the best attention-free architecture now?

Not necessarily — the field is moving fast. As of mid-2026, Gated DeltaNet-2 from NVIDIA outperforms Mamba-3 on language and retrieval tasks, and xLSTM beats Mamba-2 on complex sequence tasks in head-to-head comparisons.

More on Mamba (6)

Mistral AI News · 19d ago

Codestral Mamba: Mistral AI Releases Apache 2.0 Mamba-Architecture Code Model

Mistral AI has released Codestral Mamba, a 7.3B-parameter code-focused language model built on the Mamba state-space architecture instead of the Transformer architecture. The model offers linear-time inference and, in theory, unbounded sequence length, with in-context retrieval tested up to 256k tokens. Developed with Mamba co-creators Albert Gu and Tri Dao, it is released under Apache 2.0 and available via HuggingFace, the mistral-inference SDK, TensorRT-LLM, and Mistral's la Plateforme API. Mistral positions it as a local code assistant that performs on par with state-of-the-art transformer-based code models.

Hugging Face Blog · 1mo ago

Falcon Mamba: First Strong Attention-Free 7B Model

Technology Innovation Institute (TII) releases Falcon Mamba, a 7B parameter state space model (SSM) based on the Mamba architecture, announced as the first attention-free model at this scale to match or exceed transformer-based models on standard benchmarks. The model is hosted on Hugging Face and represents a significant milestone for SSM-based architectures competing with transformers. This release advances the case for pure SSM models as viable alternatives to attention-based LLMs at the 7B scale.

arXiv · cs.LG · 9d ago

Comparative study finds xLSTM outperforms Mamba-2 and Gated DeltaNet on complex sequence tasks

A new arXiv paper compares three subquadratic sequence modeling architectures — xLSTM, Mamba-2, and Gated DeltaNet — across code model pre-training, LLM distillation, and time-series foundation model pre-training. xLSTM consistently delivers the strongest performance, which the authors attribute to more flexible and stable memory correction via its gating scheme. The paper provides a unified formulation and analysis of state tracking and memory dynamics across the three architectures, with corroborating results on synthetic length-generalization tasks.

arXiv · cs.AI · 29d ago

MambaGaze: Bidirectional Mamba with Explicit Missing Data Modeling for Cognitive Load Assessment from Eye-Gaze Tracking

MambaGaze is a framework for real-time cognitive load assessment from eye-tracking data, combining XMD encoding (observation masks and time-deltas for missing data) with bidirectional Mamba-2 for efficient long-range temporal modeling. Evaluated on CLARE and CL-Drive datasets under leave-one-subject-out protocol, it achieves 76.8% and 73.1% accuracy, outperforming CNN, Transformer, ResNet, and VGG baselines by 4-12 percentage points. Edge deployment on NVIDIA Jetson platforms achieves 43-68 FPS at under 7.5W, demonstrating feasibility for wearable and safety-critical applications such as driver vigilance monitoring.

arXiv · cs.AI · 23d ago

CaMBRAIN: Real-time, Continuous EEG Inference with Causal State Space Models

CaMBRAIN is a Mamba-based causal state space model designed for real-time, continuous inference on variable-length EEG signals, addressing quadratic scaling limitations of attention-based models. It introduces a multi-stage self-supervised training pipeline for long-range memory retention and achieves state-of-the-art results across three EEG datasets with over 10x throughput improvement.

arXiv · cs.AI · 29d ago

Gated DeltaNet-2: Decoupling Erase and Write Gates in Linear Attention

Gated DeltaNet-2 is a new linear attention architecture from NVIDIA Labs that separates the erase and write operations in the delta-rule update into independent channel-wise gates, generalizing both Gated DeltaNet and Kimi Delta Attention (KDA). The model introduces a chunkwise WY algorithm with channel-wise decay and a gate-aware backward pass for efficient parallel training. At 1.3B parameters trained on 100B FineWeb-Edu tokens, it outperforms Mamba-2, Gated DeltaNet, KDA, and Mamba-3 variants on language modeling, commonsense reasoning, and long-context RULER needle-in-a-haystack retrieval benchmarks. Code is publicly released via NVlabs on GitHub.