Almanac
Topic guide · Beginner

Long Context Evolution: From Bigger Windows to Smarter Memory

Long Context Evolution · Beginner · active · v3 · live · generated 6d ago
TL;DR: The race to give AI models longer memories began as a simple numbers game (who could read the most text at once), but it has grown into something more nuanced. As context windows crossed the million-token mark, the real questions shifted to whether models actually use what they read, how much it costs to serve those giant inputs, and whether clever agent architectures might sidestep the problem entirely.

Key takeaways

  • In 2023, Anthropic's Claude jumped from 9K to 100K tokens in a single step — a roughly 10x leap that made processing full codebases or book-length documents practical for the first time.
  • By early 2026, multiple frontier models from Anthropic, OpenAI, DeepSeek, and Alibaba's Qwen team all offer 1M-token context windows, effectively ending the raw-size race.
  • DeepSeek Sparse Attention (DSA) shows that smarter attention mechanisms, not just bigger windows, can cut both compute cost and memory use during long-context inference.
  • Research on Recursive Language Models (RLMs) demonstrated reasoning over documents up to 11 million tokens by having models spawn sub-agents to handle chunks — far beyond any native window.
  • Latent Context Language Models (LCLMs) compress long inputs into shorter representations at ratios up to 1:16, improving the cost-vs-accuracy tradeoff for long-horizon agent tasks.
  • Benchmarks like RULER and needle-in-a-haystack tests reveal that a large context window on the spec sheet does not guarantee a model will reliably use information buried in the middle of it.

What this topic covers

"Long context" refers to how much text an AI model can read and reason about in a single session — its working memory. This topic tracks how that limit has grown from a few thousand words to a million or more, what problems that growth has exposed, and the new techniques researchers and companies are building to make large windows actually useful and affordable.

Why it matters to you

Imagine handing a colleague a 500-page report and asking them to summarize it, find contradictions, or answer specific questions. If they can only read 10 pages at a time, they'll miss connections. AI models face the same constraint. A bigger context window means the model can process an entire codebase, a year's worth of emails, or a lengthy legal contract in one pass — without you having to pre-select which parts matter.

How we got here: the size race

The modern era of large language models began with GPT-3 in 2020, which worked with roughly 2,000 tokens (about 1,500 words) at a time. That was enough for a short document but nowhere near a full book or codebase.

The first dramatic leap came in May 2023, when Anthropic expanded Claude's context window from 9,000 to 100,000 tokens (roughly 75,000 words) in a single step. Anthropic positioned this as better than traditional "vector search" (a technique for finding relevant chunks before showing them to the model) for tasks that require synthesizing many documents at once. Partner AssemblyAI demonstrated it by feeding in a 58,000-word podcast transcript and getting answers in under a minute.

OpenAI followed in November 2023 with GPT-4 Turbo and its 128,000-token window. By 2024, 128K had become table stakes: Mistral Large 2, Qwen2, and others all offered it. Claude 3 Opus pushed to 200,000 tokens and claimed near-perfect scores on "needle-in-a-haystack" tests — where a specific fact is hidden deep in a long document and the model must find it.

Then, in early 2025 and into 2026, the million-token milestone arrived across the board. Alibaba's Qwen team released its first open-weight models at 1M tokens (Qwen2.5-7B-Instruct-1M and Qwen2.5-14B-Instruct-1M). Anthropic shipped Claude Opus 4.6 and Sonnet 4.6 with 1M-token windows in beta. OpenAI's GPT-5.4 matched it. DeepSeek's V4 models offered 1M context by default. The size race, for practical purposes, was over.

The problem nobody advertised: do models actually use it?

Hitting a million tokens on the spec sheet turned out to be easier than using those tokens well. Benchmarks designed to probe this — including needle-in-a-haystack tests and the RULER evaluation suite — revealed a consistent pattern: models tend to pay close attention to the beginning and end of a long input, but information buried in the middle often gets ignored or misremembered.

This "lost in the middle" problem reframed the goal. The question stopped being "how long a window can you offer?" and became "how reliably can you use what's in it?"
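A needle-in-a-haystack test is simple to sketch: hide one fact at different depths in a long filler text and check whether the model can still report it. The sketch below is only an illustration. The function names are made up for this example, and `toy_model` is a hypothetical stand-in for a real model call that deliberately reads only the first and last 20% of its input, so it mimics the "lost in the middle" pattern rather than measuring any real system.

```python
def build_haystack(filler, needle, n_sentences, depth):
    """Return n_sentences of filler with the needle inserted at a relative depth (0.0 to 1.0)."""
    sentences = [filler] * n_sentences
    position = int(depth * n_sentences)
    sentences.insert(position, needle)
    return " ".join(sentences)


def toy_model(prompt, question):
    """Hypothetical stand-in for a model: it only 'sees' the first and last 20% of the prompt."""
    edge = len(prompt) // 5
    visible = prompt[:edge] + prompt[-edge:]
    return "7421" if "7421" in visible else "unknown"


def run_needle_test(model, depths, n_sentences=1000):
    """Check, for each depth, whether the model's answer contains the hidden code."""
    needle = "The secret code is 7421."
    filler = "The sky was a pale shade of grey that morning."
    results = {}
    for depth in depths:
        prompt = build_haystack(filler, needle, n_sentences, depth)
        answer = model(prompt, "What is the secret code?")
        results[depth] = "7421" in answer
    return results


results = run_needle_test(toy_model, [0.0, 0.25, 0.5, 0.75, 1.0])
print(results)
# {0.0: True, 0.25: False, 0.5: False, 0.75: False, 1.0: True}
```

Swapping `toy_model` for a function that calls a real model turns this into a basic recall check across depths. Suites such as RULER go further than this single-fact lookup.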

Making long context cheaper: sparse attention and compression

Serving a million-token context is expensive. Every token in the window requires memory and computation at each generation step. Two families of techniques are attacking this cost:

Sparse attention selectively focuses on the most relevant parts of the context rather than attending to every token equally. DeepSeek's V3.2-Exp introduced DeepSeek Sparse Attention (DSA), a fine-grained mechanism that improves long-context performance while cutting compute costs — and accompanied it with a 50%+ API price reduction. NVIDIA Labs' Gated DeltaNet-2 takes a related approach with a linear attention architecture that outperforms several alternatives on RULER needle-in-a-haystack benchmarks.
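To make the idea concrete, here is a minimal sketch of one generic form of sparse attention: score every key, keep only the top k, and run softmax over those alone. This is a plain top-k illustration in pure Python, not DeepSeek's DSA or Gated DeltaNet-2, and all function names are invented for the example.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dense_attention(query, keys, values):
    """Standard attention for one query: every key gets some weight."""
    weights = softmax([dot(query, k) for k in keys])
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def topk_sparse_attention(query, keys, values, k):
    """Attend only to the k highest-scoring keys and ignore the rest."""
    scores = [dot(query, key) for key in keys]
    top = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    weights = softmax([scores[i] for i in top])
    dim = len(values[0])
    return [sum(w * values[i][d] for w, i in zip(weights, top)) for d in range(dim)]


query = [1.0, 0.0]
keys = [[5.0, 0.0], [0.0, 5.0], [-5.0, 0.0], [4.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.0, 2.0]]

sparse_out = topk_sparse_attention(query, keys, values, k=1)
print(sparse_out)  # [1.0, 0.0]: only the best-matching key contributes
```

This toy version still scores every key before discarding most of them, so it saves nothing on the scoring step. A practical sparse design needs a cheaper way to pick candidates, which the sketch does not model.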

KV-cache compression targets the memory bottleneck directly. Latent Context Language Models (LCLMs) compress long token sequences into shorter representations at ratios up to 1:16, then let a decoder work from those compact embeddings. Research shows LCLMs improve the tradeoff between accuracy and memory use, and can serve as efficient backbones for agents that need to skim compressed context and expand relevant sections on demand.
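The 1:16 ratio is easiest to picture as "every 16 token vectors become one". The sketch below uses simple mean pooling as a stand-in compressor. That is an assumption made purely for illustration: LCLMs use their own compression method, which this does not reproduce, and the function name is hypothetical.

```python
def compress_by_pooling(embeddings, ratio=16):
    """Average each run of `ratio` consecutive vectors into a single vector."""
    compressed = []
    for start in range(0, len(embeddings), ratio):
        block = embeddings[start:start + ratio]
        dim = len(block[0])
        compressed.append([sum(vec[d] for vec in block) / len(block) for d in range(dim)])
    return compressed


# 64 toy "token embeddings" with 2 dimensions each
tokens = [[float(i), 1.0] for i in range(64)]
latents = compress_by_pooling(tokens, ratio=16)

print(len(tokens), "->", len(latents))  # 64 -> 4
print(latents[0])                       # [7.5, 1.0]
```

At the same ratio, a 1,000,000-token input would shrink to 62,500 compressed positions, which is where the memory saving comes from.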

For video, a similar idea appears in AdaCodec, which sends full visual tokens only for key frames and encodes changes between frames as compact tokens — reducing the token budget by roughly 7x while matching or beating per-frame baselines on long-video benchmarks.

A different answer: agent architectures

Some researchers are asking whether you need a bigger window at all. The Recursive Language Models (RLM) framework from MIT researchers takes a different approach: instead of fitting everything into one context, a root model spawns sub-models to handle chunks of a task, then aggregates their outputs recursively. Tested on documents up to 11 million tokens, RLMs substantially outperformed both base models and retrieval-based approaches. A related pattern, the Recursive Agent Harness (RAH), extends this idea to coding agents, improving benchmark scores significantly over standard baselines.
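The recursive pattern described above can be sketched in a few lines: if the input fits within a size budget, handle it directly; otherwise split it, recurse on each half, and combine the answers. This is a toy illustration of the general idea, not the RLM or RAH implementation. `leaf_fn` and `combine_fn` are hypothetical placeholders for what would be model calls in a real system.

```python
def recursive_answer(text, leaf_fn, combine_fn, max_words=100):
    """Split text until each piece fits the budget, then merge the partial results."""
    words = text.split()
    if len(words) <= max_words:
        return leaf_fn(text)  # small enough: handle directly
    mid = len(words) // 2
    left = recursive_answer(" ".join(words[:mid]), leaf_fn, combine_fn, max_words)
    right = recursive_answer(" ".join(words[mid:]), leaf_fn, combine_fn, max_words)
    return combine_fn([left, right])


# Toy task: count the word "error" in a 5,000-word log, 100 words at a time.
document = ("ok " * 499 + "error ") * 10


def count_errors(chunk):
    return chunk.split().count("error")


total = recursive_answer(document, count_errors, sum, max_words=100)
print(total)  # 10
```

No single call ever sees more than 100 words, yet the final answer covers the whole document. Counting is easy to combine; tasks whose answers depend on connections between distant chunks are harder to merge and are where the aggregation step matters most.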

The leaked architecture of Anthropic's Claude Code — revealed accidentally via a source map file — showed a similar philosophy in production: a modular, OS-like agent system with a three-tier memory structure and multi-stage context compression, suggesting that even companies with 1M-token models are building layered memory systems on top of them.

Where things stand

The field has moved through three phases in roughly six years: a size race (who can offer the biggest window), a usability reckoning (do models actually use what's in the window), and now a cost-and-architecture phase (how do you serve long contexts cheaply, and when is an agent smarter than a bigger window?).

For most practical applications today, a 128K–200K context window is sufficient, and the engineering challenge is less about raw size than about reliable recall, cost management, and deciding when to use retrieval, native context, or agent decomposition. The million-token frontier is real, but the work of making it genuinely useful — and affordable — is still underway.

[Figure: The three phases of long-context evolution]

Context window milestones across major model families

Model / Release | Context window | Key technique or note
GPT-3 (2020) | ~2K tokens | Baseline for the modern era
Claude (May 2023) | 100K tokens | ~10x jump; positioned as superior to vector search for multi-doc synthesis
GPT-4 Turbo (Nov 2023) | 128K tokens | First major OpenAI long-context release
Mistral Large 2 / Qwen2 (2024) | 128K tokens | 128K becomes table stakes across open and closed models
Claude 3 Opus (2024) | 200K tokens | 99%+ needle-in-haystack accuracy claimed
Qwen2.5-1M / DeepSeek V4 / Claude Opus 4.6 / GPT-5.4 (2025–2026) | 1M tokens | Multiple labs converge; DeepSeek uses Sparse Attention (DSA)

Milestones drawn from the events bundle; cells reflect what events report.

Timeline

  1. GPT-3 published — sets the modern baseline for language model context

  2. Anthropic expands Claude from 9K to 100K tokens

  3. OpenAI releases GPT-4 Turbo with 128K context window

  4. Qwen releases its first open-weight models with 1M-token context

  5. Anthropic ships Claude Opus 4.6 and Sonnet 4.6 with 1M-token context in beta

  6. Latent Context Language Models (LCLMs) demonstrate KV-cache compression at 1:16 ratio

FAQ

What is a 'context window' and why does it matter?

A context window is the maximum amount of text an AI model can read and reason about in one go — think of it as the model's working memory. A bigger window means it can handle longer documents, bigger codebases, or longer conversations without forgetting earlier parts.

If a model has a 1M-token window, does it actually use all of it well?

Not always. Benchmarks like needle-in-a-haystack and RULER tests show that models often miss or ignore information placed in the middle of very long inputs, even when the window is technically large enough to hold it.

What's the difference between a big context window and retrieval (RAG)?

A big context window lets the model read everything at once; retrieval (RAG) pre-selects the most relevant chunks before the model sees anything. Each has tradeoffs — native long context is simpler but expensive and can miss mid-document details; retrieval is cheaper but depends on finding the right chunks upfront.
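A minimal sketch of the retrieval side of that tradeoff: split the document into chunks, score each chunk against the question, and pass only the best ones to the model. Word overlap is used here as a deliberately crude stand-in for the vector search a real RAG system would use, and every name in the snippet is hypothetical.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def chunk_text(text, chunk_words):
    """Split text into consecutive chunks of at most chunk_words words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_words]) for i in range(0, len(words), chunk_words)]


def retrieve(text, question, chunk_words=50, top_k=1):
    """Return the top_k chunks sharing the most words with the question."""
    q = tokenize(question)
    chunks = chunk_text(text, chunk_words)
    ranked = sorted(chunks, key=lambda ch: len(q & tokenize(ch)), reverse=True)
    return ranked[:top_k]


filler = "Lorem ipsum dolor sit amet " * 40
document = filler + "The contract renewal date is March 3. " + filler

best = retrieve(document, "What is the contract renewal date?")
print(len(document.split()), "words in the full document")  # 407 words in the full document
print(len(best[0].split()), "words sent to the model")      # 50 words sent to the model
```

The model would see about an eighth of the text here, which is the cost advantage. The risk is equally visible: if the scoring step ranks the wrong chunk first, the answer never reaches the model at all.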

What are recursive agent architectures, and how do they relate to context length?

Instead of stuffing everything into one giant window, recursive agents break a huge task into pieces and spawn sub-agents to handle each piece, then combine the results. Research shows this approach can handle documents up to 11 million tokens — far beyond any current native window.

Why does serving a long context cost more?

Every token in the context window requires memory and computation at each step of generation — costs scale roughly with window size. Techniques like sparse attention and KV-cache compression aim to cut that cost without shrinking the window.
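The memory side of that cost can be estimated with simple arithmetic. In a standard transformer that caches keys and values, each token stores one key vector and one value vector per layer. The sketch below applies that formula to a made-up model configuration (32 layers, 8 key/value heads, head dimension 128, 16-bit values); these numbers are assumptions for illustration and do not describe any model named in this guide.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Estimate KV-cache size: 2 vectors (key + value) per token, per layer, per KV head."""
    return 2 * num_tokens * num_layers * num_kv_heads * head_dim * bytes_per_value


GIB = 1024 ** 3

# Hypothetical model configuration, chosen only to make the arithmetic concrete
config = dict(num_layers=32, num_kv_heads=8, head_dim=128)

per_token = kv_cache_bytes(1, **config)
gib_128k = kv_cache_bytes(128_000, **config) / GIB
gib_1m = kv_cache_bytes(1_000_000, **config) / GIB

print(per_token)           # 131072 bytes (128 KiB) per token
print(round(gib_128k, 2))  # 15.62 GiB for a 128,000-token context
print(round(gib_1m, 2))    # 122.07 GiB for a 1,000,000-token context
```

Under these assumptions, memory grows in direct proportion to the number of tokens, which is why a million-token request costs far more to serve than a 128K one even on the same model.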

More on Long Context Evolution (6)

8 · Anthropic News · 17d ago

Anthropic expands Claude context window from 9K to 100K tokens

Anthropic announced a roughly 10x expansion of Claude's context window, from 9K to 100K tokens (~75,000 words), available via API. The capability enables processing of hundreds of pages of documents, full codebases, or hours of transcribed audio in under a minute. Anthropic positions this as superior to vector search for complex multi-document synthesis tasks, and partner AssemblyAI demonstrated the feature on a 58K-word podcast transcript.

7 · Qwen Research · 1mo ago

Qwen2.5-1M: Open-Source Models with 1M Token Context Window Released

Alibaba's Qwen team has released two open-source models, Qwen2.5-7B-Instruct-1M and Qwen2.5-14B-Instruct-1M, extending context length to 1 million tokens. This follows the earlier upgrade of the proprietary Qwen2.5-Turbo to 1M context two months prior. The release includes inference framework support for deployment, marking the first time Qwen's open-weight models have reached this context length.

7 · Qwen Research · 1mo ago

Qwen2.5-Turbo Extends Context Length to 1M Tokens

Alibaba's Qwen team has released Qwen2.5-Turbo, extending the model's context window from 128K to 1 million tokens (approximately 1 million English words). The update includes optimizations for both model capabilities and inference performance at extreme context lengths. The model is available via API and through HuggingFace and ModelScope demos.

7 · Qwen Research · 1mo ago

Generalizing an LLM from 8k to 1M Context using Qwen-Agent

Alibaba's Qwen team describes an agent built on Qwen2 (8k native context) that processes documents up to 1M tokens by decomposing retrieval and reasoning tasks, reportedly outperforming both RAG pipelines and native long-context models. The agent framework was also used to generate synthetic training data for fine-tuning new long-context Qwen models, creating a self-improvement loop. This positions agent-based context extension as a practical alternative to architectural long-context training.

9 · DeepSeek News · 1mo ago

DeepSeek V4 Preview Release: 1.6T-param Pro and 284B Flash Models with 1M Context, Open-Sourced

DeepSeek has released DeepSeek-V4 as an open-weights preview, comprising two MoE variants: V4-Pro (1.6T total / 49B active parameters) and V4-Flash (284B total / 13B active parameters). Both models support 1M token context by default, enabled by a novel Token-wise compression and DeepSeek Sparse Attention (DSA) architecture. V4-Pro claims open-source SOTA on agentic coding benchmarks and world-class math/STEM/coding performance rivaling top closed-source models, while V4-Flash offers near-parity reasoning at lower cost and latency. The API is live today with OpenAI and Anthropic compatibility, and legacy model endpoints will be retired in July 2026.

6 · The Batch · 19d ago

Test-Time Training End-to-End (TTT-E2E) Retrains Model Weights to Handle Long Inputs

Researchers from Astera Institute, Nvidia, Stanford, UC Berkeley, and UC San Diego introduced TTT-E2E, a method that compresses long context into transformer weights by training the model during inference via meta-learning. The approach uses sliding-window attention restricted to 8,000 tokens and updates only the fully connected layers of the last quarter of the network on each 1,000-token chunk at inference time, keeping per-token generation latency roughly constant as context scales to 128,000 tokens. TTT-E2E slightly outperforms vanilla transformers on next-token prediction loss across long contexts and matches efficient architectures like Mamba 2 and Gated DeltaNet on inference speed, but fails dramatically on Needle-in-a-Haystack retrieval beyond 8,000 tokens and incurs substantially higher training latency. The work reframes long-context handling as a training-inference trade-off rather than an architectural design problem.