What this topic covers
"Long context" refers to how much text an AI model can read and reason about in a single session — its working memory. This topic tracks how that limit has grown from a few thousand words to a million or more, what problems that growth has exposed, and the new techniques researchers and companies are building to make large windows actually useful and affordable.
Why it matters to you
Imagine handing a colleague a 500-page report and asking them to summarize it, find contradictions, or answer specific questions. If they can only read 10 pages at a time, they'll miss connections. AI models face the same constraint. A bigger context window means the model can process an entire codebase, a year's worth of emails, or a lengthy legal contract in one pass — without you having to pre-select which parts matter.
How we got here: the size race
The modern era of large language models began with GPT-3 in 2020, which worked with roughly 2,000 tokens (about 1,500 words) at a time. That was enough for a short document but nowhere near a full book or codebase.
The first dramatic leap came in May 2023, when Anthropic expanded Claude's context window from 9,000 to 100,000 tokens — roughly 75,000 words — in a single step. Anthropic positioned this as better than traditional "vector search" (a technique for finding relevant chunks before showing them to the model) for tasks that require synthesizing many documents at once. A partner demonstrated it by feeding a 58,000-word podcast transcript and getting answers in under a minute.
OpenAI followed in November 2023 with GPT-4 Turbo and its 128,000-token window. By 2024, 128K had become table stakes: Mistral Large 2, Qwen2, and others all offered it. Claude 3 Opus pushed to 200,000 tokens and claimed near-perfect scores on "needle-in-a-haystack" tests — where a specific fact is hidden deep in a long document and the model must find it.
Then, in early 2025 and into 2026, the million-token milestone arrived across the board. Alibaba's Qwen team released the first open-weight models at 1M tokens. Anthropic shipped Claude Opus 4.6 and Sonnet 4.6 with 1M-token windows in beta. OpenAI's GPT-5.4 matched it. DeepSeek's V4 models offered 1M context by default. The size race, for practical purposes, was over.
The problem nobody advertised: do models actually use it?
Hitting a million tokens on the spec sheet turned out to be easier than using those tokens well. Benchmarks designed to probe this — including needle-in-a-haystack tests and the RULER evaluation suite — revealed a consistent pattern: models tend to pay close attention to the beginning and end of a long input, but information buried in the middle often gets ignored or misremembered.
This "lost in the middle" problem reframed the goal. The question stopped being "how long a window can you offer?" and became "how reliably can you use what's in it?"
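To make the test concrete, here is a minimal sketch of how a needle-in-a-haystack probe can be built. It is an illustration only, not the RULER harness: the `build_probe` helper, the filler sentence, and the "needle" fact are all hypothetical, and a real evaluation would send each prompt to a model and score whether the answer contains the hidden fact.

```python
def build_probe(needle: str, filler: str, n_sentences: int, depth: float) -> str:
    """Return a long document with `needle` inserted at a relative depth.

    depth=0.0 places the needle at the start, 1.0 at the end, 0.5 in the middle.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0 and 1")
    sentences = [filler] * n_sentences
    sentences.insert(round(depth * n_sentences), needle)
    return " ".join(sentences)


NEEDLE = "The access code for the archive is 7491."  # hypothetical hidden fact
FILLER = "The committee met again and postponed the decision."

# One probe per depth; a model that is "lost in the middle" would be
# expected to miss the needle most often around depth 0.5.
depths = [0.0, 0.25, 0.5, 0.75, 1.0]
probes = {d: build_probe(NEEDLE, FILLER, 1000, d) for d in depths}

for d, doc in probes.items():
    offset = doc.index(NEEDLE) / len(doc)
    print(f"depth {d:.2f}: needle starts {offset:.0%} of the way into the document")
```

Sweeping both the depth and the document length gives the familiar grid of recall scores; the middle rows of that grid are where the weakness described above tends to show up.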
Making long context cheaper: sparse attention and compression
Serving a million-token context is expensive. Every token in the window requires memory and computation at each generation step. Two families of techniques are attacking this cost:
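A rough calculation shows where the memory goes. The sketch below estimates the size of the key-value (KV) cache, the per-token state a transformer keeps so it does not recompute attention inputs at every step. The model dimensions are hypothetical round numbers chosen for illustration, not the published configuration of any model named in this article.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Estimate KV-cache size for one sequence.

    The factor of 2 is one key vector plus one value vector,
    stored per token, per layer, per KV head.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


# Hypothetical model: 64 layers, 8 KV heads of width 128, 16-bit values.
cfg = dict(layers=64, kv_heads=8, head_dim=128, bytes_per_value=2)

for tokens in (4_000, 128_000, 1_000_000):
    gib = kv_cache_bytes(tokens, **cfg) / 2**30
    print(f"{tokens:>9,} tokens -> {gib:6.1f} GiB of KV cache")
```

Under these assumptions each token costs 256 KiB of cache, so a single million-token sequence needs about 244 GiB before counting the model weights. The cache grows linearly with the window, which is why the techniques below focus on either reading less of it or storing it more compactly.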
Sparse attention selectively focuses on the most relevant parts of the context rather than attending to every token equally. DeepSeek's V3.2-Exp introduced DeepSeek Sparse Attention (DSA), a fine-grained mechanism that cuts the compute cost of long-context training and inference while keeping output quality close to that of the dense model, and accompanied it with a 50%+ API price reduction. NVIDIA Labs' Gated DeltaNet-2 takes a related approach with a linear attention architecture that outperforms several alternatives on RULER needle-in-a-haystack benchmarks.
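The general idea of sparse attention can be sketched with a simple top-k rule: score every key, keep only the k best, and run softmax over those alone. This is a generic illustration in plain Python and should not be read as DSA or Gated DeltaNet-2, whose selection mechanisms are more sophisticated; the function names and the toy vectors are assumptions made for the example.

```python
import math


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def topk_sparse_attention(query, keys, values, k):
    """Attend to only the k highest-scoring keys for a single query.

    Returns the attention output and the sorted indices that were kept.
    """
    scale = math.sqrt(len(query))
    scores = [dot(query, key) / scale for key in keys]
    keep = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    weights = softmax([scores[i] for i in keep])
    dim = len(values[0])
    out = [sum(w * values[i][j] for w, i in zip(weights, keep)) for j in range(dim)]
    return out, sorted(keep)


# Toy context: six 2-d keys, of which only positions 0 and 3 align with the query.
query = [1.0, 0.0]
keys = [[4.0, 0.0], [0.0, 1.0], [0.1, 0.0], [3.5, 0.0], [-1.0, 0.0], [0.0, -2.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]

out, kept = topk_sparse_attention(query, keys, values, k=2)
print("kept positions:", kept)
print("output:", out)
```

Note that this sketch still computes a score for every key, so it only saves work in the softmax and value-mixing steps. Practical systems need a cheaper way to pick candidates, since scoring the full window is itself the expensive part at a million tokens.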
KV-cache compression targets the memory bottleneck directly. Latent Context Language Models (LCLMs) compress long token sequences into shorter representations at ratios up to 1:16, then let a decoder work from those compact embeddings. Research shows LCLMs improve the tradeoff between accuracy and memory use, and can serve as efficient backbones for agents that need to skim compressed context and expand relevant sections on demand.
For video, a similar idea appears in AdaCodec, which sends full visual tokens only for key frames and encodes changes between frames as compact tokens — reducing the token budget by roughly 7x while matching or beating per-frame baselines on long-video benchmarks.
A different answer: agent architectures
Some researchers are asking whether you need a bigger window at all. In the Recursive Language Models (RLM) framework from MIT researchers, a root model does not try to fit everything into one context; it spawns sub-models to handle chunks of a task, then aggregates their outputs recursively. Tested on documents up to 11 million tokens, RLMs substantially outperformed both base models and retrieval-based approaches. A related pattern, the Recursive Agent Harness (RAH), extends this idea to coding agents, improving benchmark scores significantly over standard baselines.
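The recursive pattern is easier to see in code. The sketch below is a deliberately simplified split-and-aggregate recursion, not the RLM or RAH implementation: `toy_model` is a hypothetical stand-in for a language model call (it just keeps lines that mention the question keyword), and `max_lines` stands in for the model's context limit.

```python
from typing import Callable, List

Model = Callable[[List[str], str], List[str]]


def recursive_answer(lines: List[str], question: str,
                     model: Model, max_lines: int) -> List[str]:
    """Answer a question over `lines` without ever giving `model`
    more than `max_lines` lines in one call."""
    if len(lines) <= max_lines:
        return model(lines, question)  # small enough: one direct call
    mid = len(lines) // 2
    # Each half is handled by its own sub-call, which may split again.
    notes = (recursive_answer(lines[:mid], question, model, max_lines)
             + recursive_answer(lines[mid:], question, model, max_lines))
    if len(notes) >= len(lines):
        raise RuntimeError("sub-answers did not shrink; recursion would not end")
    # Aggregate: treat the combined notes as a new, shorter document.
    return recursive_answer(notes, question, model, max_lines)


call_sizes = []  # records how many lines each model call received


def toy_model(lines: List[str], question: str) -> List[str]:
    # Hypothetical stand-in for an LLM: keep only lines mentioning the keyword.
    call_sizes.append(len(lines))
    return [ln for ln in lines if question in ln]


document = [f"entry {i}: nothing to report" for i in range(10_000)]
document[1234] = "entry 1234: ERROR disk full"
document[8765] = "entry 8765: ERROR timeout"

answer = recursive_answer(document, "ERROR", toy_model, max_lines=100)
print(answer)
print("largest single call:", max(call_sizes), "lines")
```

The document is a hundred times larger than the budget, yet no single call sees more than the budget allows. The trade-off is many more model calls, plus the risk that a sub-call discards something the final answer needed.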
The leaked architecture of Anthropic's Claude Code — revealed accidentally via a source map file — showed a similar philosophy in production: a modular, OS-like agent system with a three-tier memory structure and multi-stage context compression, suggesting that even companies with 1M-token models are building layered memory systems on top of them.
Where things stand
The field has moved through three phases in roughly six years: a size race (who can offer the biggest window), a usability reckoning (do models actually use what's in the window), and now a cost-and-architecture phase (how do you serve long contexts cheaply, and when is an agent smarter than a bigger window?).
For most practical applications today, a 128K–200K context window is sufficient, and the engineering challenge is less about raw size than about reliable recall, cost management, and deciding when to use retrieval, native context, or agent decomposition. The million-token frontier is real, but the work of making it genuinely useful — and affordable — is still underway.




