What this area covers
Long-context evolution is the multi-year effort to let language models read, reason over, and act on very large inputs — from a few thousand tokens in 2020 to a million or more by 2026 — and, increasingly, to make that capacity reliable, affordable, and architecturally sound. The thread spans raw window expansion, the benchmarks that exposed its limits, the compression and attention techniques that attacked its cost, and the agent-level patterns that may ultimately supersede it.
Why it matters for practitioners
Context length is a hard ceiling on what a model can do in a single pass: how much code it can refactor, how many documents it can synthesize, how long an autonomous agent can run before losing its working memory. Every gain here widens the feasible application surface. But the ceiling is only one constraint — serving cost, effective recall, and architectural complexity are equally binding in production.
Phase 1: The window expansion race (2020–2024)
GPT-3's roughly 2K-token context (2,048 tokens) was the 2020 baseline. The first meaningful break came in May 2023, when Anthropic expanded Claude's window roughly 11x, from 9K to 100K tokens, in a single step, explicitly framing it as a replacement for vector search on complex multi-document tasks. A partner demonstration processed a 58K-word podcast transcript in under a minute. Six months later, OpenAI's GPT-4 Turbo, announced at DevDay (November 2023), brought 128K context to a major commercial model at reduced pricing, setting the commercial template: bigger window, lower cost per token, developer-facing API.
The open-weight ecosystem followed. Mistral 7B (September 2023) introduced Sliding Window Attention (SWA) to handle longer sequences at reduced cost — an early architectural signal that raw attention over long sequences was expensive enough to warrant alternatives. Mistral Large (February 2024) reached 32K; Mistral Large 2 (July 2024) reached 128K. Qwen2 (June 2024) brought 128K to open-weight models at the 7B and 72B scales. Llama 3.1 (July 2024) extended Meta's open-weight frontier with long-context support across its 8B, 70B, and 405B variants.
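The idea behind Sliding Window Attention can be illustrated with a small attention mask. This is a minimal sketch, not Mistral's implementation: the function names are hypothetical, and it assumes each position attends to itself plus the `window - 1` positions before it, a convention this text does not specify.

```python
def sliding_window_mask(seq_len: int, window: int) -> list[list[bool]]:
    """Causal mask: position i may attend to j when j <= i and i - j < window.

    Assumption for illustration: the window counts the current token itself.
    """
    return [
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


def attended_pairs(mask: list[list[bool]]) -> int:
    """Number of (query, key) pairs that are actually computed."""
    return sum(sum(row) for row in mask)


# With window == seq_len the mask degenerates to ordinary causal attention.
full_causal = sliding_window_mask(seq_len=8, window=8)
windowed = sliding_window_mask(seq_len=8, window=3)

print(attended_pairs(full_causal), attended_pairs(windowed))
```

The count of attended pairs grows linearly with sequence length once the sequence exceeds the window, rather than quadratically, which is the cost saving the paragraph refers to.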
Phase 2: The 1M milestone and commoditization (2025–2026)
The 1M-token threshold arrived first in open-weight form: Alibaba's Qwen2.5-1M (January 2025) released 7B and 14B instruct models at that scale. By early 2026, the milestone had become a multi-lab commodity. GPT-5.4 (March 5, 2026) and Claude Opus 4.6 / Sonnet 4.6 (March 11, 2026) shipped 1M-token windows within days of each other. DeepSeek V4-Pro and V4-Flash both support 1M context by default. Qwen3-Coder (480B MoE, July 2025) supports 256K natively with 1M via extrapolation. Grok 4.3 shipped a 1M-token window at reduced pricing. The window number had ceased to be a differentiator.
Claude Opus 4.6 introduced context compaction — a mechanism for gracefully handling tasks that exceed even the 1M-token window by compressing earlier context — signaling that the practical problem is not just fitting data in, but managing what happens when you run out of room mid-task.
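The general shape of compaction can be sketched in a few lines. This is an illustrative assumption, not Anthropic's mechanism: `count_tokens`, `compact`, and `toy_summarize` are hypothetical names, token counting is approximated by whitespace splitting, and the summarizer is a stub where a real system would call a model.

```python
from typing import Callable


def count_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: count whitespace-separated words.
    return len(text.split())


def compact(
    history: list[str],
    budget: int,
    keep_recent: int,
    summarize: Callable[[list[str]], str],
) -> list[str]:
    """Replace older messages with one summary once the history exceeds the budget."""
    total = sum(count_tokens(m) for m in history)
    cut = len(history) - keep_recent
    if total <= budget or cut <= 0:
        return history  # still fits, or nothing old enough to compress
    older, recent = history[:cut], history[cut:]
    return [summarize(older)] + recent


def toy_summarize(messages: list[str]) -> str:
    # Hypothetical summarizer; a real one would be a model call.
    return f"[summary of {len(messages)} earlier messages]"


history = ["alpha beta gamma"] * 5  # 15 "tokens" in total
print(compact(history, budget=10, keep_recent=2, summarize=toy_summarize))
```

The design choice worth noting is that recent messages are kept verbatim while older ones are summarized, so the most immediately relevant working state is not degraded by compression.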
Phase 3: The efficiency and recall problem
Raw window size was never the whole story. Two structural problems emerged as windows grew:
Effective recall. Models do not attend uniformly across long inputs. Information in the middle of a long context is systematically underweighted, a pattern that benchmarks such as needle-in-a-haystack (NIAH) and RULER were designed to expose. Anthropic advertised 99%+ NIAH accuracy at 200K context for Claude 3 Opus as a selling point, precisely because recall quality at long range was not guaranteed. Gated DeltaNet-2 (NVIDIA, May 2026), a linear attention architecture, reports RULER NIAH retrieval results specifically to demonstrate its long-context recall advantage over prior linear attention variants.
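A needle-in-a-haystack probe is simple to construct, which is part of why it became a standard check. The sketch below is a simplified illustration with hypothetical helper names (`build_niah_prompt`, `recalled`); real suites such as RULER use their own task definitions and scoring, which are not reproduced here.

```python
def build_niah_prompt(filler: list[str], needle: str, depth: float) -> str:
    """Insert `needle` into `filler` at a relative depth (0.0 = start, 1.0 = end)."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    position = int(depth * len(filler))
    return " ".join(filler[:position] + [needle] + filler[position:])


def recalled(model_answer: str, secret: str) -> bool:
    # Simplest possible scoring: does the answer contain the planted fact?
    return secret in model_answer


filler = ["filler."] * 10
for depth in (0.0, 0.5, 1.0):
    prompt = build_niah_prompt(filler, "NEEDLE", depth)
    # A real probe would send `prompt` plus a question to the model under test.
    print(depth, prompt.split().index("NEEDLE"))
```

Sweeping `depth` and the filler length is what exposes the middle-of-context weakness described above: accuracy is recorded per (length, depth) cell rather than as a single number.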
Serving cost. Full attention over a 1M-token window costs compute that grows quadratically with sequence length and KV-cache memory that grows linearly. Two architectural responses stand out:
- DeepSeek Sparse Attention (DSA): Introduced in DeepSeek V3.2-Exp, DSA is a fine-grained sparse attention mechanism that reduces compute during training and inference for long sequences. V3.2-Exp matches V3.1-Terminus on benchmarks while enabling a simultaneous API price cut of more than 50%. DeepSeek V4 extends this, pairing token-wise compression with DSA as its default long-context architecture.
- Latent Context Language Models (LCLMs): A June 2026 research result demonstrates encoder-decoder compressors that map long token sequences to shorter latent embeddings at 1:4, 1:8, and 1:16 compression ratios, improving the Pareto frontier across general-task performance, compression speed, and peak memory. LCLMs are demonstrated as backbones for long-horizon agents that can skim compressed context and expand relevant segments on demand.
For video, AdaCodec (June 2026) applies the same principle to multimodal inputs: encoding only reference frames as full visual tokens and inter-frame changes as compact P-tokens, achieving 1/7 the token budget of a per-frame baseline while outperforming it on long-video benchmarks.
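To make the serving-cost scaling and the compression ratios above concrete, here is a back-of-the-envelope sketch. The model shape is hypothetical and chosen only for illustration, the FLOP count ignores causal masking and everything outside the attention score and value products, and none of the numbers describe a specific model named in this text.

```python
from dataclasses import dataclass


@dataclass
class ModelShape:
    """Hypothetical dense-transformer shape; values are illustrative only."""
    layers: int = 64
    d_model: int = 8192
    kv_heads: int = 8
    head_dim: int = 128
    bytes_per_value: int = 2  # fp16 / bf16


def attention_flops(n_tokens: int, m: ModelShape) -> int:
    # Per layer: QK^T costs about 2*n^2*d, and weighting V costs another 2*n^2*d.
    return 4 * n_tokens**2 * m.d_model * m.layers


def kv_cache_bytes(n_tokens: int, m: ModelShape) -> int:
    # One key vector and one value vector per token, per layer, per KV head.
    return 2 * m.layers * m.kv_heads * m.head_dim * n_tokens * m.bytes_per_value


def compressed_length(n_tokens: int, ratio: int) -> int:
    # Sequence length after 1:ratio compression, rounded up.
    return -(-n_tokens // ratio)


shape = ModelShape()
for n in (128_000, 1_000_000):
    print(n, attention_flops(n, shape), kv_cache_bytes(n, shape) / 1e9, "GB")
for ratio in (4, 8, 16):
    print(f"1:{ratio}", compressed_length(1_000_000, ratio))
```

Because attention compute scales with the square of the sequence length, a 1:16 reduction in sequence length cuts the quadratic term by a factor of 256 under these assumptions, which is why compression and sparsity are attractive even when the window itself is already large enough.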
Phase 4: Agent architectures as an alternative
The most structurally significant development is the emergence of agent-level patterns that treat the context window as a resource to be managed programmatically rather than simply expanded.
Recursive Language Models (RLMs) (MIT, March 2026) offload long-context processing to an external Python REPL, where a root model spawns submodel instances over chunked text and aggregates their outputs recursively. Evaluated on documents up to 11 million tokens — far beyond any native window — RLM-GPT-5 achieved 91.3% on BrowseComp+ where the base model could not produce an answer, and ~50% accuracy on OOLONG-PAIRS at 1 million tokens versus near-zero for baseline approaches.
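The recursive chunk-and-aggregate idea can be sketched as follows. This is a deliberate simplification and not the RLM authors' code: in the described system the root model writes its own REPL code, whereas here the recursion is fixed. `recursive_answer`, `call_model`, and `toy_model` are hypothetical names, lengths are measured in characters rather than tokens, and the model is a stub.

```python
from typing import Callable

ModelFn = Callable[[str, str], str]  # (context, query) -> answer; hypothetical interface


def recursive_answer(
    text: str, query: str, call_model: ModelFn, max_chars: int, fanout: int = 4
) -> str:
    """Answer `query` over `text`, recursing whenever the text exceeds the window."""
    if fanout < 2 or max_chars < 1:
        raise ValueError("need fanout >= 2 and max_chars >= 1 so chunks shrink")
    if len(text) <= max_chars:
        return call_model(text, query)  # base case: fits in one call
    size = -(-len(text) // fanout)  # ceiling division
    # Note: naive splitting can cut a relevant span in two; real systems chunk
    # on structural boundaries or with overlap.
    chunks = [text[i:i + size] for i in range(0, len(text), size)]
    partials = [
        recursive_answer(chunk, query, call_model, max_chars, fanout)
        for chunk in chunks
    ]
    combined = "\n".join(p for p in partials if p)
    if len(combined) >= len(text):
        # Sub-answers did not shrink the input; truncate instead of looping forever.
        return call_model(combined[:max_chars], query)
    return recursive_answer(combined, query, call_model, max_chars, fanout)


def toy_model(context: str, query: str) -> str:
    # Stub "model": reports FOUND if the query string, or an earlier FOUND, is visible.
    return "FOUND" if (query in context or "FOUND" in context) else ""


document = "a" * 100 + "needle" + "a" * 1900
print(recursive_answer(document, "needle", toy_model, max_chars=600))
```

The point of the pattern is that no single call ever sees more than `max_chars` of input, so the reachable document size is bounded by recursion depth and cost rather than by the native window.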
Recursive Agent Harnesses (RAH) (June 2026) formalize a related production pattern: a parent agent generates executable scripts that spawn parallel subagent harnesses with filesystem tools, code execution, and planning capabilities. On the Oolong-Synthetic long-context benchmark, RAH improved over the Codex baseline from 71.75% to 81.36% with GPT-5 as backbone, and reached 89.77% with Claude Sonnet 4.5.
The leaked Claude Code architecture (April 2026) corroborates this direction from the production side: the system uses a three-tier memory structure and multi-stage context compression, with an unreleased background agent (Kairos) featuring a dedicated memory-pruning subsystem (autoDream).
The open-weight dimension
Long-context capability has tracked closely with open-weight model progress. Mistral's trajectory, from Sliding Window Attention in the 7B (2023) to 128K in Mistral Large 2 (2024) to 256K in Devstral 2 and Mistral Medium 3.5 (2025–2026), illustrates how open-weight labs have kept pace with closed-model window expansion while adding architectural innovations (configurable reasoning effort, MoE efficiency) that affect long-context serving economics. DeepSeek's open-weight releases of V3.2-Exp and V4 bring DSA-based sparse attention into the open ecosystem, with full GPU kernel code included.
Where the frontier sits
The 1M-token window is now table stakes. The active competition is on three axes:
1. Effective recall at scale: whether models actually use what they can see, measured by benchmarks like RULER and task-specific evaluations like BrowseComp and OOLONG.
2. Serving economics: sparse attention, KV-cache compression, and latent context models attacking the quadratic cost of full attention.
3. Agent orchestration: recursive and hierarchical agent patterns that extend effective context to tens of millions of tokens by managing retrieval and synthesis programmatically, potentially making the native window size a secondary concern for the hardest long-context tasks.




