Almanac
Topic guide · In-depth

Long Context Evolution: From Window Size to Usable, Affordable Memory

Long Context Evolution · In-depth · active · v4 · live · generated 6d ago
TL;DR: The long-context race began as a contest over raw token counts (a few thousand in 2020, then tens of thousands, then a million), but the binding problems shifted well before the numbers stopped climbing. Once the 1M-token window became a commodity feature across closed and open-weight models alike, the real competition moved to whether models actually use what they can see, how much it costs to serve those windows, and whether architectural alternatives like sparse attention, KV-cache compression, or recursive agent patterns can sidestep the problem entirely.

Key takeaways

  • GPT-4 Turbo (Nov 2023) was the first major model to ship a 128K context window at reduced pricing, setting the commercial template for long-context expansion.
  • Claude's context window grew roughly 11x in a single step in May 2023 — from 9K to 100K tokens — with Anthropic explicitly positioning it as superior to vector search for multi-document synthesis.
  • By early 2026, the 1M-token window had become a multi-lab commodity: Claude Opus 4.6, Claude Sonnet 4.6, GPT-5.4, DeepSeek V4, Qwen2.5-1M, and Qwen3-Coder all ship it.
  • DeepSeek's Sparse Attention (DSA) architecture, introduced in V3.2-Exp, targets long-context efficiency directly — achieving parity with V3.1-Terminus while cutting API prices by more than 50%.
  • Latent Context Language Models (LCLMs) demonstrate encoder-decoder KV-cache compression at 1:4 to 1:16 ratios, improving the accuracy-efficiency Pareto frontier for long-horizon agents.
  • Recursive Agent Harnesses (RAH) and Recursive Language Models (RLMs) show that agent-level orchestration — spawning parallel subagents over chunked context — can outperform native long-context on benchmarks reaching 11 million tokens.

What this area covers

Long-context evolution is the multi-year effort to let language models read, reason over, and act on very large inputs — from a few thousand tokens in 2020 to a million or more by 2026 — and, increasingly, to make that capacity reliable, affordable, and architecturally sound. The thread spans raw window expansion, the benchmarks that exposed its limits, the compression and attention techniques that attacked its cost, and the agent-level patterns that may ultimately supersede it.

Why it matters for practitioners

Context length is a hard ceiling on what a model can do in a single pass: how much code it can refactor, how many documents it can synthesize, how long an autonomous agent can run before losing its working memory. Every gain here widens the feasible application surface. But the ceiling is only one constraint — serving cost, effective recall, and architectural complexity are equally binding in production.

Phase 1: The window expansion race (2020–2023)

GPT-3's ~4K context was the baseline. The first meaningful break came in May 2023, when Anthropic expanded Claude's window roughly 11x in a single step, from 9K to 100K tokens, and explicitly positioned it as superior to vector search for complex multi-document synthesis. Partner AssemblyAI demonstrated the feature by processing a 58K-word podcast transcript in under a minute. Six months later, OpenAI announced GPT-4 Turbo at DevDay (November 2023), bringing 128K context to a major commercial model at reduced pricing and setting the commercial template: bigger window, lower cost per token, developer-facing API.

The open-weight ecosystem followed. Mistral 7B (September 2023) introduced Sliding Window Attention (SWA) to handle longer sequences at reduced cost — an early architectural signal that raw attention over long sequences was expensive enough to warrant alternatives. Mistral Large (February 2024) reached 32K; Mistral Large 2 (July 2024) reached 128K. Qwen2 (June 2024) brought 128K to open-weight models at the 7B and 72B scales. Llama 3.1 (July 2024) extended Meta's open-weight frontier with long-context support across its 8B, 70B, and 405B variants.
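The Sliding Window Attention idea mentioned above can be illustrated with a small mask builder. This is a minimal sketch of the general technique, not Mistral's implementation: the function names and the window size of 3 are illustrative assumptions.

```python
def sliding_window_mask(seq_len: int, window: int) -> list[list[bool]]:
    """mask[i][j] is True when query position i may attend to key position j."""
    # Causal (j <= i) and local (at most `window` positions back, including i itself).
    return [
        [0 <= i - j < window for j in range(seq_len)]
        for i in range(seq_len)
    ]


def attended_pairs(mask: list[list[bool]]) -> int:
    """Count the query-key pairs that attention actually has to score."""
    return sum(sum(row) for row in mask)


# Illustrative sizes only (assumption): 8 tokens, window of 3.
mask = sliding_window_mask(seq_len=8, window=3)
# A window at least as long as the sequence reduces to ordinary causal attention.
full = sliding_window_mask(seq_len=8, window=8)

print(attended_pairs(mask), "pairs with the window vs", attended_pairs(full), "without")
```

With a fixed window, the number of scored pairs grows linearly with sequence length instead of quadratically, which is the cost saving the paragraph refers to.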

Phase 2: The 1M milestone and commoditization (2025–2026)

The 1M-token threshold reached open-weight models in January 2025, when Alibaba's Qwen2.5-1M released 7B and 14B instruct models at that scale, two months after the proprietary Qwen2.5-Turbo had been extended to 1M. By early 2026, the milestone had become a multi-lab commodity. GPT-5.4 (March 5, 2026) and Claude Opus 4.6 / Sonnet 4.6 (March 11, 2026) shipped 1M-token windows within days of each other. DeepSeek V4-Pro and V4-Flash both support 1M context by default. Qwen3-Coder (480B MoE, July 2025) supports 256K natively and 1M via extrapolation. Grok 4.3 shipped a 1M-token window at reduced pricing. The window number had ceased to be a differentiator.

Claude Opus 4.6 introduced context compaction — a mechanism for gracefully handling tasks that exceed even the 1M-token window by compressing earlier context — signaling that the practical problem is not just fitting data in, but managing what happens when you run out of room mid-task.
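The general idea of compaction, folding older turns into a summary once a budget is exceeded, can be sketched as follows. This is not Anthropic's mechanism, whose details are not described here. The whitespace token count, the `compact` and `toy_summarize` helpers, and the budget values are all assumptions made for illustration.

```python
from typing import Callable


def count_tokens(text: str) -> int:
    # Assumption: whitespace-separated words stand in for real tokenizer counts.
    return len(text.split())


def compact(history: list[str], budget: int, keep_recent: int,
            summarize: Callable[[list[str]], str]) -> list[str]:
    """Fold the oldest turns into one summary when the history exceeds the budget."""
    total = sum(count_tokens(turn) for turn in history)
    split = len(history) - keep_recent
    if total <= budget or split <= 0:
        return history
    old, recent = history[:split], history[split:]
    return [summarize(old)] + recent


def toy_summarize(turns: list[str]) -> str:
    # Hypothetical stand-in for a model call: keep the first three words of each turn.
    return "SUMMARY: " + " | ".join(" ".join(t.split()[:3]) for t in turns)


history = [
    "user asked to refactor the billing module into smaller services",
    "assistant listed the five files that import the billing module",
    "user approved the plan and asked for tests first",
    "assistant wrote tests for invoice rounding",
]
compacted = compact(history, budget=20, keep_recent=2, summarize=toy_summarize)
print(compacted)
```

A single pass is not guaranteed to land under the budget. A real system would need to decide whether to compact again, summarize more aggressively, or drop content.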

Phase 3: The efficiency and recall problem

Raw window size was never the whole story. Two structural problems emerged as windows grew:

Effective recall. Models do not attend uniformly across long inputs. Information in the middle of a long context is systematically underweighted, a pattern that benchmarks such as needle-in-a-haystack (NIAH) and RULER were designed to expose. Anthropic advertised 99%+ NIAH accuracy at 200K context for Claude 3 Opus as a selling point, precisely because recall quality at long range was not guaranteed. Gated DeltaNet-2 (NVIDIA, May 2026), a linear attention architecture, is evaluated specifically on RULER NIAH retrieval to demonstrate its recall advantage over prior linear attention variants.
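A needle-in-a-haystack probe is simple to sketch: hide one fact at a chosen depth in filler text, ask for it, and record whether the answer comes back. The harness below is a toy version under stated assumptions. The filler sentence, the "magic number" needle, and `forgetful_model` (a stand-in that only sees the start and end of its prompt) are hypothetical and not taken from any published suite.

```python
def build_haystack(filler: str, needle: str, n_sentences: int, depth: float) -> str:
    """Place `needle` at a relative depth (0.0 = start, 1.0 = end) among filler sentences."""
    sentences = [filler] * n_sentences
    sentences.insert(round(depth * n_sentences), needle)
    return " ".join(sentences)


def recall_by_depth(answer_fn, secret: str, depths, n_sentences: int = 200) -> dict:
    """Record whether `answer_fn(prompt)` returns text containing `secret` at each depth."""
    needle = f"The magic number is {secret}."
    results = {}
    for depth in depths:
        prompt = build_haystack("The sky was grey that day.", needle, n_sentences, depth)
        prompt += " What is the magic number?"
        results[depth] = secret in answer_fn(prompt)
    return results


def forgetful_model(prompt: str) -> str:
    # Hypothetical stand-in for a model: it only "sees" the first and last 1,000 characters.
    return prompt[:1000] + prompt[-1000:]


scores = recall_by_depth(forgetful_model, "7421", depths=[0.0, 0.5, 1.0])
print(scores)
```

Swapping `forgetful_model` for a real model call and sweeping both depth and total length gives the familiar recall grid. The stand-in here fails only at the middle depth, which mimics the lost-in-the-middle pattern.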

Serving cost. Full attention over a 1M-token window costs compute that grows quadratically with sequence length and KV-cache memory that grows linearly. Two architectural responses stand out:

  • DeepSeek Sparse Attention (DSA): Introduced in DeepSeek V3.2-Exp, DSA is a fine-grained sparse attention mechanism that reduces compute during training and inference for long sequences. V3.2-Exp matches V3.1-Terminus on benchmarks while enabling a simultaneous 50%+ API price cut. DeepSeek V4 extends this with Token-wise compression alongside DSA as its default long-context architecture.
  • Latent Context Language Models (LCLMs): A June 2026 research result demonstrates encoder-decoder compressors that map long token sequences to shorter latent embeddings at 1:4, 1:8, and 1:16 compression ratios, improving the Pareto frontier across general-task performance, compression speed, and peak memory. LCLMs are demonstrated as backbones for long-horizon agents that can skim compressed context and expand relevant segments on demand.
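The scaling behind these two responses can be made concrete with back-of-the-envelope arithmetic. The sketch below assumes a hypothetical model shape (32 layers, 8 KV heads, head dimension 128, 2-byte values). It also treats 1:4 to 1:16 compression as a proportional reduction in cached sequence length, which is a simplification for illustration rather than a description of how LCLMs work internally.

```python
def kv_cache_bytes(seq_len: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    # Two tensors (keys and values) per layer, one vector per token per KV head.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


def attention_score_pairs(seq_len: int) -> int:
    # Causal full attention scores each token against itself and every earlier token.
    return seq_len * (seq_len + 1) // 2


# Hypothetical model shape, chosen only for illustration (assumption).
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128)

full_bytes = kv_cache_bytes(1_000_000, **shape)
compressed_bytes = {
    ratio: kv_cache_bytes(1_000_000 // ratio, **shape) for ratio in (4, 8, 16)
}
for ratio, size in compressed_bytes.items():
    print(f"1:{ratio} -> {size / 2**30:.1f} GiB of {full_bytes / 2**30:.1f} GiB")

# Going from 128K to 1M tokens multiplies the scored pairs by roughly 61x, not 8x.
growth = attention_score_pairs(1_000_000) / attention_score_pairs(128_000)
print(f"score pairs grow {growth:.0f}x")
```

Memory shrinks in direct proportion to the compression ratio, while the attention work saved grows with the square of the length reduction. That is why both sparse attention and context compression target sequence length.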

For video, AdaCodec (June 2026) applies the same principle to multimodal inputs: encoding only reference frames as full visual tokens and inter-frame changes as compact P-tokens, achieving 1/7 the token budget of a per-frame baseline while outperforming it on long-video benchmarks.

Phase 4: Agent architectures as an alternative

The most structurally significant development is the emergence of agent-level patterns that treat the context window as a resource to be managed programmatically rather than simply expanded.

Recursive Language Models (RLMs) (MIT, March 2026) offload long-context processing to an external Python REPL, where a root model spawns submodel instances over chunked text and aggregates their outputs recursively. Evaluated on documents up to 11 million tokens — far beyond any native window — RLM-GPT-5 achieved 91.3% on BrowseComp+ where the base model could not produce an answer, and ~50% accuracy on OOLONG-PAIRS at 1 million tokens versus near-zero for baseline approaches.
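The recursive pattern described above can be sketched in a few lines. This is a simplified illustration, not the MIT system. In an RLM the root model decides how to split and query the text from inside a REPL, whereas here the chunking is fixed and `toy_llm` is a hypothetical stand-in that filters lines instead of calling a model.

```python
from typing import Callable

# Assumption: an "LLM call" takes some lines plus a question and returns the relevant lines.
LLM = Callable[[list[str], str], list[str]]


def recursive_answer(lines: list[str], question: str, llm: LLM,
                     max_lines: int, depth: int = 0, max_depth: int = 8) -> list[str]:
    """Answer over `lines` without ever passing more than `max_lines` to one call,
    unless the depth guard trips first."""
    if len(lines) <= max_lines or depth >= max_depth:
        return llm(lines, question)
    chunks = [lines[i:i + max_lines] for i in range(0, len(lines), max_lines)]
    partials: list[str] = []
    for chunk in chunks:
        partials.extend(recursive_answer(chunk, question, llm, max_lines, depth + 1, max_depth))
    # The partial results become the input of another, possibly recursive, pass.
    return recursive_answer(partials, question, llm, max_lines, depth + 1, max_depth)


def toy_llm(lines: list[str], question: str) -> list[str]:
    # Hypothetical stand-in for a submodel: keep lines that mention the keyword.
    return [line for line in lines if question in line]


document = [f"line {i}: ok" for i in range(1000)]
document[137] = "line 137: ERROR disk full"
document[842] = "line 842: ERROR timeout"

print(recursive_answer(document, "ERROR", toy_llm, max_lines=50))
```

The depth guard matters: if the submodel does not shrink its input, the aggregation step would otherwise recurse without end.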

Recursive Agent Harnesses (RAH) (June 2026) formalize a related production pattern: a parent agent generates executable scripts that spawn parallel subagent harnesses with filesystem tools, code execution, and planning capabilities. On the Oolong-Synthetic long-context benchmark, RAH improved over the Codex baseline from 71.75% to 81.36% with GPT-5 as backbone, and reached 89.77% with Claude Sonnet 4.5.
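The kind of script a parent agent might generate for parallel fan-out can be approximated with the standard library. This is a sketch under assumptions: `run_subagents` and `count_errors` are hypothetical names, and each worker is a plain function where a real harness would launch a model with filesystem and code-execution tools.

```python
from concurrent.futures import ThreadPoolExecutor


def run_subagents(chunks, worker, max_workers: int = 4) -> list:
    """Run `worker` over every chunk in parallel and return results in chunk order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of the inputs, not completion order.
        return list(pool.map(worker, chunks))


def count_errors(chunk: list[str]) -> int:
    # Hypothetical subagent task; a real harness would invoke a model with tools here.
    return sum("ERROR" in line for line in chunk)


log = [f"req {i} {'ERROR' if i % 10 == 0 else 'ok'}" for i in range(1000)]
chunks = [log[i:i + 100] for i in range(0, len(log), 100)]

per_chunk = run_subagents(chunks, count_errors)
total_errors = sum(per_chunk)
print(per_chunk, total_errors)
```

Threads suit this shape because subagent calls are typically I/O-bound network requests. The parent only has to hold the small per-chunk results, not the full input.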

The leaked Claude Code architecture (April 2026) corroborates this direction from the production side: the system uses a three-tier memory structure and multi-stage context compression, with an unreleased background agent (Kairos) featuring a dedicated memory-pruning subsystem (autoDream).

The open-weight dimension

Long-context capability has tracked closely with open-weight model progress. Mistral's trajectory — from Sliding Window Attention in the 7B (2023) to 128K in Mistral Large 2 (2024) to 256K in Devstral 2 and Mistral Medium 3.5 (2025–2026) — illustrates how open-weight labs have kept pace with closed-model window expansion while adding architectural innovations (configurable reasoning effort, MoE efficiency) that affect long-context serving economics. DeepSeek's open-weights releases of V3.2-Exp and V4 bring DSA-based sparse attention into the open ecosystem with full GPU kernel code.

Where the frontier sits

The 1M-token window is now table stakes. The active competition is on three axes:

  1. Effective recall at scale: whether models actually use what they can see, measured by benchmarks like RULER and task-specific evaluations like BrowseComp and OOLONG.
  2. Serving economics: sparse attention, KV-cache compression, and latent context models attacking the quadratic cost of full attention.
  3. Agent orchestration: recursive and hierarchical agent patterns that extend effective context to tens of millions of tokens by managing retrieval and synthesis programmatically, potentially making the native window size a secondary concern for the hardest long-context tasks.

Long-context evolution: from window expansion to architectural alternatives

Context window milestones across major models

| Model / Release | Context Window | Key mechanism / note | Date |
| --- | --- | --- | --- |
| GPT-3 (OpenAI) | ~4K tokens | Baseline autoregressive LLM | 2020-05 |
| Claude (Anthropic) | 100K tokens | ~11x jump from 9K; positioned vs. vector search | 2023-05 |
| GPT-4 Turbo (OpenAI) | 128K tokens | Reduced pricing; first major 128K commercial model | 2023-11 |
| Mistral Large 2 | 128K tokens | 123B params; multilingual + code | 2024-07 |
| Qwen2 (7B/72B instruct) | 128K tokens | Open-weight; 27 additional languages | 2024-06 |
| Qwen2.5-1M (open-weight) | 1M tokens | First open-weight Qwen at 1M context | 2025-01 |
| Claude Opus 4.6 / Sonnet 4.6 | 1M tokens (beta) | Context compaction for long-running tasks | 2026-03 |
| GPT-5.4 (OpenAI) | 1M tokens | Coding + computer use focus | 2026-03 |
| DeepSeek V4 (Pro & Flash) | 1M tokens (default) | Token-wise compression + DSA architecture | — |

Dates from event published_at; unknown cells render —.

Timeline

  1. GPT-3 establishes the autoregressive LLM baseline (~4K context)

  2. Claude expands from 9K to 100K tokens — first major long-context leap

  3. GPT-4 Turbo ships 128K context at reduced pricing

  4. Qwen2.5-1M: first open-weight models at 1M context

  5. GPT-5.4 ships 1M context; 1M window becomes multi-lab commodity

  6. LCLMs demonstrate competitive KV-cache compression at 1:4–1:16 ratios

FAQ

Is a 1M-token context window actually usable end-to-end?

Not uniformly — models have demonstrated recall failures on information buried in the middle of long inputs (the 'lost-in-the-middle' problem), which is why techniques like context compaction (Anthropic) and sparse attention (DeepSeek DSA) exist alongside raw window expansion.

What is the difference between native long-context and retrieval-augmented generation (RAG)?

Native long-context loads the full document set into the model's attention window; RAG retrieves only relevant chunks via vector search before inference. Anthropic's 2023 100K announcement explicitly positioned native long-context as superior for complex multi-document synthesis, but cost and recall quality remain active tradeoffs.
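The difference between the two prompt-building strategies can be shown side by side. This is a toy sketch: bag-of-words cosine similarity stands in for a real embedding model and vector index, and the three documents and the `native_prompt` / `rag_prompt` helpers are made up for illustration.

```python
import math
from collections import Counter


def bag_of_words(text: str) -> Counter:
    # Assumption: word counts stand in for a learned embedding vector.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def native_prompt(docs: list[str], question: str) -> str:
    # Native long context: every document goes into the window.
    return "\n".join(docs) + "\n" + question


def rag_prompt(docs: list[str], question: str, k: int = 1) -> str:
    # RAG: rank documents against the question and keep only the top k.
    q = bag_of_words(question)
    ranked = sorted(docs, key=lambda d: cosine(bag_of_words(d), q), reverse=True)
    return "\n".join(ranked[:k]) + "\n" + question


docs = [
    "invoices are rounded to two decimals",
    "the cache expires after ten minutes",
    "deploys run every friday afternoon",
]
question = "when does the cache expire"

print(len(native_prompt(docs, question)), "chars native vs",
      len(rag_prompt(docs, question)), "chars with retrieval")
```

The retrieval prompt is shorter and cheaper, but it can only answer from what the ranking step surfaced. The native prompt pays for every token and leaves the selection to the model's attention.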

What are Recursive Language Models and why do they matter for long context?

RLMs (MIT, 2026) offload long-context processing to an external Python environment where a root model spawns submodel instances over chunked text, achieving results on documents up to 11 million tokens — far beyond any native window — with RLM-GPT-5 reaching 91.3% on BrowseComp+ where the base model could not produce an answer.

How does DeepSeek Sparse Attention (DSA) address long-context cost?

DSA is a fine-grained sparse attention mechanism introduced in DeepSeek V3.2-Exp that reduces compute during both training and inference for long sequences, achieving parity with V3.1-Terminus while enabling a simultaneous 50%+ API price cut.
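One generic form of sparse attention keeps only the highest-scoring keys for each query. The sketch below shows that top-k variant for a single query in plain Python. It is an illustration of the general idea, not DSA itself: the selection rule, the function name, and the toy vectors are assumptions. Because it scores every key before selecting, it demonstrates the sparsified softmax rather than an actual compute saving.

```python
import math


def topk_sparse_attention(query, keys, values, k):
    """Attend from one query to only the k highest-scoring keys."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * x for q, x in zip(query, key)) for key in keys]
    keep = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    # Softmax over the kept scores only; every other key gets zero weight.
    top = max(scores[i] for i in keep)
    exps = {i: math.exp(scores[i] - top) for i in keep}
    total = sum(exps.values())
    dim = len(values[0])
    output = [sum(exps[i] / total * values[i][d] for i in keep) for d in range(dim)]
    return output, sorted(keep)


# Toy vectors (assumption): keys 0 and 2 align with the query, keys 1 and 3 do not.
query = [1.0, 0.0]
keys = [[5.0, 0.0], [0.0, 5.0], [4.0, 0.0], [-3.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]

output, kept = topk_sparse_attention(query, keys, values, k=2)
print(kept, output)
```

Setting `k` to the number of keys recovers ordinary dense attention, so the sparsity level is a dial between cost and fidelity.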

When did the 1M-token window stop being a differentiator?

By March 2026, when Claude Opus 4.6, Claude Sonnet 4.6, and GPT-5.4 all shipped 1M-token windows within days of each other, and DeepSeek V4 and Qwen2.5-1M had already reached that mark — making it a baseline expectation rather than a competitive moat.

More on Long Context Evolution (6)

8 · Anthropic News · 17d ago

Anthropic expands Claude context window from 9K to 100K tokens

Anthropic announced a roughly 10x expansion of Claude's context window, from 9K to 100K tokens (~75,000 words), available via API. The capability enables processing of hundreds of pages of documents, full codebases, or hours of transcribed audio in under a minute. Anthropic positions this as superior to vector search for complex multi-document synthesis tasks, and partner AssemblyAI demonstrated the feature on a 58K-word podcast transcript.

7 · Qwen Research · 1mo ago

Qwen2.5-1M: Open-Source Models with 1M Token Context Window Released

Alibaba's Qwen team has released two open-source models, Qwen2.5-7B-Instruct-1M and Qwen2.5-14B-Instruct-1M, extending context length to 1 million tokens. This follows the earlier upgrade of the proprietary Qwen2.5-Turbo to 1M context two months prior. The release includes inference framework support for deployment, marking the first time Qwen's open-weight models have reached this context length.

7 · Qwen Research · 1mo ago

Qwen2.5-Turbo Extends Context Length to 1M Tokens

Alibaba's Qwen team has released Qwen2.5-Turbo, extending the model's context window from 128K to 1 million tokens (approximately 1 million English words). The update includes optimizations for both model capabilities and inference performance at extreme context lengths. The model is available via API and through HuggingFace and ModelScope demos.

7 · Qwen Research · 1mo ago

Generalizing an LLM from 8k to 1M Context using Qwen-Agent

Alibaba's Qwen team describes an agent built on Qwen2 (8k native context) that processes documents up to 1M tokens by decomposing retrieval and reasoning tasks, reportedly outperforming both RAG pipelines and native long-context models. The agent framework was also used to generate synthetic training data for fine-tuning new long-context Qwen models, creating a self-improvement loop. This positions agent-based context extension as a practical alternative to architectural long-context training.

9 · Deepseek News · 1mo ago

DeepSeek V4 Preview Release: 1.6T-param Pro and 284B Flash Models with 1M Context, Open-Sourced

DeepSeek has released DeepSeek-V4 as an open-weights preview, comprising two MoE variants: V4-Pro (1.6T total / 49B active parameters) and V4-Flash (284B total / 13B active parameters). Both models support 1M token context by default, enabled by a novel Token-wise compression and DeepSeek Sparse Attention (DSA) architecture. V4-Pro claims open-source SOTA on agentic coding benchmarks and world-class math/STEM/coding performance rivaling top closed-source models, while V4-Flash offers near-parity reasoning at lower cost and latency. The API is live today with OpenAI and Anthropic compatibility, and legacy model endpoints will be retired in July 2026.

6 · The Batch · 19d ago

Test-Time Training End-to-End (TTT-E2E) Retrains Model Weights to Handle Long Inputs

Researchers from Astera Institute, Nvidia, Stanford, UC Berkeley, and UC San Diego introduced TTT-E2E, a method that compresses long context into transformer weights by training the model during inference via meta-learning. The approach uses sliding-window attention restricted to 8,000 tokens and updates only the fully connected layers of the last quarter of the network on each 1,000-token chunk at inference time, keeping per-token generation latency roughly constant as context scales to 128,000 tokens. TTT-E2E slightly outperforms vanilla transformers on next-token prediction loss across long contexts and matches efficient architectures like Mamba 2 and Gated DeltaNet on inference speed, but fails dramatically on Needle-in-a-Haystack retrieval beyond 8,000 tokens and incurs substantially higher training latency. The work reframes long-context handling as a training-inference trade-off rather than an architectural design problem.