Step 8 of 10 in Evaluation and Benchmarking in Modern AINext: DeepSeek V4 →

Guide · In-depth

Claude Opus 4.6: Anthropic's Long-Context Agentic Frontier Model

Claude Opus 4.6In-depthactive·v2 · live·generated 6d ago

Part of these paths

AI Safety Research · Step 3 of 6
Enterprise Deployment Patterns · Step 5 of 12
Evaluation and Benchmarking · Step 8 of 10
Frontier Model Releases · Step 7 of 10
Long Context Evolution · Step 6 of 10
Multimodal Progress · Step 7 of 7
The reasoning-model era · Step 7 of 7

TL;DRClaude Opus 4.6 is the model that pushed Anthropic's Opus line into long-horizon agentic territory — pairing a 1M-token context window with adaptive reasoning and multi-agent orchestration. It established a new benchmark ceiling at its release, demonstrated real-world offensive security capability against Firefox, and served as the foundation for a cascade of successor models and safety-tiered deployments that followed.

Key takeaways

1M-token context window (beta) with developer-controlled adaptive thinking effort and context compaction for tasks that exceed even that limit.
Claimed top scores on Terminal-Bench 2.0, Humanity's Last Exam, GDPval-AA (+144 Elo over GPT-5.2), and BrowseComp at launch.
Pricing held at $5/$25 per million input/output tokens — same as its predecessor Claude Opus 4.5.
In a two-week Mozilla partnership, Opus 4.6 identified 22 Firefox vulnerabilities, 14 classified high-severity, scanning ~6,000 C++ files and filing 112 unique reports.
Topped the AutoLab benchmark of 36 ultra long-horizon research and engineering tasks, where persistence under wall-clock budgets — not initial attempt quality — was the dominant success predictor.
Directly preceded Claude Mythos Preview, which substantially outperformed Opus 4.6 and triggered the Project Glasswing cybersecurity consortium.

What it is

Claude Opus 4.6 is Anthropic's flagship large language model released in March 2026, succeeding Claude Opus 4.5 in the Opus line. Its defining additions are a 1M-token context window (in beta), adaptive thinking with developer-controlled effort levels, and a suite of agentic features — agent teams in Claude Code, context compaction for tasks that overflow even the extended window — designed to make long-horizon, multi-step autonomous work practical rather than theoretical.

Benchmark position at launch

At release, Opus 4.6 claimed first place on Terminal-Bench 2.0, Humanity's Last Exam, GDPval-AA (by 144 Elo over GPT-5.2), and BrowseComp. Its lineage matters for context: Claude Opus 4 had established 72.5% on SWE-bench and 43.2% on Terminal-bench, and Claude Opus 4.5 had nearly saturated CyberGym — the internal security benchmark that prompted Anthropic to test against harder real-world targets. Opus 4.6 extended those gains while holding pricing flat at $5/$25 per million input/output tokens.

GPT-5.4, released two days after Opus 4.6, subsequently leapfrogged it on most benchmarks, and Claude Mythos Preview — published without commercial availability — substantially outperformed Opus 4.6 across CyberGym (83.1%), Terminal-Bench 2.0 (82%), GPQA Diamond (94.5%), and HLE (64.7%). Opus 4.6's benchmark lead was therefore short-lived in absolute terms, though it remained the strongest commercially available Anthropic model for several weeks.

Architecture and capabilities

The events bundle does not disclose internal architecture. Externally observable capabilities include:

1M-token context window (beta) with context compaction for graceful overflow handling
Adaptive thinking: developer-controlled effort levels that trade latency and compute against reasoning depth per call
Agent teams in Claude Code: coordinated multi-agent orchestration for large-scale engineering tasks
Parallel tool execution and local file access for persistent memory across sessions (inherited from the Opus 4 line)

The AutoLab benchmark — 36 expert-curated tasks across system optimization, puzzle-solving, model development, and CUDA kernel optimization, evaluated under wall-clock budgets — found Opus 4.6 the strongest performer across 17 frontier models. The benchmark's key finding is that persistence in iterative feedback loops, not initial attempt quality, predicts success; Opus 4.6 stood out precisely on that dimension.

Real-world security capability

The most consequential capability demonstration came from a two-week partnership with Mozilla in February 2026. Claude Opus 4.6 scanned approximately 6,000 C++ files in the Firefox codebase, submitted 112 unique vulnerability reports, and identified 22 vulnerabilities — 14 of which Mozilla classified as high-severity. That figure represented nearly a fifth of all high-severity Firefox vulnerabilities remediated in 2025. The collaboration also included evaluation of Claude's ability to write primitive exploits, probing the upper limits of AI-enabled offensive security.

This real-world result directly informed Anthropic's decision to publish a model card for Claude Mythos Preview without commercial release, and to assemble Project Glasswing — a consortium of 40+ organizations including AWS, Apple, Google, Microsoft, and CrowdStrike, funded with $100M in API credits — to patch vulnerabilities proactively before Mythos-class capabilities became widely available.

Ecosystem integration

Opus 4.6 is distributed via the Anthropic API, Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry. Claude Code — generally available since the Opus 4 launch — uses Opus 4.6 as its primary model for autonomous coding sessions, with integrations across GitHub Actions, VS Code, and JetBrains. Claude Code Security, released in limited research preview in February 2026, is explicitly built on Opus 4.6 and found over 500 previously undetected vulnerabilities in production open-source codebases during internal research.

The compute infrastructure underpinning Opus 4.6's deployment includes Anthropic's 10-year, $100B+ Amazon Bedrock commitment (up to 5 GW on Trainium chips), a $30B Azure compute deal with Microsoft, and access to NVIDIA Grace Blackwell and Vera Rubin systems — all of which also support rate-limit expansions for Claude Code and API users.

Safety posture

Opus 4.6 sits within Anthropic's ASL-3 deployment framework, first activated for Claude Opus 4. ASL-3 measures include Constitutional Classifiers blocking end-to-end CBRN weapon development workflows and enhanced model-weight security. Separately, research using Opus 4.6 as the untrusted model in AI control experiments (BashArena benchmark) found that resampling strategies — drawing five samples per step and auditing on maximum suspicion score — raised safety from 61% to 71% at a 0.3% audit budget, with retrying shown to backfire by leaking monitor rationale to the model.

Successor trajectory

Opus 4.6 was followed by Opus 4.7 (which added Project Glasswing cybersecurity safeguards and a Cyber Verification Program for legitimate security professionals), Opus 4.8 (which improved uncertainty flagging and introduced dynamic parallel subagent workflows), and eventually the Mythos-class models. Each successor was explicitly positioned relative to Opus 4.6 as a baseline — making it the reference point for the capability escalation that defined Anthropic's mid-2026 model roadmap.

Claude Opus 4.6: capability lineage and deployment footprint

Opus 4.6 in the Claude Opus lineage and against key rivals

Model	Context window	Key benchmark result	Pricing (input/output per M tokens)	Notable
Claude Opus 4	200K	72.5% SWE-bench, 43.2% Terminal-bench	$15 / $75	Hybrid thinking, parallel tools; first ASL-3 deployment
Claude Opus 4.5	200K	Near-saturated CyberGym; best-in-class coding at launch	$5 / $25	65% token efficiency gain; computer use
Claude Opus 4.6	1M (beta)	SOTA Terminal-Bench 2.0, HLE, GDPval-AA (+144 Elo vs GPT-5.2), BrowseComp	$5 / $25	Adaptive effort, agent teams, context compaction
Claude Opus 4.7	—	Leads Vals AI Finance Agent benchmark at 64.37%	$5 / $25	First model with Project Glasswing cybersecurity safeguards
GPT-5.2	—	Trailed Opus 4.6 by 144 Elo on GDPval-AA	—	—
GPT-5.4	1.05M	Leapfrogged Opus 4.6 on most benchmarks post-release	$30 / $180 (Pro)	Native computer use, tool search

All figures from the events bundle; unknown cells render —. GPT-5.4 released after Opus 4.6 and is included for competitive context.

Timeline

FAQ

How does Opus 4.6 handle inputs longer than 1M tokens?

Context compaction is built into the release — the model compresses earlier context to sustain long-running agentic tasks that would otherwise overflow even the 1M-token window.

What is adaptive thinking and how does a developer control it?

Adaptive thinking lets the model scale its reasoning effort per query; developers set the effort level via API, trading latency and cost against depth of reasoning on a per-call basis.

Is Opus 4.6 still the top Anthropic model?

No — it was succeeded by Opus 4.7 (which added cybersecurity safeguards) and then Opus 4.8, and sits below the Mythos-class models that Anthropic published a model card for without commercial release.

What made the Mozilla Firefox collaboration significant?

It was a real-world demonstration of Opus 4.6's offensive security capability: 22 vulnerabilities found in two weeks across ~6,000 C++ files, with 14 rated high-severity — nearly a fifth of all high-severity Firefox vulnerabilities remediated in 2025.

Where can Opus 4.6 be accessed?

Via the Anthropic API, Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry, as well as Claude Code for autonomous coding workflows.

Stay current

Call Me Almanac pairs the week's AI news with guides like this one — Midweek & Sunday.

Versions

v2live6d ago
v1rejected16d ago

Related guides (4)

Claude Opus 4.6

Claude Opus 4.6: Anthropic's Milestone Model for Long-Context and Agentic Work

Read asBeginner

Claude

Claude: Anthropic's AI Assistant Built for Safety and Scale

Read asBeginner In-depth

Claude Code

Claude Code: Anthropic's Autonomous Coding Agent

Read asBeginner In-depthfeatured

GPT-5.5

GPT-5.5: OpenAI's Benchmark-Leading Agentic Model with a Hallucination Problem

Read asIn-depth

More on Claude Opus 4.6 (6)

8Anthropic News·1mo ago·source ↗

Anthropic Releases Claude Opus 4.7 with Enhanced Coding, Vision, and Cyber Safeguards

Anthropic has released Claude Opus 4.7, a general-availability model positioned as a meaningful improvement over Opus 4.6 in advanced software engineering, long-horizon agentic tasks, and vision capabilities including higher image resolution. The model is notably the first to receive new cybersecurity safeguards developed in response to Project Glasswing, with automatic detection and blocking of prohibited cyber uses and a new Cyber Verification Program for legitimate security professionals. Opus 4.7 is available across Claude products, API, Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry at the same pricing as Opus 4.6 ($5/$25 per million input/output tokens). The release is explicitly positioned below Claude Mythos Preview in overall capability, serving as a testbed for safety mechanisms before broader deployment of Mythos-class models.

Frontier Model Releases Evaluation and Benchmarking Harvey Solve Intelligence Amazon Bedrock +16 more

8Hacker News·23d ago·source ↗

Claude Opus 4.8 Released by Anthropic

Anthropic has released Claude Opus 4.8, a new frontier model in their Claude lineup. The announcement appeared on Anthropic's official news page and generated significant community engagement on Hacker News with over 1,000 points and 800+ comments. Specific capability details and benchmarks are not available from the source snippet alone.

Frontier Model Releases Evaluation and Benchmarking claude.ai Claude Opus 4.6 Databricks +16 more

5Don'T Worry About The Vase·22d ago·source ↗

Claude Opus 4.8: The System Card — Commentary

Zvi Mowshowitz publishes commentary on Claude Opus 4.8, released approximately six weeks after Opus 4.7. The piece appears to analyze the model's system card, suggesting a rapid iteration cadence from Anthropic. As a tier-2 commentary source, this provides analytical perspective on the release rather than primary documentation.

Frontier Model Releases AI Safety Research Claude Opus 4.6 Zvi Mowshowitz Anthropic

7The Batch·19d ago·source ↗

Claude Opus 4.8 Launches with Improved Honesty; Anthropic Previews Mythos-Class Models and Dynamic Workflows

Anthropic released Claude Opus 4.8 with improvements in coding, reasoning, agentic tasks, and notably better uncertainty flagging—approximately four times less likely than Opus 4.7 to let code flaws pass uncommented. Alongside the model, Anthropic introduced dynamic workflows in Claude Code enabling tens to hundreds of parallel subagents for large-scale engineering tasks, an effort-control slider, and a 3x price cut on fast mode. Anthropic also previewed Mythos-class models, positioned above Opus in capability, currently available to a limited set of organizations for cybersecurity work pending broader safety clearance. The same digest covers MiniMax M3 (open-weights, ~60% SWE-Bench Pro), Nvidia's RTX Spark superchip, Cosmos 3 world model, and a GR00T/Unitree robotics partnership.

Frontier Model Releases Evaluation and Benchmarking Unitree Harvey Claude Mythos +16 more

9Anthropic News·19d ago·source ↗

Claude Opus 4.6 Released with 1M Token Context, Agentic Coding Advances, and State-of-the-Art Benchmarks

Anthropic has released Claude Opus 4.6, its most capable model to date, featuring a 1M token context window in beta, improved agentic coding and planning capabilities, and adaptive thinking with developer-controlled effort levels. The model claims top scores on Terminal-Bench 2.0, Humanity's Last Exam, GDPval-AA, and BrowseComp, outperforming OpenAI's GPT-5.2 by 144 Elo points on GDPval-AA. New product features include agent teams in Claude Code, context compaction for long-running tasks, and Claude in PowerPoint (research preview). Pricing remains unchanged at $5/$25 per million input/output tokens.

Long Context Evolution Frontier Model Releases GPT-5.2 Claude Opus 4.6 adaptive thinking +13 more

9Anthropic News·19d ago·source ↗

Anthropic Introduces Claude Opus 4 and Sonnet 4 with Leading Coding Benchmarks and Agent Capabilities

Anthropic has released Claude Opus 4 and Claude Sonnet 4, positioning Opus 4 as the world's best coding model with 72.5% on SWE-bench and 43.2% on Terminal-bench, and Sonnet 4 at 72.7% on SWE-bench. Both models are hybrid (near-instant + extended thinking), support extended thinking with tool use in beta, parallel tool execution, and improved memory via local file access. Alongside the models, Anthropic is launching Claude Code as generally available with GitHub Actions, VS Code, and JetBrains integrations, plus four new API capabilities: code execution tool, MCP connector, Files API, and one-hour prompt caching. Pricing is unchanged from prior Opus and Sonnet tiers ($15/$75 and $3/$15 per million tokens respectively), with availability on Anthropic API, Amazon Bedrock, and Google Cloud Vertex AI.

Long Context Evolution Frontier Model Releases Claude Sonnet 4 Amazon Bedrock Claude Opus 4.6 +21 more

Claude Opus 4.6: Anthropic's Long-Context Agentic Frontier Model

Part of these paths

Key takeaways

What it is

Benchmark position at launch

Architecture and capabilities

Real-world security capability

Ecosystem integration

Safety posture

Successor trajectory

Claude Opus 4.6: capability lineage and deployment footprint

Opus 4.6 in the Claude Opus lineage and against key rivals

Timeline

Related topics

FAQ

Stay current

Versions

Related guides (4)

Claude Opus 4.6: Anthropic's Milestone Model for Long-Context and Agentic Work

Claude: Anthropic's AI Assistant Built for Safety and Scale

Claude Code: Anthropic's Autonomous Coding Agent

GPT-5.5: OpenAI's Benchmark-Leading Agentic Model with a Hallucination Problem

More on Claude Opus 4.6 (6)

Anthropic Releases Claude Opus 4.7 with Enhanced Coding, Vision, and Cyber Safeguards

Claude Opus 4.8 Released by Anthropic

Claude Opus 4.8: The System Card — Commentary

Claude Opus 4.8 Launches with Improved Honesty; Anthropic Previews Mythos-Class Models and Dynamic Workflows

Claude Opus 4.6 Released with 1M Token Context, Agentic Coding Advances, and State-of-the-Art Benchmarks

Anthropic Introduces Claude Opus 4 and Sonnet 4 with Leading Coding Benchmarks and Agent Capabilities