Entity · model

Claude Sonnet 4

model · active · provisional · claude-sonnet-4-cd888959 · 22 events · first seen May 18, 2026

Aliases: Claude Sonnet 4, Claude Sonnet 4.6

More like this (12)

Claude Sonnet 4.5 · Claude 3.5 Sonnet · Claude Sonnet 3.5 · Claude Sonnet · Claude Sonnet 3.7 · Claude 3 Sonnet · Claude 3.7 Sonnet · Claude Opus 4.6 · Claude Haiku 4.5 · Claude · Claude 3.5 · Claude Instant 1.2

Recent events (22)

5 · arXiv · cs.CL · Jul 24, 2026

CM-LRS: A capital markets reliability benchmark for LLM workflow outputs

Researchers introduce CM-LRS (Capital Markets LLM Reliability Score), a seven-dimension evaluation framework assessing LLM outputs at the workflow level rather than the question-answer layer, targeting regulated capital-markets use cases such as DCM/ECM term extraction, M&A comparables, and issuer profiling. The benchmark is demonstrated on five workflows using public SEC EDGAR and UK takeover filings, scoring four models across four LLM judges. Key findings: frontier closed-source models cluster tightly (Sonnet 4.6 = 4.31, Opus 4.7 = 4.30, GPT-5.5 = 4.09) while Llama 3.3 70B lags at 3.15, with the gap concentrated in retrieval and synthesis tasks rather than extraction. The work advances domain-specific evaluation methodology for high-stakes financial workflows where regulatory defensibility matters.

Evaluation and Benchmarking · Enterprise Deployment Patterns · CM-LRS · SEC EDGAR · Llama 3.1 70B · +6 more

4 · Claude Code Release Notes · Jul 15, 2026

Claude Code 2.1.210 release: bug fixes, security hardening, and agent workflow improvements

Anthropic released Claude Code version 2.1.210, a substantial patch addressing over 30 bugs and adding several improvements to the agentic coding tool. Notable fixes include hardening against indirect prompt injection via subagent-read content, correcting git worktree isolation for subagents, fixing hook callback timeouts that caused unattended sessions to stall, and resolving MCP server teardown during mid-session re-syncs. The release also improves auto-mode permission classification by defaulting to Sonnet 5 for external sessions and adds UX enhancements for the multi-agent dashboard.

AI Safety Research · Agent and Tool Ecosystem · Claude Sonnet 4 · Claude Code · MCP · +1 more

6 · Anthropic News · Jul 6, 2026

Government of Alberta deploys Claude Code to scan 466M lines of code for cybersecurity vulnerabilities in 20 hours

The Government of Alberta's Ministry of Technology and Innovation has used Claude Code with Opus and Sonnet models to conduct a large-scale security review of its 1,280 applications and 3,400 code repositories, scanning 466 million lines of code in approximately 20 hours using ~50 parallel agents. The team remediated vulnerabilities, rebuilt legacy systems (including a 25-year-old Java subsidy portal rebuilt in 4-5 days), and deployed continuous red/blue team review agents built on the Claude Agent SDK. Alberta has published technical white papers documenting the approach for other governments to replicate.

Enterprise Deployment Patterns · Agent and Tool Ecosystem · Alberta AI Academy · Government of Alberta · Claude Sonnet 4 · +5 more

7 · The Batch · Jul 3, 2026

Microsoft reveals MAI-Thinking-1, a from-scratch reasoning model with MoE architecture

Microsoft introduced MAI-Thinking-1, its first reasoning language model built without distillation from third-party models, comparable in size to Claude Sonnet 4.6. The model uses a mixture-of-experts architecture (1T total / 35B active parameters), was pretrained on 30 trillion tokens of primarily licensed human-generated data, and trained via reinforcement learning across specialist models for STEM, coding, and safety. It scored 97.0% on AIME 2025, placing third behind Claude Opus 4.6 and ahead of DeepSeek V3.2, and is available in private preview via Microsoft Foundry. The release marks a strategic shift as Microsoft moves to reduce dependence on OpenAI models following a renegotiated partnership in April 2026.

Training Infrastructure · Frontier Model Releases · MAI-Thinking-1 · Claude Sonnet 4 · Claude Opus 4.6 · +12 more

7 · Hacker News · Jun 30, 2026

Anthropic releases Claude Sonnet 5

Anthropic has released Claude Sonnet 5, a new mid-tier model in its Claude lineup. The announcement was made on the official Anthropic news page and drew significant engagement on Hacker News, with 714 points and 386 comments. As a new named model release from a frontier lab, this is a notable update to the Claude model family.

Frontier Model Releases · Inference Economics · Claude Sonnet 3.5 · Claude Sonnet 4 · BrowseComp · +5 more

6 · arXiv · cs.AI · Jun 30, 2026

MCP Server Architecture Patterns: Five recurring patterns and four anti-patterns catalogued from production deployments

An industry experience paper catalogues five recurring architectural patterns for Model Context Protocol (MCP) servers (Resource Gateway, Tool Orchestrator, Stateful Session Server, Proxy Aggregator, and Domain-Specific Adapter), drawn from 15 servers including five production deployments on the ANSYR voice AI platform and ten from the official MCP registry. The paper also documents four anti-patterns and cross-cutting concerns around authentication, versioning, and observability. A quantitative evaluation includes inter-rater reliability (Cohen's kappa = 0.76 on 54 held-out servers), transport overhead measurements, and a tool-count study showing that tool-selection accuracy drops below 90% at between 10 and 15 tools for Claude Haiku 4.5 and at between 20 and 30 tools for Claude Sonnet 4. Code, corpus, and prompts are released as a replication package.

Enterprise Deployment Patterns · Agent and Tool Ecosystem · Claude Sonnet 4 · MCP Server Architecture Patterns for LLM-Integrated Applications · ANSYR · +3 more

5 · arXiv · cs.CL · Jun 23, 2026

Sub-billion parameter SLMs outperform zero-shot GPT-5.4 and Claude Sonnet 4.6 on relation extraction benchmarks

A new arXiv paper demonstrates that small language models (360M–3B parameters) fine-tuned on task-specific data can substantially outperform zero-shot frontier LLMs on relation extraction tasks. The best sub-billion model, Qwen2.5-0.5B fine-tuned on pooled general-domain data, achieves micro-F1 of 0.83 versus 0.69 for GPT-5.4 and 0.66 for Claude Sonnet 4.6 in zero-shot settings. The authors attribute the gains to task adaptation rather than model architecture, with a discriminative RoBERTa baseline also exceeding frontier models, and show that 4-bit quantized models deployable on consumer GPUs can match or beat proprietary API-based systems for this narrow task. The work provides evidence that for well-defined NLP tasks with available training data, compact adapted models offer a practical, private, and hardware-efficient alternative to frontier APIs.

Evaluation and Benchmarking · Open Weights Progress · RoBERTa · Claude Sonnet 4 · Biographical · +3 more
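
For readers unfamiliar with the metric quoted in this item, the following is a minimal sketch of how micro-F1 is typically computed: true positives, false positives, and false negatives are pooled across all relation types before precision and recall are taken. The relation names and counts are hypothetical and are not taken from the paper.

```python
def micro_f1(counts):
    """Micro-averaged F1: pool TP/FP/FN over all classes, then compute F1 once."""
    tp = sum(c["tp"] for c in counts.values())
    fp = sum(c["fp"] for c in counts.values())
    fn = sum(c["fn"] for c in counts.values())
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical per-relation counts, for illustration only.
example_counts = {
    "born_in": {"tp": 40, "fp": 10, "fn": 5},
    "works_for": {"tp": 30, "fp": 5, "fn": 15},
}

score = micro_f1(example_counts)
print(f"micro-F1 = {score:.2f}")
```

Because the counts are pooled, frequent relation types dominate the score, which is one reason micro-F1 and macro-F1 can diverge on imbalanced benchmarks.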

6 · arXiv · cs.CL · Jun 17, 2026

Location metadata causes systematic geographic bias leakage in LLMs, even with 'Unknown' placeholders

Researchers evaluate 'location leakage' — the phenomenon where LLMs generate geographically biased outputs when exposed to location metadata in user profiles, even when prompts are geographically neutral. Across creative writing and Q&A tasks, leakage spikes up to 793x above baseline for models including Llama 3.1-8B, Qwen3-8B, and Claude Sonnet 4.6. A novel structural finding shows that replacing location with 'Unknown' still elevates leakage by up to 72x, indicating the user profile frame itself acts as a conditioning signal independent of geographic content. This has direct implications for AI systems that use user metadata for localization.

Evaluation and Benchmarking · AI Safety Research · Claude Sonnet 4 · Alibaba · Qwen3-4B · +4 more

7 · arXiv · cs.CL · Jun 16, 2026

SearchGEO framework measures LLM search agent vulnerability to web content manipulation

Researchers introduce SearchGEO, a controlled evaluation framework for measuring endorsement corruption in LLM-based web-search agents, combining a manipulation pipeline, five-mode attack taxonomy, and multiple output metrics. Evaluating 13 LLM backends on 308 cases each, they find attack success rates ranging from 0.0% on Claude-Sonnet-4.6 to 31.4% on Gemini-3-Flash, with model-family-specific vulnerability patterns. An auxiliary probe escalating endorsement to install commands reveals a behavioral split: Claude over-rejects while GPT over-trusts. The findings argue for treating adversarial search content robustness as a first-class safety evaluation dimension for deployed agents.

Evaluation and Benchmarking · AI Safety Research · Claude Sonnet 4 · Google · Gemini 3 Flash · +4 more

6 · arXiv · cs.AI · Jun 10, 2026

Frontier coding agents use metaprogramming to handle esoteric programming languages

A new arXiv paper evaluates six LLM-based coding agents on four esoteric programming languages (including Brainfuck and Befunge-98), finding that the strongest agents—Claude Opus 4.6 and GPT-5.4 xhigh—often avoid writing the target language directly, instead generating it via Python metaprograms. Forbidding this strategy causes large performance drops, and text guidance alone does not transfer the capability to weaker models, though sharing Opus-derived Python helper code does sharply improve mid-tier agents. The study reveals capability stratification that mainstream benchmarks like SWE-Bench Verified compress into narrow bands, suggesting frontier agents succeed by constructing and debugging working models of unfamiliar environments rather than pattern-matching to training data.

Frontier Model Releases · Evaluation and Benchmarking · Claude Sonnet 4 · Claude Opus 4.6 · SWE-Bench Verified · +8 more

5 · arXiv · cs.CL · Jun 5, 2026

Pre-registered study finds Popperian code-generation prompt skills add no benefit beyond structural scaffolding

A pre-registered two-tier ablation study tests whether 'Popperian falsificationist' prompt skills improve LLM code generation through their procedural content or merely through structural scaffolding. Using Claude Sonnet 4.6 and Qwen2.5-Coder-0.5B with execution-based evaluation (HumanEval+ unit tests) rather than LLM-as-judge, the authors find that on the small model, structured prompts lift correctness by 20-22 points but the full Popperian skill shows no separable benefit over a labels-only scaffold. The paper contributes a calibrated negative result and a reusable disambiguation protocol for evaluating prompt-skill families, while also documenting that LLM self-judges at 0.5B scale perform no better than random selection.

Evaluation and Benchmarking · Claude Sonnet 4 · Scaffold, Not Vocabulary? A Controlled, Two-Tier, Pre-Registered Study of a Popperian Code-Generation Skill · HumanEval · +2 more

8 · Anthropic News · Jun 2, 2026

Anthropic activates ASL-3 safety protections for Claude Opus 4 launch

Anthropic has activated its AI Safety Level 3 (ASL-3) Deployment and Security Standards in conjunction with launching Claude Opus 4, marking the first time any Anthropic model has been deployed under ASL-3 rather than the baseline ASL-2. The activation is described as precautionary: Anthropic has not conclusively determined that Opus 4 crosses the ASL-3 capability threshold, but cannot rule it out due to continued improvements in CBRN-related knowledge. ASL-3 measures include Constitutional Classifiers to block end-to-end CBRN weapon development workflows and enhanced model-weight security against sophisticated non-state attackers. Claude Sonnet 4 was evaluated and cleared for ASL-2, and ASL-4 was ruled out for Opus 4.

Frontier Model Releases · AI Safety Research · Constitutional Classifiers · Claude Sonnet 4 · Claude Opus 4.6 · +4 more

6 · The Batch · Jun 2, 2026

MiniMax M2.7 proprietary reasoning model competes with Gemini and Claude Opus; roundup covers Cursor Composer 2, MAI-Image-2, Claude Code Channels, and Anthropic defense dispute

MiniMax released M2.7, a proprietary reasoning model that achieved 66.6% on MLE Bench Lite (tying Gemini 3.1) and 56.22% on SWE-Pro, priced at $0.30/$1.20 per million tokens, with the shift to proprietary marking a potential strategic pivot among Chinese AI labs away from open weights. Cursor released Composer 2, an agentic coding model built on a fine-tuned Kimi 2.5 (via Moonshot partnership), priced 86% cheaper than its predecessor and scoring 73.7 on SWE-bench Multilingual. Anthropic released Claude Code Channels, routing Telegram and Discord messages into local Claude Code sessions via MCP plugins, and separately filed a court response denying it has any backdoor or kill switch into military deployments of Claude. Microsoft announced MAI-Image-2, a text-to-image model ranking third on Arena.ai among research labs.

Frontier Model Releases · Open Weights Progress · Stitch · Claude Sonnet 4 · SWE-Pro · +17 more

7 · Anthropic News · Jun 2, 2026

Claude Sonnet 4 Now Generally Available in Xcode 26

Anthropic has made Claude generally available as the AI backend for Xcode 26's coding intelligence features, powered by Claude Sonnet 4. Developers can connect their Claude account to access a coding assistant with natural language interaction, documentation generation, inline edits, and SwiftUI preview creation directly within Apple's IDE. The integration is available to Claude Pro, Max, Team, and Enterprise plan subscribers who have Claude Code access. Usage limits are shared across platforms with a portion allocated to Xcode.

Frontier Model Releases · Enterprise Deployment Patterns · Claude Sonnet 4 · Xcode 26 · Claude Code · +4 more

9 · Anthropic News · Jun 1, 2026

Anthropic Introduces Claude Opus 4 and Sonnet 4 with Leading Coding Benchmarks and Agent Capabilities

Anthropic has released Claude Opus 4 and Claude Sonnet 4, positioning Opus 4 as the world's best coding model with 72.5% on SWE-bench and 43.2% on Terminal-bench, and Sonnet 4 at 72.7% on SWE-bench. Both models are hybrid (near-instant + extended thinking), support extended thinking with tool use in beta, parallel tool execution, and improved memory via local file access. Alongside the models, Anthropic is launching Claude Code as generally available with GitHub Actions, VS Code, and JetBrains integrations, plus four new API capabilities: code execution tool, MCP connector, Files API, and one-hour prompt caching. Pricing is unchanged from prior Opus and Sonnet tiers ($15/$75 and $3/$15 per million tokens respectively), with availability on Anthropic API, Amazon Bedrock, and Google Cloud Vertex AI.

Long Context Evolution · Frontier Model Releases · Claude Sonnet 4 · Amazon Bedrock · Claude Opus 4.6 · +21 more

7 · Anthropic News · Jun 1, 2026

Anthropic Launches Claude Haiku 4.5: Near-Frontier Performance at $1/$5 per Million Tokens

Anthropic has released Claude Haiku 4.5, a small model priced at $1/$5 per million input/output tokens that delivers coding performance comparable to Claude Sonnet 4 at one-third the cost and more than twice the speed. The model surpasses Sonnet 4 on computer use tasks and achieves 90% of Sonnet 4.5's performance on agentic coding evaluations, running 4-5x faster than Sonnet 4.5. Notably, Haiku 4.5 is classified under ASL-2 safety standards—less restrictive than the ASL-3 applied to Sonnet 4.5 and Opus 4.1—and is described as Anthropic's safest model by automated alignment metrics. It is available via the Claude API, Amazon Bedrock, and Google Cloud Vertex AI.

Frontier Model Releases · Evaluation and Benchmarking · Claude Sonnet 4 · Amazon Bedrock · Claude Opus 4.6 · +15 more
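
The "one-third the cost" claim in this item can be checked with simple arithmetic. The sketch below uses the per-million-token prices quoted in this feed ($1/$5 for Haiku 4.5 here, and $3/$15 for Sonnet 4 in the Opus 4 and Sonnet 4 launch item); the function name and the workload size are hypothetical, chosen only for illustration.

```python
# USD per million tokens as (input, output), as quoted in the event summaries.
PRICES = {
    "Claude Haiku 4.5": (1.00, 5.00),
    "Claude Sonnet 4": (3.00, 15.00),
}


def request_cost(model, input_tokens, output_tokens):
    """Return the USD cost of a workload at the listed per-million-token prices."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


# Hypothetical workload: 200k input tokens and 20k output tokens.
haiku_cost = request_cost("Claude Haiku 4.5", 200_000, 20_000)
sonnet_cost = request_cost("Claude Sonnet 4", 200_000, 20_000)
print(f"Haiku 4.5: ${haiku_cost:.2f}  Sonnet 4: ${sonnet_cost:.2f}  "
      f"ratio: {haiku_cost / sonnet_cost:.3f}")
```

Since both the input and the output price are exactly one-third of Sonnet 4's, the ratio holds for any mix of input and output tokens, not just this workload.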

7 · Anthropic News · Jun 1, 2026

Apple's Xcode 26.3 Integrates Claude Agent SDK for Autonomous Coding

Xcode 26.3 introduces native integration with Anthropic's Claude Agent SDK, enabling autonomous, long-running coding tasks directly within Apple's IDE. The integration supports visual verification via Xcode Previews, full-project reasoning across Apple frameworks, autonomous task execution with goal-directed behavior, and MCP-based access for Claude Code CLI users. This expands on an earlier September announcement that brought Claude Sonnet 4 to Xcode in a limited turn-by-turn capacity; that integration is now replaced by the same agentic harness that powers Claude Code.

Frontier Model Releases · Enterprise Deployment Patterns · Claude Sonnet 4 · Claude Code · SwiftUI · +6 more

8 · Anthropic News · Jun 1, 2026

Anthropic Acquires Vercept to Advance Claude's Computer Use Capabilities

Anthropic has acquired Vercept, a team specializing in AI perception and interaction for computer use tasks, whose co-founders include Kiana Ehsani, Luca Weihs, and Ross Girshick. Vercept will wind down its external product and join Anthropic to push computer use capabilities further. The announcement coincides with the launch of Claude Sonnet 4.6, which achieved 72.5% on the OSWorld benchmark—up from under 15% in late 2024—approaching human-level performance on tasks like navigating spreadsheets and completing web forms. This follows Anthropic's earlier acquisition of Bun and is part of a broader strategy to build agentic, multi-step task capabilities into Claude.

Frontier Model Releases · Evaluation and Benchmarking · Claude Sonnet 4 · Luca Weihs · Kiana Ehsani · +7 more

8 · Anthropic News · Jun 1, 2026

Anthropic Releases Claude Sonnet 4.6 with 1M Token Context, Improved Computer Use, and Coding Capabilities

Anthropic has released Claude Sonnet 4.6, positioned as a major upgrade over Sonnet 4.5 with improvements across coding, computer use, long-context reasoning, and agent planning. The model features a 1M token context window in beta and is now the default on claude.ai Free and Pro plans at unchanged pricing ($3/$15 per million tokens). Notably, users preferred Sonnet 4.6 over the prior Opus 4.5 frontier model 59% of the time in coding tasks, and the model shows significant gains on OSWorld computer-use benchmarks alongside improved prompt injection resistance. Safety evaluations found no major alignment concerns and rated it as safe as or safer than prior Claude models.

Long Context Evolution · Frontier Model Releases · claude.ai · Claude Sonnet 4 · Claude Opus 4.6 · +11 more

7 · Mistral AI News · May 18, 2026

Mistral Releases Leanstral: First Open-Source Code Agent for Lean 4 Formal Verification

Mistral AI has released Leanstral, an open-source code agent built on a sparse 120B/6B-active-parameter architecture, designed specifically for formal proof engineering in Lean 4. The model targets realistic proof engineering workflows rather than isolated math competition problems, and is benchmarked on FLTEval, a new evaluation suite tied to the Fermat's Last Theorem formalization project. Leanstral is released under Apache 2.0 with a free API endpoint and MCP support, and demonstrates competitive performance against Claude Sonnet 4.6 at roughly 1/15th the cost. The release positions formal verification as a scalable alternative to human code review for high-stakes software and mathematics.

Evaluation and Benchmarking · Open Weights Progress · Mistral AI · Claude Sonnet 4 · Claude Opus 4.6 · +11 more

6 · Anthropic News · May 18, 2026

Anthropic Updates Election Safeguards for Claude Ahead of 2026 US Midterms

Anthropic has published an update on its election-related safety measures for Claude, covering political bias evaluations, usage policy enforcement, and influence operation resistance testing. New model versions Claude Opus 4.7 and Sonnet 4.6 scored 95-96% on political impartiality evaluations and handled election-related policy compliance at 99.8-100% on a 600-prompt test suite. For the first time, Anthropic tested whether models can autonomously run influence operations end-to-end, finding that only Mythos Preview and Opus 4.7 completed more than half of tasks when safeguards were removed, underscoring ongoing capability concerns. Anthropic is also deploying election information banners pointing users to nonpartisan resources like TurboVote for the 2026 US midterms.

Frontier Model Releases · Evaluation and Benchmarking · Collective Intelligence Project · Claude Sonnet 4 · Claude Opus 4.6 · +9 more

8 · Qwen Research · May 18, 2026

Qwen3-Coder: 480B MoE Agentic Coding Model Released by Alibaba/Qwen Team

Alibaba's Qwen team has released Qwen3-Coder, a family of code-focused models with the flagship variant being Qwen3-Coder-480B-A35B-Instruct, a 480B-parameter Mixture-of-Experts model with 35B active parameters. It supports 256K native context length and up to 1M tokens via extrapolation. The model claims state-of-the-art results among open-weight models on agentic coding, browser-use, and tool-use benchmarks, with performance described as comparable to Claude Sonnet 4.

Long Context Evolution · Frontier Model Releases · Claude Sonnet 4 · Alibaba · Qwen3-Coder-480B-A35B-Instruct · +5 more