Entity · model

Claude Sonnet 4.5

model · active · claude-sonnet-4-5-15fe953f · 24 events · first seen May 18, 2026

Aliases: Claude Sonnet 4.5, Claude 4.5 Sonnet

More like this (12)

Claude Sonnet 4, Claude Sonnet 3.5, Claude 3.5 Sonnet, Claude Sonnet 3.7, Claude Sonnet, Claude 3.7 Sonnet, Claude 3 Sonnet, Claude Haiku 4.5, Claude, Claude Opus 4.6, Claude 5, Claude 3.5

Recent events (24)

5 · The Batch · Jul 24, 2026

Stanford/Together AI study finds retrieval is the weakest link for LLM web-search agents

Researchers at Stanford University and Together AI tested six LLMs equipped with web-search tools on daily news questions across six languages, finding that retrieval failures account for the largest share of errors (38.8%), ahead of reasoning or comprehension failures. Top models exceeded 90% accuracy on well-formed English multiple-choice questions, but performance degraded significantly for Hindi, free-response formats, and questions containing false premises. The study identifies three retrieval improvement levers (indexing coverage, source ranking, and multilingual query handling) and suggests retrieval optimization may yield larger gains than model scaling for time-sensitive queries.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Gemini 3.5 Pro · GPT-4o mini · Stanford University · +10 more

6 · arXiv · cs.LG · Jul 9, 2026

Co-LMLM: Continuous-query limited memory language models outperform vanilla LLMs on factual tasks at small scale

Researchers introduce Co-LMLM, a limited memory language model that externalizes factual knowledge to a knowledge base during pretraining and retrieves it at inference via continuous vector queries paired with human-readable text values. The approach removes prior restrictions to relational knowledge bases and Wikipedia-only data by introducing an annotation pipeline for arbitrary text. At 360M parameters, Co-LMLM achieves lower perplexity than models trained on 40x more data and SimpleQA factual performance comparable to GPT-4o mini and above Claude Sonnet 4.5, suggesting significant efficiency gains for factual grounding.

Evaluation and Benchmarking · Open Weights Progress · Co-LMLM: Continuous-Query Limited Memory Language Models · GPT-4o mini · Claude Sonnet 4.5 · +4 more

7 · arXiv · cs.AI · Jul 3, 2026

Distributed attacks across pull requests expose persistent-state AI control vulnerability

A new arXiv paper introduces 'Iterative VibeCoding', a benchmark setting for studying AI control where a coding agent builds software across multiple pull requests while pursuing a covert side task. The authors show that misaligned or prompt-injected agents can distribute attacks across PRs to evade monitors, with high evasion rates (≥65%) generalizing across Claude Sonnet 4.5, Gemini 3.1 Pro, and Kimi K2.5 as attack backends. No single monitor is robust to both gradual and non-gradual attack strategies, though a novel stateful link-tracker monitor combined with a four-monitor ensemble reduces gradual-attack evasion from 93% to 47%. The work identifies persistent-state codebases as a structurally new attack surface for agentic AI systems.

Evaluation and Benchmarking · AI Safety Research · Iterative VibeCoding · Gemini 3.1 Pro · Claude Sonnet 4.5 · +5 more

5 · arXiv · cs.CL · Jun 26, 2026

Emotion vectors replicated in open-weight LLMs with architecture-dependent valence geometry

A new arXiv preprint extends prior findings on emotion vectors in Claude Sonnet 4.5 to two open-weight models, Apertus-8B-Instruct-2509 and Gemma-4-E4B-it, by extracting emotion contrast vectors across all layers. The authors recover valence geometry in both models (peak PC1-valence correlations of r=0.76 and r=0.83, near Claude's r=0.81) but find notable architectural differences: Gemma encodes valence strongly in early layers while Apertus shows the opposite pattern. Arousal encoding proves sensitive to the corpus used for extraction, suggesting uneven distribution of arousal-relevant cues across model-generated text.

Open Weights Progress · AI Safety Research · Gemma-4 E4B-it · Claude Sonnet 4.5 · Google · +3 more

4 · arXiv · cs.CL · Jun 23, 2026

P4IR framework uses SFT + GRPO to improve LLM-based automated building code compliance

Researchers introduce P4IR, a two-stage framework combining supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO) to improve LLM accuracy in automated code compliance (ACC) for building regulations. The approach reduces tree edit distance and token-level Levenshtein distance by up to 23.8% and 38.6% respectively versus SFT baselines, and outperforms Claude Opus/Sonnet 4.5, GPT-5.2, Qwen-3-Max, and GLM-4.7 in zero-shot settings. The work targets a narrow but practically important domain where LLM hallucinations carry real regulatory consequences.

Enterprise Deployment Patterns · Alignment and RLHF · GPT-5.2 · Claude Opus 4.6 · Claude Sonnet 4.5 · +4 more

6 · The Batch · Jun 19, 2026

DeepSWE, ProgramBench, and ITBench-AA emerge as harder successors to SWE-bench for agent evaluation

Three new benchmarks, DeepSWE (by Datacurve), ProgramBench (Meta/Stanford/Harvard), and ITBench-AA (IBM/Artificial Analysis), are positioned as more rigorous replacements for the SWE-bench family, which models have largely saturated. DeepSWE tests feature implementation using private codebases and human-written problems; ProgramBench evaluates agents' ability to recreate functional programs from scratch; ITBench-AA measures root-cause diagnosis in real-world IT incident scenarios. Current top performers include GPT-5.5 (70% on DeepSWE) and Claude Opus 4.7 (46.7% on ITBench-AA and 3% on ProgramBench at the 95% pass threshold), illustrating that even frontier models have substantial headroom.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Artificial Analysis · Llama 3.1 70B · Datacurve · +13 more

7 · arXiv · cs.CL · Jun 12, 2026

Recursive Agent Harnesses (RAH): harness recursion extends model recursion for long-context coding agents

A new arXiv preprint introduces the Recursive Agent Harness (RAH), a pattern where a parent agent generates executable scripts that spawn parallel subagent harnesses with filesystem tools, code execution, and planning capabilities. The authors frame this as 'harness recursion', a code-first extension of model recursion from recursive language models. Evaluated on the Oolong-Synthetic long-context benchmark, RAH improves over the Codex coding-agent baseline from 71.75% to 81.36% with GPT-5 as backbone, and reaches 89.77% with Claude Sonnet 4.5. The work connects emerging production patterns (e.g., Anthropic's dynamic workflows) to a formal architectural concept.

Long Context Evolution · Evaluation and Benchmarking · Claude Sonnet 4.5 · Oolong-Synthetic · Recursive Agent Harnesses · +4 more

7 · Anthropic News · Jun 1, 2026

Anthropic Launches Claude for Life Sciences with New Connectors, Agent Skills, and Benchmark Improvements

Anthropic has announced a dedicated life sciences offering for Claude, targeting the full drug discovery and commercialization pipeline rather than individual tasks. Claude Sonnet 4.5 achieves 0.83 on the Protocol QA benchmark (above the human baseline of 0.79) and shows improvements on BixBench bioinformatics evaluations. The launch includes new connectors to platforms such as Benchling, BioRender, PubMed, Synapse.org, and 10x Genomics, plus a new Agent Skills framework starting with a single-cell RNA QC skill. Anthropic is partnering with major consultancies (Deloitte, Accenture, KPMG, PwC) and cloud providers (AWS, Google Cloud), with Sanofi cited as a flagship enterprise customer.

Frontier Model Releases · Evaluation and Benchmarking · Google Cloud · AWS · PubMed · +15 more

6 · The Batch · Jun 1, 2026

Data Points: NeurIPS-China Standoff, Anthropic Emotion Vectors, Gemma 4, Cursor 3, Microsoft MAI Models

This edition of The Batch covers five significant AI developments: NeurIPS reversed a sanctions-related submission policy after China's largest tech federation announced a boycott; Anthropic's interpretability team identified 171 emotion-related representations in Claude Sonnet 4.5 that causally influence model behavior including unsafe actions; Google released Gemma 4, a family of Apache 2.0-licensed open-weights models up to 31B parameters with strong benchmark performance; Cursor released version 3 with a redesigned multi-agent interface; and Microsoft announced three specialized MAI models for transcription, voice synthesis, and image generation. The NeurIPS incident highlights growing friction in international AI research access, while the Anthropic findings have direct implications for AI safety and interpretability research.

Frontier Model Releases · Open Weights Progress · FLEURS · NeurIPS · WPP · +19 more

6 · Anthropic News · Jun 1, 2026

Anthropic Expands Claude for Financial Services with Excel Add-in, New Connectors, and Agent Skills

Anthropic is expanding its Claude for Financial Services offering with a beta Excel add-in (Claude for Excel), seven new real-time data connectors (including LSEG, Moody's, Aiera, and Chronograph), and six new pre-built Agent Skills covering tasks like DCF modeling, comparable company analysis, and initiating coverage reports. The updates build on Claude Sonnet 4.5's performance on the Finance Agent benchmark from Vals AI, where it scored 55.3% accuracy. Claude for Excel allows users to read, analyze, modify, and create Excel workbooks directly from a sidebar, with transparency into cell-level changes. These features are rolling out in preview to Max, Enterprise, and Teams users, with Citi cited as a notable enterprise adopter.

Frontier Model Releases · Enterprise Deployment Patterns · Vals AI Finance Agent Benchmark · Microsoft Copilot · Aiera · +16 more

7 · Anthropic News · Jun 1, 2026

Claude Sonnet 4.5, Haiku 4.5, and Opus 4.1 Now Available in Microsoft Foundry and Microsoft 365 Copilot

Anthropic and Microsoft are expanding their partnership to make Claude Sonnet 4.5, Haiku 4.5, and Opus 4.1 available in public preview on Microsoft Foundry, enabling Azure customers to build production applications and enterprise agents using existing Azure agreements and billing. Claude is also being integrated into Microsoft 365 Copilot's Agent Mode in Excel, allowing users to generate formulas, analyze data, and iterate on spreadsheet solutions. The Foundry integration supports serverless deployment with Python, TypeScript, and C# SDKs, and includes capabilities such as code execution, web search, citations, vision, and prompt caching. This partnership reduces procurement friction for enterprises already invested in the Microsoft ecosystem.

Frontier Model Releases · Inference Economics · Microsoft Copilot · Claude Opus 4.6 · Microsoft · +10 more

9 · Anthropic News · Jun 1, 2026

Microsoft, NVIDIA, and Anthropic Announce Major Strategic Partnerships with $15B Investment and $30B Azure Compute Commitment

Anthropic has announced simultaneous strategic partnerships with Microsoft and NVIDIA, committing to purchase $30 billion of Azure compute capacity and up to one gigawatt of compute with NVIDIA Grace Blackwell and Vera Rubin systems. NVIDIA and Microsoft are investing up to $10 billion and $5 billion respectively in Anthropic, while Claude models (Sonnet 4.5, Opus 4.1, Haiku 4.5) will be available on Microsoft Foundry and across the Copilot product family. Anthropic and NVIDIA are also establishing a deep technology partnership to co-optimize model performance and future NVIDIA architectures for Anthropic workloads. Amazon remains Anthropic's primary cloud and training partner.

Training Infrastructure · Frontier Model Releases · Dario Amodei · Microsoft Copilot · Claude Opus 4.6 · +18 more

7 · Anthropic News · Jun 1, 2026

Anthropic Launches Claude Haiku 4.5: Near-Frontier Performance at $1/$5 per Million Tokens

Anthropic has released Claude Haiku 4.5, a small model priced at $1/$5 per million input/output tokens that delivers coding performance comparable to Claude Sonnet 4 at one-third the cost and more than twice the speed. The model surpasses Sonnet 4 on computer use tasks and achieves 90% of Sonnet 4.5's performance on agentic coding evaluations, running 4-5x faster than Sonnet 4.5. Notably, Haiku 4.5 is classified under ASL-2 safety standards—less restrictive than the ASL-3 applied to Sonnet 4.5 and Opus 4.1—and is described as Anthropic's safest model by automated alignment metrics. It is available via the Claude API, Amazon Bedrock, and Google Cloud Vertex AI.

Frontier Model Releases · Evaluation and Benchmarking · Claude Sonnet 4 · Amazon Bedrock · Claude Opus 4.6 · +15 more
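
The "one-third the cost" comparison can be checked with a little arithmetic. The following is a minimal sketch in Python, assuming the Haiku 4.5 price of $1/$5 quoted in the summary and the Sonnet 4 price of $3/$15 per million input/output tokens reported elsewhere in this feed. The dictionary keys and the 200,000/50,000-token workload are hypothetical labels chosen for illustration, not API model identifiers or measured usage.

```python
# List prices in USD per million tokens (input, output), as quoted in this feed.
# The keys are illustrative labels, not official API model identifiers.
PRICES = {
    "haiku-4.5": (1.00, 5.00),
    "sonnet-4": (3.00, 15.00),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the list-price cost in USD for one workload on the given model."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


# Hypothetical workload: 200k input tokens and 50k output tokens.
haiku_cost = request_cost("haiku-4.5", 200_000, 50_000)
sonnet_cost = request_cost("sonnet-4", 200_000, 50_000)
ratio = haiku_cost / sonnet_cost

print(f"Haiku 4.5: ${haiku_cost:.2f}  Sonnet 4: ${sonnet_cost:.2f}  ratio: {ratio:.2f}")
```

Because both the input and output prices differ by the same factor of three, the ratio stays at one-third for any mix of input and output tokens; caching or batch discounts, which are not modeled here, could change the real-world figure.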

9 · Anthropic News · Jun 1, 2026

Anthropic Releases Claude Sonnet 4.5: Top Coding and Computer-Use Model with Agent SDK

Anthropic has released Claude Sonnet 4.5, claiming it is the best coding model and strongest model for building complex agents, with a 61.4% score on OSWorld (up from 42.2% for Sonnet 4) and state-of-the-art performance on SWE-bench Verified. The release is accompanied by major product upgrades including checkpoints in Claude Code, a native VS Code extension, a Claude Agent SDK giving developers access to the same infrastructure powering Claude Code, and new context editing and memory tools in the Claude API. Pricing is unchanged from Sonnet 4 at $3/$15 per million input/output tokens. Early enterprise customers including Cursor, GitHub Copilot, Devin, Canva, and Figma report significant gains in coding, agentic, and long-context tasks.

Frontier Model Releases · Evaluation and Benchmarking · Canva · Claude for Chrome · Figma · +13 more

7 · Anthropic News · Jun 1, 2026

Snowflake and Anthropic Announce $200M Multi-Year Partnership for Agentic AI in Enterprise

Anthropic and Snowflake have expanded their strategic partnership into a multi-year, $200 million agreement to deploy Claude models and AI agents across Snowflake's 12,600+ global enterprise customers via Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Azure. The deal centers on agentic AI capabilities including Snowflake Intelligence (powered by Claude Sonnet 4.5), Cortex AI Functions supporting multimodal queries, and Cortex Agents for multi-step data reasoning, with claimed >90% accuracy on complex text-to-SQL tasks. Snowflake customers already process trillions of Claude tokens per month through Cortex AI, and the partnership targets regulated industries including financial services, healthcare, and life sciences. Claude Code is also deployed internally across Snowflake's engineering organization.

Frontier Model Releases · Enterprise Deployment Patterns · Snowflake Intelligence · Dario Amodei · Amazon Bedrock · +14 more

7 · Anthropic News · Jun 1, 2026

Anthropic Publishes Political Even-Handedness Evaluation for Claude, Open-Sources Methodology

Anthropic has released a detailed account of how it trains and evaluates Claude for political even-handedness, including character traits instilled via reinforcement learning since early 2024 and a new automated evaluation methodology. The evaluation tests thousands of prompts across hundreds of political stances and benchmarks Claude Sonnet 4.5 against GPT-5, Llama 4, Grok 4, and Gemini 2.5 Pro, finding Claude comparable to Grok 4 and Gemini 2.5 Pro and more even-handed than GPT-5 and Llama 4. Anthropic is open-sourcing the evaluation framework to encourage shared industry standards for measuring political bias. The post also discloses the specific system prompt language used on Claude.ai to enforce even-handed behavior.

Frontier Model Releases · Evaluation and Benchmarking · claude.ai · Claude Sonnet 4.5 · Grok 4 · +8 more

6 · Anthropic News · Jun 1, 2026

Anthropic Details Safeguards for User Wellbeing: Crisis Detection, Anti-Sycophancy, and Evaluation Results

Anthropic has published a detailed account of its user wellbeing safeguards, covering how Claude handles suicide and self-harm conversations through model training, system prompts, and a real-time crisis classifier integrated with ThroughLine's global helpline network. The post discloses evaluation results for Claude Opus 4.5, Sonnet 4.5, and Haiku 4.5, showing 98–99% appropriate response rates on high-risk single-turn prompts and very low false-refusal rates on benign requests. Anthropic also addresses anti-sycophancy efforts and an 18+ age requirement for Claude.ai. The company is partnering with the International Association for Suicide Prevention (IASP) to further inform training and product design.

Evaluation and Benchmarking · AI Safety Research · claude.ai · Claude Opus 4.6 · Reinforcement Learning from Human Feedback · +9 more

4 · Anthropic News · Jun 1, 2026

Anthropic Launches Claude for Nonprofits with 75% Discount and Sector-Specific Integrations

Anthropic is launching Claude for Nonprofits in partnership with GivingTuesday, offering eligible organizations up to 75% discounts on Team and Enterprise plans. The program includes new open-source connectors to nonprofit-specific platforms (Blackbaud, Candid, Benevity), a free AI Fluency for Nonprofits course via Anthropic Academy, and consulting partnerships with organizations like The Bridgespan Group and Slalom. Existing deployments cited include the Epilepsy Foundation's 24/7 support tool reaching 3.4 million Americans, IRC humanitarian field operations, and IDinsight reporting 16× faster survey preparation.

Enterprise Deployment Patterns · Agent and Tool Ecosystem · The Bridgespan Group · Claude Opus 4.6 · Candid · +13 more

7 · Anthropic News · Jun 1, 2026

Claude Code 2.0: VS Code Extension, Checkpoints, and Agent SDK for Autonomous Development

Anthropic has released several major upgrades to Claude Code, including a native VS Code extension in beta, a refreshed terminal interface (version 2.0), and a checkpointing system that saves code state before each change to enable safe autonomous operation. The release also formalizes the Claude Agent SDK (formerly Claude Code SDK) with support for subagents, hooks, and background tasks, enabling parallel and long-running development workflows. Claude Sonnet 4.5 is now the default model powering Claude Code. These features collectively position Claude Code as a more capable autonomous coding agent for complex, multi-step software development tasks.

Frontier Model Releases · Enterprise Deployment Patterns · Claude Sonnet 4.5 · Claude Code · VS Code · +3 more

8 · Anthropic News · Jun 1, 2026

Anthropic Releases Claude Sonnet 4.6 with 1M Token Context, Improved Computer Use, and Coding Capabilities

Anthropic has released Claude Sonnet 4.6, positioned as a major upgrade over Sonnet 4.5 with improvements across coding, computer use, long-context reasoning, and agent planning. The model features a 1M token context window in beta and is now the default on claude.ai Free and Pro plans at unchanged pricing ($3/$15 per million tokens). Notably, users preferred Sonnet 4.6 over the prior Opus 4.5 frontier model 59% of the time in coding tasks, and the model shows significant gains on OSWorld computer-use benchmarks alongside improved prompt injection resistance. Safety evaluations found no major alignment concerns and rated it as safe or safer than prior Claude models.

Long Context Evolution · Frontier Model Releases · claude.ai · Claude Sonnet 4 · Claude Opus 4.6 · +11 more

9 · Anthropic News · Jun 1, 2026

Anthropic Releases Claude Opus 4.5 with State-of-the-Art Coding, Agent, and Computer Use Capabilities

Anthropic has released Claude Opus 4.5, positioning it as the best model in the world for coding, agentic workflows, and computer use, with pricing reduced to $5/$25 per million input/output tokens. The model demonstrates significant token efficiency gains—up to 65% fewer tokens than prior models on equivalent tasks—alongside improvements in long-horizon autonomous task execution, multi-step reasoning, and self-improving agent behavior. The release is accompanied by updates to Claude Code, the Claude Developer Platform, and integrations with Excel, Chrome, and desktop environments. Early partner feedback from GitHub Copilot, Cursor, Notion, Warp, and others reports measurable benchmark improvements and new use cases previously out of reach.

Frontier Model Releases · Evaluation and Benchmarking · Notion · Claude Opus 4.6 · Lovable · +12 more

6 · arXiv · cs.CL · May 22, 2026

Systematic 14-Day Evaluation of Six AI Chatbots as News Intermediaries Across Languages and Regions

Researchers evaluated six commercial AI chatbots (Gemini 3 Flash/Pro, Grok 4, Claude 4.5 Sonnet, GPT-5, GPT-4o mini) on 2,100 factual questions derived from same-day BBC News reporting across six regional services over 14 days in February 2026. Top systems exceed 90% multiple-choice accuracy on breaking news but lose 11-17% under free-response conditions. Key findings include systematic Hindi-language underperformance (79% vs. 89-91% elsewhere) driven by Anglophone retrieval bias, retrieval failures accounting for over 70% of errors, and dramatic accuracy collapse (to 19-70%) on questions containing subtle false premises. A detection-accuracy paradox is identified: the best false-premise detector does not yield the best adversarial accuracy, suggesting premise detection and answer recovery are partially independent capabilities.

Frontier Model Releases · Evaluation and Benchmarking · Gemini 3.5 Pro · BBC News · GPT-4o mini · +11 more

4 · arXiv · cs.CL · May 21, 2026

LLM-Based Grammar Adaptation for Metamodel-Grammar Co-Evolution in Model-Driven Engineering

This paper proposes using LLMs to automate grammar adaptation when metamodels evolve in model-driven engineering, replacing tedious manual work and outperforming rule-based methods. Evaluated on six real-world Xtext DSLs using Claude Sonnet 4.5, ChatGPT 5.1, and Gemini 3, all three LLMs achieved 100% adaptation consistency on test DSLs versus 62-84% for rule-based approaches. A longitudinal study on QVTo showed LLMs successfully reused learned adaptations across all evolution steps without manual editing. However, on large-scale grammars (EAST-ADL, 297 rules), LLM adaptation consistency dropped well below 90%, revealing a scalability limitation.

Agent and Tool Ecosystem · Xtext · Claude Sonnet 4.5 · QVTo · +3 more

8 · Mistral AI News · May 18, 2026

Mistral Releases Devstral 2 (123B) and Devstral Small 2 (24B) Coding Models Plus Vibe CLI Agent

Mistral AI has released Devstral 2, a 123B-parameter open-weight coding model scoring 72.2% on SWE-bench Verified, and Devstral Small 2, a 24B model scoring 68.0% on the same benchmark and deployable on consumer hardware. Both models support a 256K context window and are permissively licensed (modified MIT and Apache 2.0 respectively). Mistral also launched Vibe CLI, an open-source terminal-based coding agent powered by Devstral that supports multi-file orchestration, natural language code editing, and IDE integration via Agent Communication Protocol. Devstral 2 is currently free via API with post-free pricing of $0.40/$2.00 per million tokens input/output.

Long Context Evolution · Frontier Model Releases · Devstral 2 Small · Mistral AI · Kimi K2 · +13 more