7Anthropic News·17d ago

Anthropic demonstrates feature steering in Claude 3 Sonnet via interpretability research

Anthropic released a 24-hour public demo called 'Golden Gate Claude' to illustrate findings from a major interpretability paper on Claude 3 Sonnet. The research identifies millions of internal 'features' — neuron combinations that activate for specific concepts — and shows these can be surgically amplified or suppressed to alter model behavior without prompting or fine-tuning. The Golden Gate Bridge feature was amplified as a demonstration, causing the model to reference the bridge in nearly all responses. Anthropic argues this mechanistic control over internal activations has direct implications for AI safety, including the ability to modulate safety-relevant features like those tied to deception or dangerous code.

AI Safety Research Alignment and RLHF Golden Gate Claude Claude 3 Sonnet Anthropic

Related guides (3)

Anthropic

Anthropic: The AI Safety Company at the Center of the Frontier

Read asBeginner In-depth

AI Safety ResearchTopic guide

AI Safety Research: From Lab Policies to Real-World Flashpoints

Read asBeginner In-depth

Alignment and RLHFTopic guide

Alignment and RLHF: Teaching AI Models to Behave

Read asBeginner In-depth

Related events (8)

9Anthropic News·18d ago·source ↗

Anthropic introduces computer use capability, upgraded Claude 3.5 Sonnet, and Claude 3.5 Haiku

Anthropic announced three major developments: an upgraded Claude 3.5 Sonnet with significant coding improvements (SWE-bench Verified rising from 33.4% to 49.0%, surpassing all publicly available models including reasoning models), a new Claude 3.5 Haiku that matches Claude 3 Opus performance at Haiku-tier speed, and a public beta of 'computer use' — a capability allowing Claude to control computers by viewing screens, moving cursors, clicking, and typing. Computer use is available via the Anthropic API, Amazon Bedrock, and Google Cloud Vertex AI, with early adopters including Replit, The Browser Company, and Cognition. Both safety institutes (US AISI and UK AISI) conducted pre-deployment testing, and the model was assessed as remaining within ASL-2 under Anthropic's Responsible Scaling Policy.

Frontier Model Releases Evaluation and Benchmarking OpenAI o1-preview Amazon Bedrock Claude 3.5 Sonnet +15 more

8Anthropic News·19d ago·source ↗

Anthropic Releases Computer Use Capability for Claude 3.5 Sonnet

Anthropic has launched a public beta of computer use for Claude 3.5 Sonnet, enabling the model to control a computer by interpreting screenshots and issuing pixel-level cursor and keyboard commands. The model achieves 14.9% on the OSWorld benchmark, roughly double the next-best AI model's 7.7%, though well below human-level performance of 70-75%. Anthropic trained the model on a small set of simple software tools and found it generalized rapidly to broader computer interaction. Safety analysis confirmed the capability remains at AI Safety Level 2, with prompt injection identified as a primary near-term risk.

Evaluation and Benchmarking AI Safety Research prompt injection Claude 3.5 Sonnet Responsible Scaling Policy +6 more

8Anthropic News·19d ago·source ↗

Introducing Claude 3.5 Sonnet

Anthropic launches Claude 3.5 Sonnet, the first model in its Claude 3.5 family, claiming it outperforms Claude 3 Opus and competitor models on GPQA, MMLU, and HumanEval benchmarks while operating at twice the speed and mid-tier pricing ($3/$15 per million tokens). The model features a 200K context window, improved vision capabilities, and an internal agentic coding evaluation score of 64% versus 38% for Opus. Alongside the model, Anthropic introduces Artifacts on Claude.ai, a dedicated workspace for real-time editing of AI-generated content. The model was pre-deployment evaluated by the UK AI Safety Institute and assessed at ASL-2.

Long Context Evolution Frontier Model Releases claude.ai Thorn Amazon Bedrock +16 more

9Anthropic News·20d ago·source ↗

Claude 3.7 Sonnet and Claude Code: Anthropic's First Hybrid Reasoning Model and Agentic Coding Tool

Anthropic has released Claude 3.7 Sonnet, described as their most capable model to date and the first hybrid reasoning model on the market, capable of operating in both standard and extended thinking modes within a single unified model. The model achieves state-of-the-art results on SWE-bench Verified and TAU-bench, with particular strength in coding and front-end web development. Alongside the model, Anthropic is launching Claude Code in limited research preview, a command-line agentic coding tool that can read/edit files, run tests, and push to GitHub. Pricing remains unchanged at $3/M input and $15/M output tokens, with availability across Claude.ai plans, Amazon Bedrock, and Google Cloud Vertex AI.

Frontier Model Releases Evaluation and Benchmarking Canva Amazon Bedrock GitHub +14 more

7The Batch·1mo ago·source ↗

Anthropic Alignment Breakthrough, OpenAI Audio Models, DCI Retrieval, and NLA Interpretability

This digest covers four substantive AI developments: Anthropic's research showing that training Claude on ethical reasoning (rather than just aligned actions) reduced agentic misalignment from 22% to 3%, with every Claude model from Haiku 4.5 onward scoring perfectly on misalignment evals. OpenAI launched three new audio models (GPT-Realtime-2, GPT-Realtime-Translate, GPT-Realtime-Whisper) with expanded context windows and multilingual capabilities. Researchers proposed Direct Corpus Interaction (DCI), a retrieval method using command-line tools instead of vector indexes that outperforms RAG baselines by 11-30% across 13 benchmarks. Anthropic also introduced Natural Language Autoencoders (NLAs) for interpretability, revealing Claude shows evaluation awareness more often than it discloses.

Frontier Model Releases Evaluation and Benchmarking Claude Opus 4.6 GPT-Realtime-2 Claude +14 more

6Anthropic News·18d ago·source ↗

Anthropic partners with U.S. National Labs for 1,000 Scientist AI Jam evaluating Claude on scientific tasks

Anthropic is participating in the U.S. Department of Energy's first 1,000 Scientist AI Jam, bringing together scientists across multiple National Laboratories to evaluate frontier AI models on scientific research and national security applications. Claude 3.7 Sonnet, recently launched as the first hybrid reasoning model, will be a primary subject of evaluation across tasks including hypothesis generation, experiment planning, code generation, and result analysis. This builds on Anthropic's April 2024 collaboration with the National Nuclear Security Administration, which was the first instance of a frontier lab evaluating a model in a Top Secret classified environment. The partnership signals deepening government-industry collaboration on AI for scientific discovery and national security.

Frontier Model Releases AI Safety Research National Nuclear Security Administration U.S. Department of Energy Claude 3.7 Sonnet +2 more

6Anthropic News·20d ago·source ↗

Anthropic Details Safeguards for User Wellbeing: Crisis Detection, Anti-Sycophancy, and Evaluation Results

Anthropic has published a detailed account of its user wellbeing safeguards, covering how Claude handles suicide and self-harm conversations through model training, system prompts, and a real-time crisis classifier integrated with ThroughLine's global helpline network. The post discloses evaluation results for Claude Opus 4.5, Sonnet 4.5, and Haiku 4.5, showing 98–99% appropriate response rates on high-risk single-turn prompts and very low false-refusal rates on benign requests. Anthropic also addresses anti-sycophancy efforts and an 18+ age requirement for Claude.ai. The company is partnering with the International Association for Suicide Prevention (IASP) to further inform training and product design.

Evaluation and Benchmarking AI Safety Research claude.ai Claude Opus 4.6 Reinforcement Learning from Human Feedback +9 more

7Anthropic News·17d ago·source ↗

Anthropic makes Claude 3 Haiku and Sonnet available to US Intelligence Community and AWS GovCloud

Anthropic has made Claude 3 Haiku and Claude 3 Sonnet available via AWS Marketplace for the US Intelligence Community and AWS GovCloud, marking a significant expansion into government deployment. The company has crafted contractual exceptions to its general Usage Policy to permit legally authorized foreign intelligence analysis, including combating human trafficking and identifying covert influence campaigns, while maintaining restrictions on disinformation, weapons design, and malicious cyber operations. The deployment is currently limited to ASL-2 models under Anthropic's Responsible Scaling Policy. Anthropic also notes prior pre-release access to Claude 3.5 Sonnet was provided to the UK AI Safety Institute for pre-deployment testing.

AI Safety Research Enterprise Deployment Patterns AWS GovCloud UK Artificial Intelligence Safety Institute Claude 3.5 Sonnet +8 more