6Anthropic News·18d ago

Anthropic launches bug bounty program to stress-test ASL-3 Constitutional Classifiers

Anthropic launched an invite-only bug bounty program in partnership with HackerOne to find universal jailbreaks in its Constitutional Classifiers system before public deployment, offering up to $25,000 per verified vulnerability. The program targets CBRN-related safety bypasses on Claude 3.7 Sonnet and is part of Anthropic's work to meet its AI Safety Level-3 (ASL-3) Deployment Standard under its Responsible Scaling Policy. A follow-up update extended the program to test Constitutional Classifiers on the new Claude Opus 4 model and began accepting reports of universal jailbreaks found on public platforms. The initiative reflects Anthropic's structured approach to pre-deployment safety validation for increasingly capable models.

Frontier Model Releases AI Safety Research Constitutional Classifiers Claude Opus 4.6 HackerOne Responsible Scaling Policy Claude 3.7 Sonnet Anthropic

Related guides (4)

Claude Opus 4.6

Claude Opus 4.6: Anthropic's Milestone Model for Long-Context and Agentic Work

Read asBeginner In-depth

Frontier Model ReleasesTopic guide

Frontier Model Releases: The Race From Language to Action

Read asBeginner In-depth

Anthropic

Anthropic: The AI Safety Company at the Center of the Frontier

Read asBeginner

AI Safety ResearchTopic guide

AI Safety Research: From Lab Evals to Geopolitical Flashpoint

Read asIn-depth

Related events (8)

6Anthropic News·16d ago·source ↗

Anthropic expands model safety bug bounty to target universal jailbreaks in CBRN and cybersecurity domains

Anthropic is expanding its HackerOne-partnered bug bounty program to offer up to $15,000 for novel universal jailbreak attacks against a next-generation safety mitigation system not yet publicly deployed. The program specifically targets high-risk domains including CBRN (chemical, biological, radiological, nuclear) and cybersecurity, with participants given early access to test the new safeguards before release. The initiative begins as invite-only and aligns with Anthropic's commitments under the White House Voluntary AI Commitments and G7 Hiroshima Process Code of Conduct.

AI Safety Research Regulatory Developments HackerOne Hiroshima AI Process White House Voluntary AI Commitments +1 more

7Anthropic News·18d ago·source ↗

Anthropic Details Collaboration with US CAISI and UK AISI on Constitutional Classifier Red-Teaming

Anthropic has published an account of its ongoing voluntary partnership with the US Center for AI Standards and Innovation (CAISI) and UK AI Security Institute (AISI), in which government red-teamers were given deep access to pre-deployment versions of Constitutional Classifiers used on Claude Opus 4 and 4.1. The collaboration uncovered multiple vulnerability classes including prompt injection bypasses, cipher-based obfuscation attacks, universal jailbreaks via automated attack refinement, and input/output fragmentation exploits, each of which drove architectural improvements to Anthropic's safeguard systems. Key lessons shared include the value of providing unprotected model variants, real-time classifier score access, and detailed internal documentation to enable targeted red-teaming. The announcement frames government partnership as a core component of Anthropic's Safeguards approach rather than a one-off audit.

Frontier Model Releases Evaluation and Benchmarking Constitutional Classifiers prompt injection Claude Opus 4.6 +6 more

8Anthropic News·18d ago·source ↗

Anthropic activates ASL-3 safety protections for Claude Opus 4 launch

Anthropic has activated its AI Safety Level 3 (ASL-3) Deployment and Security Standards in conjunction with launching Claude Opus 4, marking the first time any Anthropic model has been deployed under ASL-3 rather than the baseline ASL-2. The activation is described as precautionary: Anthropic has not conclusively determined that Opus 4 crosses the ASL-3 capability threshold, but cannot rule it out due to continued improvements in CBRN-related knowledge. ASL-3 measures include Constitutional Classifiers to block end-to-end CBRN weapon development workflows and enhanced model-weight security against sophisticated non-state attackers. Claude Sonnet 4 was evaluated and cleared for ASL-2, and ASL-4 was ruled out for Opus 4.

Frontier Model Releases AI Safety Research Constitutional Classifiers Claude Sonnet 4 Claude Opus 4.6 +4 more

7Anthropic News·19d ago·source ↗

Anthropic Launches Claude Code Security: AI-Powered Vulnerability Detection for Defenders

Anthropic has released Claude Code Security in limited research preview for Enterprise and Team customers, a capability built into Claude Code that scans codebases for security vulnerabilities and suggests patches for human review. Unlike rule-based static analysis tools, it uses Claude's reasoning to understand code context, trace data flows, and detect complex vulnerabilities including novel ones. Built on Claude Opus 4.6, the system found over 500 previously undetected vulnerabilities in production open-source codebases during internal research. The release is framed as a defensive measure to put AI-enabled vulnerability discovery in the hands of defenders before attackers can exploit the same capabilities.

Frontier Model Releases AI Safety Research Claude Opus 4.6 Anthropic Policy Frontier Red Team Pacific Northwest National Laboratory +5 more

9Anthropic News·17d ago·source ↗

Anthropic introduces computer use capability, upgraded Claude 3.5 Sonnet, and Claude 3.5 Haiku

Anthropic announced three major developments: an upgraded Claude 3.5 Sonnet with significant coding improvements (SWE-bench Verified rising from 33.4% to 49.0%, surpassing all publicly available models including reasoning models), a new Claude 3.5 Haiku that matches Claude 3 Opus performance at Haiku-tier speed, and a public beta of 'computer use' — a capability allowing Claude to control computers by viewing screens, moving cursors, clicking, and typing. Computer use is available via the Anthropic API, Amazon Bedrock, and Google Cloud Vertex AI, with early adopters including Replit, The Browser Company, and Cognition. Both safety institutes (US AISI and UK AISI) conducted pre-deployment testing, and the model was assessed as remaining within ASL-2 under Anthropic's Responsible Scaling Policy.

Frontier Model Releases Evaluation and Benchmarking OpenAI o1-preview Amazon Bedrock Claude 3.5 Sonnet +15 more

7Anthropic News·16d ago·source ↗

Anthropic makes Claude 3 Haiku and Sonnet available to US Intelligence Community and AWS GovCloud

Anthropic has made Claude 3 Haiku and Claude 3 Sonnet available via AWS Marketplace for the US Intelligence Community and AWS GovCloud, marking a significant expansion into government deployment. The company has crafted contractual exceptions to its general Usage Policy to permit legally authorized foreign intelligence analysis, including combating human trafficking and identifying covert influence campaigns, while maintaining restrictions on disinformation, weapons design, and malicious cyber operations. The deployment is currently limited to ASL-2 models under Anthropic's Responsible Scaling Policy. Anthropic also notes prior pre-release access to Claude 3.5 Sonnet was provided to the UK AI Safety Institute for pre-deployment testing.

AI Safety Research Enterprise Deployment Patterns AWS GovCloud UK Artificial Intelligence Safety Institute Claude 3.5 Sonnet +8 more

9The Batch·8d ago·source ↗

Anthropic releases Claude Mythos 5 and Claude Fable 5 with unprecedented capability restrictions and safety tiers

Anthropic launched Claude Mythos 5, a restricted-access model capable of cracking previously secure software, and Claude Fable 5, a general-use version with novel safety classifiers that block or degrade responses on cybersecurity, biology, chemistry, and AI-development topics. Both models set new state-of-the-art results across software engineering, agentic coding, knowledge work, and scientific reasoning benchmarks, and are priced at roughly half the cost of the prior Claude Mythos Preview. Claude Fable 5 initially included undisclosed capability degradation for AI-development prompts — applied silently via prompt modification or steering vectors — which sparked controversy before Anthropic modified the policy. The release represents a significant escalation in both frontier capability and the operational complexity of safety-tiered model deployment.

Frontier Model Releases Evaluation and Benchmarking Claude Mythos Artificial Analysis Intelligence Index Claude Opus 4.6 +9 more

3Openai Blog·1mo ago·source ↗

OpenAI Launches Bug Bounty Program

OpenAI announced a formal bug bounty program to crowdsource security vulnerability discovery across its products and services. The initiative is framed as part of OpenAI's broader commitment to building secure and trustworthy AI systems. Researchers who find and responsibly disclose vulnerabilities will be eligible for rewards.

AI Safety Research OpenAI