6Anthropic News·16d ago

Anthropic expands model safety bug bounty to target universal jailbreaks in CBRN and cybersecurity domains

Anthropic is expanding its HackerOne-partnered bug bounty program to offer up to $15,000 for novel universal jailbreak attacks against a next-generation safety mitigation system not yet publicly deployed. The program specifically targets high-risk domains including CBRN (chemical, biological, radiological, nuclear) and cybersecurity, with participants given early access to test the new safeguards before release. The initiative begins as invite-only and aligns with Anthropic's commitments under the White House Voluntary AI Commitments and G7 Hiroshima Process Code of Conduct.

AI Safety Research Regulatory Developments HackerOne Hiroshima AI Process White House Voluntary AI Commitments Anthropic

Related guides (3)

Anthropic

Anthropic: The AI Safety Company at the Center of the Frontier

Read asBeginner In-depth

AI Safety ResearchTopic guide

AI Safety Research: From Lab Policies to Real-World Flashpoints

Read asBeginner In-depth

Regulatory DevelopmentsTopic guide

AI Regulatory Developments: From Voluntary Frameworks to Government Enforcement

Read asBeginner In-depth

Related events (8)

6Anthropic News·18d ago·source ↗

Anthropic launches bug bounty program to stress-test ASL-3 Constitutional Classifiers

Anthropic launched an invite-only bug bounty program in partnership with HackerOne to find universal jailbreaks in its Constitutional Classifiers system before public deployment, offering up to $25,000 per verified vulnerability. The program targets CBRN-related safety bypasses on Claude 3.7 Sonnet and is part of Anthropic's work to meet its AI Safety Level-3 (ASL-3) Deployment Standard under its Responsible Scaling Policy. A follow-up update extended the program to test Constitutional Classifiers on the new Claude Opus 4 model and began accepting reports of universal jailbreaks found on public platforms. The initiative reflects Anthropic's structured approach to pre-deployment safety validation for increasingly capable models.

Frontier Model Releases AI Safety Research Constitutional Classifiers Claude Opus 4.6 HackerOne +3 more

5Openai Blog·1mo ago·source ↗

Introducing the OpenAI Safety Bug Bounty Program

OpenAI has launched a Safety Bug Bounty program targeting AI-specific abuse and safety risks. The program focuses on agentic vulnerabilities, prompt injection, and data exfiltration scenarios. This extends traditional security bug bounty models into AI safety territory, incentivizing external researchers to surface novel attack vectors.

AI Safety Research Enterprise Deployment Patterns prompt injection OpenAI Safety Bug Bounty agentic vulnerabilities +3 more

7Openai Blog·1mo ago·source ↗

GPT-5.5 Bio Bug Bounty

OpenAI has launched a red-teaming bug bounty program specifically targeting biosafety risks in GPT-5.5, offering rewards up to $25,000. The program focuses on finding universal jailbreaks that could bypass biological safety guardrails. This represents a structured external adversarial evaluation of a frontier model's safety properties in a high-stakes domain.

Frontier Model Releases Evaluation and Benchmarking GPT-5.5 Bio Bug Bounty OpenAI GPT-5.5 +1 more

3Openai Blog·1mo ago·source ↗

OpenAI Launches Bug Bounty Program

OpenAI announced a formal bug bounty program to crowdsource security vulnerability discovery across its products and services. The initiative is framed as part of OpenAI's broader commitment to building secure and trustworthy AI systems. Researchers who find and responsibly disclose vulnerabilities will be eligible for rewards.

AI Safety Research OpenAI

7Anthropic News·18d ago·source ↗

Anthropic Details Collaboration with US CAISI and UK AISI on Constitutional Classifier Red-Teaming

Anthropic has published an account of its ongoing voluntary partnership with the US Center for AI Standards and Innovation (CAISI) and UK AI Security Institute (AISI), in which government red-teamers were given deep access to pre-deployment versions of Constitutional Classifiers used on Claude Opus 4 and 4.1. The collaboration uncovered multiple vulnerability classes including prompt injection bypasses, cipher-based obfuscation attacks, universal jailbreaks via automated attack refinement, and input/output fragmentation exploits, each of which drove architectural improvements to Anthropic's safeguard systems. Key lessons shared include the value of providing unprotected model variants, real-time classifier score access, and detailed internal documentation to enable targeted red-teaming. The announcement frames government partnership as a core component of Anthropic's Safeguards approach rather than a one-off audit.

Frontier Model Releases Evaluation and Benchmarking Constitutional Classifiers prompt injection Claude Opus 4.6 +6 more

8Anthropic News·18d ago·source ↗

Anthropic expands Project Glasswing to 150 new organizations across critical infrastructure sectors

Anthropic is expanding Project Glasswing, its AI-assisted cybersecurity initiative, from ~50 initial partners to approximately 150 additional organizations spanning power, water, healthcare, communications, and hardware sectors across 15+ countries. Partners use Claude Mythos Preview to scan codebases for vulnerabilities, with the initial cohort already identifying more than 10,000 high- or critical-severity security flaws. Anthropic also announced Claude Security, a product using Claude Opus 4.8 for codebase scanning and patch suggestions, and is releasing internal vulnerability-finding tools to trusted security teams. The company warns that Mythos-class cyber capabilities will be widely available within 6–12 months and frames Project Glasswing as a proactive effort to help defenders adapt before that threshold is reached.

Frontier Model Releases AI Safety Research Claude Opus 4.6 Claude Mythos Preview Project Glasswing +3 more

7Anthropic News·16d ago·source ↗

Anthropic launches initiative to fund third-party AI safety evaluations

Anthropic announced a funded initiative to source third-party evaluations measuring advanced AI capabilities and safety risks, with priority areas including cybersecurity, CBRN threats, model autonomy, national security risks, social manipulation, and misalignment. The initiative is tied to Anthropic's Responsible Scaling Policy and AI Safety Level (ASL) framework, aiming to address a gap between demand and supply of high-quality safety-relevant evals. Proposals are solicited via an application form, with Anthropic framing the effort as benefiting the broader AI safety ecosystem rather than just internal use.

Evaluation and Benchmarking AI Safety Research METR Google-Proof Q&A Responsible Scaling Policy +1 more

9Anthropic News·7d ago·source ↗

US government orders Anthropic to suspend access to Fable 5 and Mythos 5 citing national security jailbreak concerns

The US government issued an export control directive requiring Anthropic to immediately disable Fable 5 and Mythos 5 for all foreign nationals, effectively forcing a full customer suspension to ensure compliance. The government cited awareness of a jailbreak method, but Anthropic disputes the severity, stating the demonstrated technique is a narrow, non-universal jailbreak that produces results already achievable by other publicly available models including GPT-5.5. Anthropic is complying with the directive while publicly disagreeing with the standard applied, arguing that requiring perfect jailbreak resistance would halt all frontier model deployments industry-wide. This is a significant regulatory and safety governance flashpoint involving government authority over commercial AI model access.

Frontier Model Releases AI Safety Research Fable 5 UK AI Security Institute Mythos +5 more