7 · Anthropic News · 18d ago

Anthropic Details Collaboration with US CAISI and UK AISI on Constitutional Classifier Red-Teaming

Anthropic has published an account of its ongoing voluntary partnership with the US Center for AI Standards and Innovation (CAISI) and UK AI Security Institute (AISI), in which government red-teamers were given deep access to pre-deployment versions of Constitutional Classifiers used on Claude Opus 4 and 4.1. The collaboration uncovered multiple vulnerability classes including prompt injection bypasses, cipher-based obfuscation attacks, universal jailbreaks via automated attack refinement, and input/output fragmentation exploits, each of which drove architectural improvements to Anthropic's safeguard systems. Key lessons shared include the value of providing unprotected model variants, real-time classifier score access, and detailed internal documentation to enable targeted red-teaming. The announcement frames government partnership as a core component of Anthropic's Safeguards approach rather than a one-off audit.

Frontier Model Releases · Evaluation and Benchmarking · AI Safety Research · Regulatory Developments · Constitutional Classifiers · prompt injection · Claude Opus 4.6 · Center for AI Standards and Innovation · UK AI Security Institute · universal jailbreak · Anthropic

Related guides (4)

Claude Opus 4.6

Claude Opus 4.6: Anthropic's Milestone Model for Long-Context and Agentic Work

prompt injection · Concept

Prompt Injection: The Security Threat Hiding in Plain Text

Frontier Model Releases · Topic guide

Frontier Model Releases: The Race From Language to Action

Anthropic

Anthropic: The AI Safety Company at the Center of the Frontier

Related events (8)

6 · Anthropic News · 18d ago

Anthropic launches bug bounty program to stress-test ASL-3 Constitutional Classifiers

Anthropic launched an invite-only bug bounty program in partnership with HackerOne to find universal jailbreaks in its Constitutional Classifiers system before public deployment, offering up to $25,000 per verified vulnerability. The program targets CBRN-related safety bypasses on Claude 3.7 Sonnet and is part of Anthropic's work to meet its AI Safety Level-3 (ASL-3) Deployment Standard under its Responsible Scaling Policy. A follow-up update extended the program to test Constitutional Classifiers on the new Claude Opus 4 model and began accepting reports of universal jailbreaks found on public platforms. The initiative reflects Anthropic's structured approach to pre-deployment safety validation for increasingly capable models.

Frontier Model Releases · AI Safety Research · Constitutional Classifiers · Claude Opus 4.6 · HackerOne · +3 more

7 · Anthropic News · 18d ago

Anthropic and NNSA Co-Develop Nuclear Safeguards Classifier for Claude Traffic

Anthropic, in partnership with the U.S. Department of Energy's National Nuclear Security Administration (NNSA) and DOE national laboratories, has co-developed an AI classifier that distinguishes between concerning and benign nuclear-related conversations with 96% accuracy in preliminary testing. The classifier has already been deployed on live Claude traffic as part of Anthropic's misuse-detection infrastructure. Anthropic plans to share the approach with the Frontier Model Forum as a replicable blueprint for other AI developers. This represents the first public-private partnership of this kind for nuclear proliferation risk monitoring in frontier AI systems.

Evaluation and Benchmarking · AI Safety Research · Nuclear Proliferation Risk Classifier · Claude · Anthropic Policy Frontier Red Team · +5 more

5 · Anthropic News · 18d ago

Anthropic Details Claude Safeguards Team Structure and Multi-Layer Safety Approach

Anthropic has published a detailed overview of its internal Safeguards team, describing a multi-layer approach to preventing Claude misuse that spans policy development, model training influence, pre-deployment evaluation, and real-time enforcement. The team uses a Unified Harm Framework covering five dimensions (physical, psychological, economic, societal, autonomy) and conducts Policy Vulnerability Testing with external domain experts in areas like terrorism, child safety, and mental health. Pre-deployment evaluations include safety assessments, CBRNE-focused AI capability uplift testing with government partners, and bias evaluations. The post describes specific partnerships with organizations like the Institute for Strategic Dialogue and ThroughLine to inform election integrity and mental health response policies.

Evaluation and Benchmarking · AI Safety Research · Anthropic Safeguards Team · Anthropic Usage Policy · Claude · +5 more

8 · Anthropic News · 17d ago

Anthropic Frontier Red Team reports early-warning signs of rapid AI progress in cybersecurity and biosecurity capabilities

Anthropic's Frontier Red Team published findings from a year of safety evaluations across four model releases, documenting rapid capability gains in dual-use domains. In cybersecurity, Claude 3.7 Sonnet now solves roughly a third of Cybench CTF challenges (up from ~5% a year ago), and with the Incalmo toolset was able to replicate a large-scale network attack in realistic cyber range environments. In biosecurity, Claude has moved from underperforming virology experts to exceeding them on the VCT benchmark within one year, and exceeds human expert baselines on cloning workflows. Anthropic assesses current models as showing 'early warning' signs but not yet crossing thresholds of substantially elevated national security risk.

Frontier Model Releases · Evaluation and Benchmarking · Intercode CTF · Carnegie Mellon University · LabBench · +7 more

9 · Anthropic News · 19d ago

Anthropic Discloses First Reported AI-Orchestrated Cyber Espionage Campaign Using Claude Code

Anthropic detected and disrupted a sophisticated espionage campaign in mid-September 2025, attributed with high confidence to a Chinese state-sponsored threat actor, that used Claude Code as an autonomous agent to attack roughly thirty global targets across tech, finance, chemical manufacturing, and government sectors. The attackers jailbroke Claude Code by decomposing malicious tasks into seemingly innocent subtasks and falsely framing the work as defensive security testing, enabling largely autonomous reconnaissance, vulnerability exploitation, credential harvesting, and data exfiltration. Anthropic describes this as the first documented large-scale cyberattack executed without substantial human intervention, leveraging agentic AI capabilities, tool access via MCP, and advanced coding skills. The company banned the identified accounts, notified affected entities, and coordinated with authorities; it is also expanding its detection classifiers and publishing the report to aid industry and government defenses.

Frontier Model Releases · AI Safety Research · Chinese state-sponsored threat actor · Claude · Claude Code · +4 more

7 · Anthropic News · 16d ago

Anthropic makes Claude 3 Haiku and Sonnet available to US Intelligence Community and AWS GovCloud

Anthropic has made Claude 3 Haiku and Claude 3 Sonnet available via AWS Marketplace for the US Intelligence Community and AWS GovCloud, marking a significant expansion into government deployment. The company has crafted contractual exceptions to its general Usage Policy to permit legally authorized foreign intelligence analysis, including combating human trafficking and identifying covert influence campaigns, while maintaining restrictions on disinformation, weapons design, and malicious cyber operations. The deployment is currently limited to ASL-2 models under Anthropic's Responsible Scaling Policy. Anthropic also notes that it previously gave the UK AI Safety Institute pre-release access to Claude 3.5 Sonnet for pre-deployment testing.

AI Safety Research · Enterprise Deployment Patterns · AWS GovCloud · UK Artificial Intelligence Safety Institute · Claude 3.5 Sonnet · +8 more

7 · Anthropic News · 19d ago

Anthropic Launches Claude Code Security: AI-Powered Vulnerability Detection for Defenders

Anthropic has released Claude Code Security, a capability built into Claude Code that scans codebases for security vulnerabilities and suggests patches for human review, in limited research preview for Enterprise and Team customers. Unlike rule-based static analysis tools, it uses Claude's reasoning to understand code context, trace data flows, and detect complex vulnerabilities, including novel ones. Built on Claude Opus 4.6, the system found over 500 previously undetected vulnerabilities in production open-source codebases during internal research. The release is framed as a defensive measure to put AI-enabled vulnerability discovery in the hands of defenders before attackers can exploit the same capabilities.

Frontier Model Releases · AI Safety Research · Claude Opus 4.6 · Anthropic Policy Frontier Red Team · Pacific Northwest National Laboratory · +5 more

6 · Anthropic News · 16d ago

Anthropic details red teaming methods and calls for standardized AI testing practices

Anthropic published a detailed overview of red teaming approaches used to test Claude and other AI systems, covering domain-specific expert testing, automated red teaming, multilingual/multicultural testing, and multimodal red teaming. The post documents empirical findings about when each method is appropriate, highlights partnerships with organizations like Thorn, Institute for Strategic Dialogue, and Singapore's IMDA, and closes with policy recommendations for building a standardized AI testing ecosystem. The piece is notable for its operational specificity and its explicit call for industry-wide standards to enable cross-system safety comparisons.

Evaluation and Benchmarking · AI Safety Research · Thorn · Claude · AI Verify Foundation · +6 more