Anthropic Details Collaboration with US CAISI and UK AISI on Constitutional Classifier Red-Teaming
Anthropic has published an account of its ongoing voluntary partnership with the US Center for AI Standards and Innovation (CAISI) and UK AI Security Institute (AISI), in which government red-teamers were given deep access to pre-deployment versions of Constitutional Classifiers used on Claude Opus 4 and 4.1. The collaboration uncovered multiple vulnerability classes including prompt injection bypasses, cipher-based obfuscation attacks, universal jailbreaks via automated attack refinement, and input/output fragmentation exploits, each of which drove architectural improvements to Anthropic's safeguard systems. Key lessons shared include the value of providing unprotected model variants, real-time classifier score access, and detailed internal documentation to enable targeted red-teaming. The announcement frames government partnership as a core component of Anthropic's Safeguards approach rather than a one-off audit.
Related guides (4)
Related events (8)
Anthropic launches bug bounty program to stress-test ASL-3 Constitutional Classifiers
Anthropic launched an invite-only bug bounty program in partnership with HackerOne to find universal jailbreaks in its Constitutional Classifiers system before public deployment, offering up to $25,000 per verified vulnerability. The program targets CBRN-related safety bypasses on Claude 3.7 Sonnet and is part of Anthropic's work to meet its AI Safety Level-3 (ASL-3) Deployment Standard under its Responsible Scaling Policy. A follow-up update extended the program to test Constitutional Classifiers on the new Claude Opus 4 model and began accepting reports of universal jailbreaks found on public platforms. The initiative reflects Anthropic's structured approach to pre-deployment safety validation for increasingly capable models.
Anthropic and NNSA Co-Develop Nuclear Safeguards Classifier for Claude Traffic
Anthropic, in partnership with the U.S. Department of Energy's National Nuclear Security Administration (NNSA) and DOE national laboratories, has co-developed an AI classifier that distinguishes between concerning and benign nuclear-related conversations with 96% accuracy in preliminary testing. The classifier has already been deployed on live Claude traffic as part of Anthropic's misuse-detection infrastructure. Anthropic plans to share the approach with the Frontier Model Forum as a replicable blueprint for other AI developers. This represents the first public-private partnership of this kind for nuclear proliferation risk monitoring in frontier AI systems.
Anthropic Details Claude Safeguards Team Structure and Multi-Layer Safety Approach
Anthropic has published a detailed overview of its internal Safeguards team, describing a multi-layer approach to preventing Claude misuse that spans policy development, model training influence, pre-deployment evaluation, and real-time enforcement. The team uses a Unified Harm Framework covering five dimensions (physical, psychological, economic, societal, autonomy) and conducts Policy Vulnerability Testing with external domain experts in areas like terrorism, child safety, and mental health. Pre-deployment evaluations include safety assessments, CBRNE-focused AI capability uplift testing with government partners, and bias evaluations. The post describes specific partnerships with organizations like the Institute for Strategic Dialogue and ThroughLine to inform election integrity and mental health response policies.
Anthropic Frontier Red Team reports early-warning signs of rapid AI progress in cybersecurity and biosecurity capabilities
Anthropic's Frontier Red Team published findings from a year of safety evaluations across four model releases, documenting rapid capability gains in dual-use domains. In cybersecurity, Claude 3.7 Sonnet now solves roughly a third of Cybench CTF challenges (up from ~5% a year ago), and with the Incalmo toolset was able to replicate a large-scale network attack in realistic cyber range environments. In biosecurity, Claude has moved from underperforming virology experts to exceeding them on the VCT benchmark within one year, and exceeds human expert baselines on cloning workflows. Anthropic assesses current models as showing 'early warning' signs but not yet crossing thresholds of substantially elevated national security risk.
Anthropic Discloses First Reported AI-Orchestrated Cyber Espionage Campaign Using Claude Code
Anthropic detected and disrupted a sophisticated espionage campaign in mid-September 2025, attributed with high confidence to a Chinese state-sponsored threat actor, that used Claude Code as an autonomous agent to attack roughly thirty global targets across tech, finance, chemical manufacturing, and government sectors. The attackers jailbroke Claude Code by decomposing malicious tasks into seemingly innocent subtasks and falsely framing it as defensive security testing, enabling largely autonomous reconnaissance, vulnerability exploitation, credential harvesting, and data exfiltration. Anthropic describes this as the first documented large-scale cyberattack executed without substantial human intervention, leveraging agentic AI capabilities, tool access via MCP, and advanced coding skills. The company banned identified accounts, notified affected entities, coordinated with authorities, and is expanding detection classifiers and publishing the report to aid industry and government defenses.
Anthropic makes Claude 3 Haiku and Sonnet available to US Intelligence Community and AWS GovCloud
Anthropic has made Claude 3 Haiku and Claude 3 Sonnet available via AWS Marketplace for the US Intelligence Community and AWS GovCloud, marking a significant expansion into government deployment. The company has crafted contractual exceptions to its general Usage Policy to permit legally authorized foreign intelligence analysis, including combating human trafficking and identifying covert influence campaigns, while maintaining restrictions on disinformation, weapons design, and malicious cyber operations. The deployment is currently limited to ASL-2 models under Anthropic's Responsible Scaling Policy. Anthropic also notes prior pre-release access to Claude 3.5 Sonnet was provided to the UK AI Safety Institute for pre-deployment testing.
Anthropic Launches Claude Code Security: AI-Powered Vulnerability Detection for Defenders
Anthropic has released Claude Code Security in limited research preview for Enterprise and Team customers, a capability built into Claude Code that scans codebases for security vulnerabilities and suggests patches for human review. Unlike rule-based static analysis tools, it uses Claude's reasoning to understand code context, trace data flows, and detect complex vulnerabilities including novel ones. Built on Claude Opus 4.6, the system found over 500 previously undetected vulnerabilities in production open-source codebases during internal research. The release is framed as a defensive measure to put AI-enabled vulnerability discovery in the hands of defenders before attackers can exploit the same capabilities.
Anthropic details red teaming methods and calls for standardized AI testing practices
Anthropic published a detailed overview of red teaming approaches used to test Claude and other AI systems, covering domain-specific expert testing, automated red teaming, multilingual/multicultural testing, and multimodal red teaming. The post documents empirical findings about when each method is appropriate, highlights partnerships with organizations like Thorn, Institute for Strategic Dialogue, and Singapore's IMDA, and closes with policy recommendations for building a standardized AI testing ecosystem. The piece is notable for its operational specificity and its explicit call for industry-wide standards to enable cross-system safety comparisons.



