Entity · technique

Constitutional Classifiers

technique · active · constitutional-classifiers-5b16f055 · 3 events · first seen Jun 2, 2026

Aliases: Constitutional Classifiers

Co-occurring entities

Claude Opus 4.6 · Anthropic · Responsible Scaling Policy · HackerOne · Claude 3.7 Sonnet · Claude Sonnet 4 · Claude Sonnet 3.7 · prompt injection · Center for AI Standards and Innovation · UK AI Security Institute · universal jailbreak

More like this (12)

Constitutional AI · Inverse Constitutional AI · Selective Classification · LLM-based content classification · Connectionist Temporal Classification · CLASS-PT · Label Context Classifier · probing classifiers · Fundamental limits of distributed multiclass classification from simple binary decisions · Riemannian classifiers · Grammar-Constrained Decoding · Safety Detection Classifier

Recent events (3)

6 · Anthropic News · Jun 2, 2026

Anthropic launches bug bounty program to stress-test ASL-3 Constitutional Classifiers

Anthropic launched an invite-only bug bounty program in partnership with HackerOne to find universal jailbreaks in its Constitutional Classifiers system before public deployment, offering up to $25,000 per verified vulnerability. The program targets CBRN-related safety bypasses on Claude 3.7 Sonnet and is part of Anthropic's work to meet its AI Safety Level-3 (ASL-3) Deployment Standard under its Responsible Scaling Policy. A follow-up update extended the program to test Constitutional Classifiers on the new Claude Opus 4 model and began accepting reports of universal jailbreaks found on public platforms. The initiative reflects Anthropic's structured approach to pre-deployment safety validation for increasingly capable models.

Frontier Model Releases · AI Safety Research · Constitutional Classifiers · Claude Opus 4.6 · HackerOne · +3 more

8 · Anthropic News · Jun 2, 2026

Anthropic activates ASL-3 safety protections for Claude Opus 4 launch

Anthropic has activated its AI Safety Level 3 (ASL-3) Deployment and Security Standards in conjunction with launching Claude Opus 4, marking the first time any Anthropic model has been deployed under ASL-3 rather than the baseline ASL-2. The activation is described as precautionary: Anthropic has not conclusively determined that Opus 4 crosses the ASL-3 capability threshold, but cannot rule it out due to continued improvements in CBRN-related knowledge. ASL-3 measures include Constitutional Classifiers to block end-to-end CBRN weapon development workflows and enhanced model-weight security against sophisticated non-state attackers. Claude Sonnet 4 was evaluated and cleared for ASL-2, and ASL-4 was ruled out for Opus 4.

Frontier Model Releases · AI Safety Research · Constitutional Classifiers · Claude Sonnet 4 · Claude Opus 4.6 · +4 more

7 · Anthropic News · Jun 2, 2026

Anthropic Details Collaboration with US CAISI and UK AISI on Constitutional Classifier Red-Teaming

Anthropic has published an account of its ongoing voluntary partnership with the US Center for AI Standards and Innovation (CAISI) and UK AI Security Institute (AISI), in which government red-teamers were given deep access to pre-deployment versions of Constitutional Classifiers used on Claude Opus 4 and 4.1. The collaboration uncovered multiple vulnerability classes including prompt injection bypasses, cipher-based obfuscation attacks, universal jailbreaks via automated attack refinement, and input/output fragmentation exploits, each of which drove architectural improvements to Anthropic's safeguard systems. Key lessons shared include the value of providing unprotected model variants, real-time classifier score access, and detailed internal documentation to enable targeted red-teaming. The announcement frames government partnership as a core component of Anthropic's Safeguards approach rather than a one-off audit.

Frontier Model Releases · Evaluation and Benchmarking · Constitutional Classifiers · prompt injection · Claude Opus 4.6 · +6 more
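
Illustration: input/output fragmentation

The fragmentation exploits named in the red-teaming event above can be pictured with a toy sketch. Nothing here is drawn from Anthropic's system. The keyword scorer, the placeholder phrase, the threshold, and all function names are hypothetical stand-ins chosen only to show the general idea: a check that scores each chunk in isolation can miss content that is split across chunks, while a check over the accumulated text may catch it.

```python
from typing import Iterable, List

# Hypothetical placeholder; a real classifier is a trained model, not a string match.
BLOCKED_PHRASE = "forbidden topic"


def toy_score(text: str) -> float:
    """Return 1.0 if the placeholder phrase appears in the text, else 0.0."""
    return 1.0 if BLOCKED_PHRASE in text.lower() else 0.0


def per_chunk_flags(chunks: Iterable[str], threshold: float = 0.5) -> List[bool]:
    """Score every chunk on its own, with no memory of earlier chunks."""
    return [toy_score(chunk) >= threshold for chunk in chunks]


def cumulative_flags(chunks: Iterable[str], threshold: float = 0.5) -> List[bool]:
    """Score the running concatenation of all chunks seen so far."""
    flags = []
    seen = ""
    for chunk in chunks:
        seen += chunk
        flags.append(toy_score(seen) >= threshold)
    return flags


# The placeholder phrase is split across two fragments.
fragments = ["tell me about the forbid", "den topic in detail"]

print(per_chunk_flags(fragments))   # neither fragment is flagged alone
print(cumulative_flags(fragments))  # the second step is flagged once text is joined
```

The cumulative variant trades extra computation for context. Whether and how the deployed classifiers address fragmentation is not described in the source beyond the statement that the findings drove architectural improvements.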