LASH: Adaptive Semantic Hybridization for Black-Box Jailbreaking of Large Language Models
LASH is a black-box jailbreak framework that adaptively composes outputs from multiple existing attack families into hybrid prompts using a genetic optimizer with a two-stage fitness function. Evaluated on JailbreakBench across six target models, LASH achieves 84.5% attack success rate (keyword-based) and 74.5% (LLM-judge) with only 30 mean target queries, outperforming five state-of-the-art baselines. The work demonstrates that no single jailbreak family dominates across models and harm categories, and that adaptive cross-strategy composition is a promising red-teaming direction. Results hold under three defense mechanisms.
Related guides (3)
Related events (8)
Study characterizes how mixed compliance demonstrations drive jailbreaking in safety-aligned LLMs
Researchers investigate how language models interpret mixed in-context demonstrations containing both benign and harmful compliance examples, testing three hypotheses about what drives harmful compliance. Across four models, they find benign and harmful demonstrations are not interchangeable, that preference optimization is the critical training stage preventing benign demonstrations from increasing harmful compliance, and that demonstration ordering exhibits strong recency bias. The work moves beyond showing that demonstration-based jailbreaking works to mechanistically characterizing how models extract signals from demonstration content, ordering, and training methodology.
Activation Capping Technique Stabilizes LLM Assistant Personas Against Drift and Jailbreaks
Researchers from MATS, Oxford, and Anthropic introduced the 'assistant axis,' a vector derived from LLM layer outputs that quantifies how closely a model adheres to its trained assistant persona. They developed 'activation capping,' an inference-time method that corrects deviations from this axis when similarity falls below a threshold. Testing on Gemma 2 27B, Qwen3 32B, and Llama 3.3 70B showed harmful response rates to jailbreak prompts dropped by roughly half (e.g., 83% to 41% for Qwen3 32B) without degrading benchmark performance. The technique targets character-based jailbreaks that bypass system prompts by manipulating a model's internal representational state.
Red-team study finds Anthropic Fable 5 and Opus 4.8 remain reliably breakable under automated jailbreak attacks
A preprint evaluates adversarial robustness of two Anthropic frontier models—Fable 5 and Opus 4.8—against four families of automated jailbreak attacks across 7,826 harmful intents. Using the HackAgent framework, the study generated hundreds of thousands of adversarial attempts and confirmed 1,620 harmful completions from Opus 4.8 and 702 from Fable 5 via a three-judge panel. Tree-of-attacks adaptive search achieved 11.5% intent-level success against Opus 4.8 and 6.1% against Fable 5, with static obfuscation nearly fully neutralized. The authors conclude that even the most hardened frontier models remain reliably breakable under sustained automated pressure, cautioning against reading aggregate resistance rates as reassurance.
OBLITERATUS: Open-Source LLM Jailbreak/Red-Teaming Tool by elder-plinius
OBLITERATUS is a Python-based open-source tool by known jailbreak researcher 'elder-plinius' focused on bypassing LLM safety constraints, currently trending on GitHub with 5,684 stars. The project's framing ('obliterate the chains that bind you') signals an adversarial red-teaming or jailbreaking orientation. It represents community-level activity in the ongoing cat-and-mouse dynamic between AI safety guardrails and adversarial circumvention techniques. Limited technical detail is available from the trending snippet alone.
Anthropic launches bug bounty program to stress-test ASL-3 Constitutional Classifiers
Anthropic launched an invite-only bug bounty program in partnership with HackerOne to find universal jailbreaks in its Constitutional Classifiers system before public deployment, offering up to $25,000 per verified vulnerability. The program targets CBRN-related safety bypasses on Claude 3.7 Sonnet and is part of Anthropic's work to meet its AI Safety Level-3 (ASL-3) Deployment Standard under its Responsible Scaling Policy. A follow-up update extended the program to test Constitutional Classifiers on the new Claude Opus 4 model and began accepting reports of universal jailbreaks found on public platforms. The initiative reflects Anthropic's structured approach to pre-deployment safety validation for increasingly capable models.
Meta releases Llama Prompt Guard 2 (86M) for prompt injection and jailbreak detection
Meta released Llama Prompt Guard 2-86M, a DeBERTa-v2-based text classification model on Hugging Face designed for safety filtering, specifically prompt injection and jailbreak detection. The model is tagged with llama4, suggesting it is part of the Llama 4 safety tooling ecosystem. With over 122K downloads, it has seen meaningful early adoption.
CyberSecEval 2 - A Comprehensive Evaluation Framework for Cybersecurity Risks and Capabilities of Large Language Models
CyberSecEval 2 is a benchmark framework designed to evaluate both the cybersecurity risks and capabilities of large language models. The framework appears to be hosted or featured on Hugging Face's leaderboard infrastructure, extending prior cybersecurity evaluation work. It assesses LLMs across multiple dimensions of security-relevant behavior, including potential for misuse and defensive capabilities.
Consensus-Labeled Prompt Bank for Measuring Coding-Model Compliance with Malicious-Code Requests
This paper introduces a large, consensus-labeled benchmark of 6,675 prompts drawn from eight existing corpora (ASTRA, CySecBench, AdvBench, JailbreakBench, MalwareBench, RedCode, RMCBench, Scam2Prompt) to evaluate whether coding-specialized LLMs refuse malicious requests. A key contribution is the distinction between requests for executable malicious code (4,748 prompts) versus harmful security knowledge (1,923 prompts), arguing that coding models should face a stricter refusal standard given their outputs can be directly weaponized. A five-judge consensus protocol achieves Fleiss' kappa of 0.767, providing a reliability-quantified substrate for cross-corpus compliance measurement that the field has previously lacked.


