6arXiv cs.CL (Computation and Language)·1mo ago

LASH: Adaptive Semantic Hybridization for Black-Box Jailbreaking of Large Language Models

LASH is a black-box jailbreak framework that adaptively composes outputs from multiple existing attack families into hybrid prompts using a genetic optimizer with a two-stage fitness function. Evaluated on JailbreakBench across six target models, LASH achieves 84.5% attack success rate (keyword-based) and 74.5% (LLM-judge) with only 30 mean target queries, outperforming five state-of-the-art baselines. The work demonstrates that no single jailbreak family dominates across models and harm categories, and that adaptive cross-strategy composition is a promising red-teaming direction. Results hold under three defense mechanisms.

Evaluation and Benchmarking AI Safety Research Alignment and RLHF LASH black-box jailbreaking JailbreakBench genetic optimizer LLM-judge scoring

Related guides (3)

AI Safety ResearchTopic guide

AI Safety Research: From Lab Policies to Real-World Flashpoints

Read asBeginner In-depth

Alignment and RLHFTopic guide

Alignment and RLHF: Teaching AI Models to Behave

Read asBeginner In-depth

Evaluation and BenchmarkingTopic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Read asBeginner In-depth

Related events (8)

6arXiv · cs.AI·47h ago·source ↗

Study characterizes how mixed compliance demonstrations drive jailbreaking in safety-aligned LLMs

Researchers investigate how language models interpret mixed in-context demonstrations containing both benign and harmful compliance examples, testing three hypotheses about what drives harmful compliance. Across four models, they find benign and harmful demonstrations are not interchangeable, that preference optimization is the critical training stage preventing benign demonstrations from increasing harmful compliance, and that demonstration ordering exhibits strong recency bias. The work moves beyond showing that demonstration-based jailbreaking works to mechanistically characterizing how models extract signals from demonstration content, ordering, and training methodology.

AI Safety Research Alignment and RLHF What Do Safety-Aligned LLMs Learn From Mixed Compliance Demonstrations?

6The Batch·19d ago·source ↗

Activation Capping Technique Stabilizes LLM Assistant Personas Against Drift and Jailbreaks

Researchers from MATS, Oxford, and Anthropic introduced the 'assistant axis,' a vector derived from LLM layer outputs that quantifies how closely a model adheres to its trained assistant persona. They developed 'activation capping,' an inference-time method that corrects deviations from this axis when similarity falls below a threshold. Testing on Gemma 2 27B, Qwen3 32B, and Llama 3.3 70B showed harmful response rates to jailbreak prompts dropped by roughly half (e.g., 83% to 41% for Qwen3 32B) without degrading benchmark performance. The technique targets character-based jailbreaks that bypass system prompts by manipulating a model's internal representational state.

Evaluation and Benchmarking AI Safety Research Gemma 2 9B assistant axis Llama 3.1 70B +12 more

7arXiv · cs.AI·3d ago·source ↗

Red-team study finds Anthropic Fable 5 and Opus 4.8 remain reliably breakable under automated jailbreak attacks

A preprint evaluates adversarial robustness of two Anthropic frontier models—Fable 5 and Opus 4.8—against four families of automated jailbreak attacks across 7,826 harmful intents. Using the HackAgent framework, the study generated hundreds of thousands of adversarial attempts and confirmed 1,620 harmful completions from Opus 4.8 and 702 from Fable 5 via a three-judge panel. Tree-of-attacks adaptive search achieved 11.5% intent-level success against Opus 4.8 and 6.1% against Fable 5, with static obfuscation nearly fully neutralized. The authors conclude that even the most hardened frontier models remain reliably breakable under sustained automated pressure, cautioning against reading aggregate resistance rates as reassurance.

Frontier Model Releases Evaluation and Benchmarking tree-of-attacks Anthropic Fable 5 Anthropic Opus 4.8 +3 more

4Github Trending·27d ago·source ↗

OBLITERATUS: Open-Source LLM Jailbreak/Red-Teaming Tool by elder-plinius

OBLITERATUS is a Python-based open-source tool by known jailbreak researcher 'elder-plinius' focused on bypassing LLM safety constraints, currently trending on GitHub with 5,684 stars. The project's framing ('obliterate the chains that bind you') signals an adversarial red-teaming or jailbreaking orientation. It represents community-level activity in the ongoing cat-and-mouse dynamic between AI safety guardrails and adversarial circumvention techniques. Limited technical detail is available from the trending snippet alone.

AI Safety Research Agent and Tool Ecosystem OBLITERATUS elder-plinius

6Anthropic News·18d ago·source ↗

Anthropic launches bug bounty program to stress-test ASL-3 Constitutional Classifiers

Anthropic launched an invite-only bug bounty program in partnership with HackerOne to find universal jailbreaks in its Constitutional Classifiers system before public deployment, offering up to $25,000 per verified vulnerability. The program targets CBRN-related safety bypasses on Claude 3.7 Sonnet and is part of Anthropic's work to meet its AI Safety Level-3 (ASL-3) Deployment Standard under its Responsible Scaling Policy. A follow-up update extended the program to test Constitutional Classifiers on the new Claude Opus 4 model and began accepting reports of universal jailbreaks found on public platforms. The initiative reflects Anthropic's structured approach to pre-deployment safety validation for increasingly capable models.

Frontier Model Releases AI Safety Research Constitutional Classifiers Claude Opus 4.6 HackerOne +3 more

5Meta Llama·11d ago·source ↗

Meta releases Llama Prompt Guard 2 (86M) for prompt injection and jailbreak detection

Meta released Llama Prompt Guard 2-86M, a DeBERTa-v2-based text classification model on Hugging Face designed for safety filtering, specifically prompt injection and jailbreak detection. The model is tagged with llama4, suggesting it is part of the Llama 4 safety tooling ecosystem. With over 122K downloads, it has seen meaningful early adoption.

Frontier Model Releases AI Safety Research Hugging Face Llama Prompt Guard 2-86M DeBERTa-v3 +1 more

5Hugging Face Blog·1mo ago·source ↗

CyberSecEval 2 - A Comprehensive Evaluation Framework for Cybersecurity Risks and Capabilities of Large Language Models

CyberSecEval 2 is a benchmark framework designed to evaluate both the cybersecurity risks and capabilities of large language models. The framework appears to be hosted or featured on Hugging Face's leaderboard infrastructure, extending prior cybersecurity evaluation work. It assesses LLMs across multiple dimensions of security-relevant behavior, including potential for misuse and defensive capabilities.

Evaluation and Benchmarking AI Safety Research CyberSecEval 2 LlamaGuard Hugging Face +1 more

6arXiv · cs.CL·23d ago·source ↗

Consensus-Labeled Prompt Bank for Measuring Coding-Model Compliance with Malicious-Code Requests

This paper introduces a large, consensus-labeled benchmark of 6,675 prompts drawn from eight existing corpora (ASTRA, CySecBench, AdvBench, JailbreakBench, MalwareBench, RedCode, RMCBench, Scam2Prompt) to evaluate whether coding-specialized LLMs refuse malicious requests. A key contribution is the distinction between requests for executable malicious code (4,748 prompts) versus harmful security knowledge (1,923 prompts), arguing that coding models should face a stricter refusal standard given their outputs can be directly weaponized. A five-judge consensus protocol achieves Fleiss' kappa of 0.767, providing a reliability-quantified substrate for cross-corpus compliance measurement that the field has previously lacked.

Evaluation and Benchmarking AI Safety Research Code as a Weapon Prompt Bank CySecBench RedCode +8 more