OBLITERATUS: Open-Source LLM Jailbreak/Red-Teaming Tool by elder-plinius
OBLITERATUS is a Python-based open-source tool by the jailbreak researcher 'elder-plinius' that focuses on bypassing LLM safety constraints. It is currently trending on GitHub with 5,684 stars. The project's framing ('obliterate the chains that bind you') signals an adversarial red-teaming or jailbreaking orientation. It is an example of community-level activity in the ongoing cat-and-mouse dynamic between AI safety guardrails and adversarial circumvention techniques. The trending snippet alone provides little technical detail.
Related events (8)
LASH: Adaptive Semantic Hybridization for Black-Box Jailbreaking of Large Language Models
LASH is a black-box jailbreak framework that adaptively composes outputs from multiple existing attack families into hybrid prompts using a genetic optimizer with a two-stage fitness function. Evaluated on JailbreakBench across six target models, LASH achieves 84.5% attack success rate (keyword-based) and 74.5% (LLM-judge) with only 30 mean target queries, outperforming five state-of-the-art baselines. The work demonstrates that no single jailbreak family dominates across models and harm categories, and that adaptive cross-strategy composition is a promising red-teaming direction. Results hold under three defense mechanisms.
Red-team study finds Anthropic Fable 5 and Opus 4.8 remain reliably breakable under automated jailbreak attacks
A preprint evaluates the adversarial robustness of two Anthropic frontier models, Fable 5 and Opus 4.8, against four families of automated jailbreak attacks across 7,826 harmful intents. Using the HackAgent framework, the study generated hundreds of thousands of adversarial attempts, and a three-judge panel confirmed 1,620 harmful completions from Opus 4.8 and 702 from Fable 5. Tree-of-attacks adaptive search achieved 11.5% intent-level success against Opus 4.8 and 6.1% against Fable 5, while static obfuscation was almost entirely neutralized. The authors conclude that even the most hardened frontier models remain reliably breakable under sustained automated pressure, and they caution against reading aggregate resistance rates as reassurance.
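The gap between per-attempt and intent-level figures can be illustrated with a small metric sketch. It assumes that "intent-level success" means an intent counts as broken once any single attempt against it is judged harmful; the summary does not spell out the study's exact definition, and the function names and toy log below are hypothetical.

```python
from collections import defaultdict


def intent_level_success_rate(attempts):
    """Fraction of intents with at least one attempt judged harmful.

    `attempts` is an iterable of (intent_id, harmful) pairs, one per
    adversarial attempt. Assumption: an intent is "broken" if any of
    its attempts succeeded.
    """
    broken = defaultdict(bool)
    for intent_id, harmful in attempts:
        broken[intent_id] = broken[intent_id] or harmful
    if not broken:
        return 0.0
    return sum(broken.values()) / len(broken)


def attempt_level_success_rate(attempts):
    """Fraction of individual attempts judged harmful."""
    attempts = list(attempts)
    if not attempts:
        return 0.0
    return sum(1 for _, harmful in attempts if harmful) / len(attempts)


# Hypothetical toy log, not data from the study: 4 intents, 10 attempts.
toy_log = [
    ("intent-a", False), ("intent-a", False), ("intent-a", True),
    ("intent-b", False), ("intent-b", False),
    ("intent-c", False), ("intent-c", False), ("intent-c", False),
    ("intent-d", True), ("intent-d", True),
]

intent_rate = intent_level_success_rate(toy_log)
attempt_rate = attempt_level_success_rate(toy_log)
print(f"intent-level: {intent_rate:.2f}, attempt-level: {attempt_rate:.2f}")
```

In this toy log most individual attempts fail, yet half of the intents are broken, which is why a high aggregate resistance rate can coexist with a meaningful intent-level success rate.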
AprielGuard: A Guardrail for Safety and Adversarial Robustness in Modern LLM Systems
ServiceNow AI has released AprielGuard, a guardrail system designed to improve safety and adversarial robustness in LLM deployments. The system targets prompt injection, jailbreaks, and other adversarial inputs that bypass standard safety measures. It is presented as a component for enterprise LLM pipelines seeking more robust content moderation and safety filtering.
Google Study Shows LLM-Generated Malware Is Getting Harder to Track and Stop
A Google security report catalogs emerging LLM-enabled cyberattack techniques including morphing malware with mutation engines, logical-flaw discovery in code, and AI-directed obfuscation networks. The report was prompted in part by a real incident where hackers used an LLM to find a zero-day in a widely used web administration tool. Separately, the UK AI Security Institute found that Claude Mythos Preview and GPT-5.5 can reliably execute attacks expected to take humans 3 hours, up from earlier 1-hour benchmarks, with performance scaling further when token limits are relaxed. The findings suggest an accelerating gap between LLM offensive capability and conventional defensive tooling.
Practitioner spends $1,500 testing LLM offensive security capabilities against a purpose-built vulnerable app
A developer built a deliberately vulnerable application and ran LLMs against it as automated penetration testers, spending $1,500 on API costs across the experiment. The post evaluates how well current LLMs can identify and exploit real vulnerabilities in a controlled setting. Results provide practical signal on the current state of LLM-assisted offensive security, a capability area with both red-team and safety implications.
Anthropic launches bug bounty program to stress-test ASL-3 Constitutional Classifiers
Anthropic launched an invite-only bug bounty program in partnership with HackerOne to find universal jailbreaks in its Constitutional Classifiers system before public deployment, offering up to $25,000 per verified vulnerability. The program targets CBRN-related safety bypasses on Claude 3.7 Sonnet and is part of Anthropic's work to meet its AI Safety Level-3 (ASL-3) Deployment Standard under its Responsible Scaling Policy. A follow-up update extended the program to test Constitutional Classifiers on the new Claude Opus 4 model and began accepting reports of universal jailbreaks found on public platforms. The initiative reflects Anthropic's structured approach to pre-deployment safety validation for increasingly capable models.
Introducing the Red-Teaming Resistance Leaderboard
Hugging Face and Haize Labs have launched a Red-Teaming Resistance Leaderboard to systematically benchmark how well AI models resist adversarial prompting and jailbreak attempts. The leaderboard provides a standardized evaluation framework for comparing model robustness against red-teaming attacks. This fills a gap in the evaluation ecosystem where safety and adversarial robustness metrics have been less formalized than capability benchmarks.
Activation Capping Technique Stabilizes LLM Assistant Personas Against Drift and Jailbreaks
Researchers from MATS, Oxford, and Anthropic introduced the 'assistant axis,' a vector derived from LLM layer outputs that quantifies how closely a model adheres to its trained assistant persona. They developed 'activation capping,' an inference-time method that corrects deviations from this axis when similarity falls below a threshold. Testing on Gemma 2 27B, Qwen3 32B, and Llama 3.3 70B showed harmful response rates to jailbreak prompts dropped by roughly half (e.g., 83% to 41% for Qwen3 32B) without degrading benchmark performance. The technique targets character-based jailbreaks that bypass system prompts by manipulating a model's internal representational state.
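A minimal sketch of the capping step, under stated assumptions: similarity is measured as the scalar projection of a hidden-state vector onto a unit-length axis, and the correction adds just enough of the axis direction to bring the projection back up to the threshold. The function name, the plain-list representation, and the exact correction rule are illustrative assumptions, not the researchers' implementation.

```python
import math


def cap_activation(hidden, axis, threshold):
    """Return `hidden` with its projection onto `axis` raised to `threshold`.

    Assumption: the "assistant axis" is a direction in activation space and
    adherence is measured by the projection of the hidden state onto it.
    Components orthogonal to the axis are left unchanged.
    """
    norm = math.sqrt(sum(a * a for a in axis))
    if norm == 0.0:
        raise ValueError("axis must be a non-zero vector")
    unit = [a / norm for a in axis]

    projection = sum(h * u for h, u in zip(hidden, unit))
    if projection >= threshold:
        # Already close enough to the persona direction: no correction.
        return list(hidden)

    shortfall = threshold - projection
    return [h + shortfall * u for h, u in zip(hidden, unit)]


# Hypothetical 2-D example: the first coordinate is the axis direction.
axis = [1.0, 0.0]
drifted = cap_activation([0.2, 3.0], axis, threshold=1.0)
on_axis = cap_activation([2.0, 3.0], axis, threshold=1.0)
print(drifted, on_axis)
```

Because the correction only fires when the projection falls below the threshold, activations that already sit on the assistant side of the axis pass through untouched, which is consistent with the reported lack of benchmark degradation.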

