Entity · benchmark

JailbreakBench

benchmarkactivejailbreakbench-fbe22b8d·3 events·first seen May 21, 2026

Aliases: JailbreakBench

Co-occurring entities

More like this (12)

Jailbreak Terminal-Bench WildBench PaperBench LiveCodeBench PortBench BigCodeBench SorryBench ChipBench IT-Bench FilBench PinchBench

Recent events (3)

5arXiv · cs.CL·Jun 25, 2026·source ↗

Systematic comparison of encoder vs. decoder safety judges for LLM adversarial evaluation

A new arXiv preprint evaluates whether fine-tuned encoder classifiers from the ModernBERT family (ModernBERT and Ettin) can replace LLM-based safety judges for detecting harmful outputs in user-model conversations. The study benchmarks encoders against rule-based methods, fine-tuned LLM classifiers, and LLM judges including LlamaGuard 3/4, ShieldGemma, StrongReject, and Claude-as-a-judge across multiple adversarial attack types. Results are reported on F1, false negative rate, and precision-recall, with breakdowns by attack technique, providing practical guidance on cost-latency tradeoffs for production safety pipelines.

Evaluation and Benchmarking Inference Economics ModernBERT AILuminate LlamaGuard +6 more

6arXiv · cs.CL·May 28, 2026·source ↗

Consensus-Labeled Prompt Bank for Measuring Coding-Model Compliance with Malicious-Code Requests

This paper introduces a large, consensus-labeled benchmark of 6,675 prompts drawn from eight existing corpora (ASTRA, CySecBench, AdvBench, JailbreakBench, MalwareBench, RedCode, RMCBench, Scam2Prompt) to evaluate whether coding-specialized LLMs refuse malicious requests. A key contribution is the distinction between requests for executable malicious code (4,748 prompts) versus harmful security knowledge (1,923 prompts), arguing that coding models should face a stricter refusal standard given their outputs can be directly weaponized. A five-judge consensus protocol achieves Fleiss' kappa of 0.767, providing a reliability-quantified substrate for cross-corpus compliance measurement that the field has previously lacked.

Evaluation and Benchmarking AI Safety Research Code as a Weapon Prompt Bank CySecBench RedCode +8 more

6arXiv · cs.CL·May 21, 2026·source ↗

LASH: Adaptive Semantic Hybridization for Black-Box Jailbreaking of Large Language Models

LASH is a black-box jailbreak framework that adaptively composes outputs from multiple existing attack families into hybrid prompts using a genetic optimizer with a two-stage fitness function. Evaluated on JailbreakBench across six target models, LASH achieves 84.5% attack success rate (keyword-based) and 74.5% (LLM-judge) with only 30 mean target queries, outperforming five state-of-the-art baselines. The work demonstrates that no single jailbreak family dominates across models and harm categories, and that adaptive cross-strategy composition is a promising red-teaming direction. Results hold under three defense mechanisms.

Evaluation and Benchmarking AI Safety Research LASH black-box jailbreaking JailbreakBench +3 more