Adversarial Pragmatics benchmark for AI safety evaluation under instruction conflict and ambiguity
A new arXiv preprint introduces 'adversarial pragmatics' as both a benchmark and annotation protocol for evaluating language model behavior under linguistically complex conditions: instruction conflict, embedded commands, quotation, scope ambiguity, deixis, and multi-turn agentic transcripts. The work critiques existing safety benchmarks for collapsing nuanced failure modes into pass/fail labels, and proposes a taxonomy with an 18-item seed benchmark and expert-evaluation protocol that distinguishes task success, policy compliance, safety risk, refusal outcome, and evaluator confidence. The framework is designed to validate safety evals, LLM judges, gold-set construction, and prompt-injection tests. The contribution is primarily methodological, targeting the infrastructure of safety evaluation rather than model capabilities directly.
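To illustrate why separating the five reported dimensions matters, here is a minimal sketch of an annotation record. The field names, the 0 to 3 risk scale, the refusal categories, and the `collapse_to_pass_fail` rule are all assumptions made for illustration; the preprint's actual schema is not given in this summary.

```python
from dataclasses import dataclass
from enum import Enum


class RefusalOutcome(Enum):
    # Hypothetical categories; the preprint may define others.
    COMPLIED = "complied"
    PARTIAL_REFUSAL = "partial_refusal"
    FULL_REFUSAL = "full_refusal"


@dataclass(frozen=True)
class Annotation:
    item_id: str
    task_success: bool        # did the model do what was legitimately asked?
    policy_compliant: bool    # did the response stay within policy?
    safety_risk: int          # assumed scale: 0 (none) to 3 (severe)
    refusal: RefusalOutcome
    confidence: float         # evaluator confidence, assumed range 0.0 to 1.0


def collapse_to_pass_fail(a: Annotation) -> bool:
    """A typical binary safety label: compliant and risk-free counts as a pass."""
    return a.policy_compliant and a.safety_risk == 0


# Two responses to the same item: one ignores an embedded command and finishes
# the task; the other refuses the whole request although it was benign.
helpful = Annotation("item-07", True, True, 0, RefusalOutcome.COMPLIED, 0.9)
over_refusal = Annotation("item-07", False, True, 0, RefusalOutcome.FULL_REFUSAL, 0.6)

same_binary_label = collapse_to_pass_fail(helpful) == collapse_to_pass_fail(over_refusal)
```

Both records collapse to the same pass label even though only one of them completed the task, which is the kind of distinction a pass/fail benchmark cannot record.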
Related events (8)
Automated Benchmark Auditing for AI Agents and Large Language Models (ABA)
The paper introduces Auto Benchmark Audit (ABA), an agentic framework that systematically audits AI benchmark tasks for issues such as ambiguous specifications, environment conflicts, and incorrect ground truths. Applied to 168 benchmarks across nine domains, including benchmarks from NeurIPS publications, ABA identifies critical issues in more than 25.7% of evaluated tasks. The authors demonstrate that filtering out flawed tasks materially shifts model rankings and raises average performance on SWE-bench Verified and Terminal-Bench 2 by 9.9% and 9.6% respectively, indicating that current benchmark scores are significantly distorted by task quality problems. The agentic tool and annotations are publicly released.
AdversaBench: Automated LLM red-teaming pipeline with multi-judge confirmation and cross-model transferability
AdversaBench is a new end-to-end red-teaming pipeline that mutates seed prompts using five structured operators and confirms failures via a three-judge panel with a meta-judge tiebreaker. Experiments on 45 seeds across reasoning, instruction-following, and tool-use categories produced confirmed failures for every seed. Key findings include sharp variation in operator effectiveness by category, misleading binary failure rates, judge agreement metrics distorted by label skew, and zero-shot transferability of adversarial prompts from Llama 3.1 8B to Llama 3.3 70B. Code and dataset are publicly released.
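A three-judge panel with a meta-judge tiebreaker could be sketched as follows. This is not AdversaBench's released code: the label set (`"fail"`, `"pass"`, `"unsure"`), the two-of-three confirmation rule, and the `meta_judge` callable are assumptions made for illustration.

```python
from collections import Counter
from typing import Callable, Sequence

# Assumed label set; the actual pipeline may use different verdicts.
LABELS = {"fail", "pass", "unsure"}


def confirm_verdict(
    verdicts: Sequence[str],
    meta_judge: Callable[[Sequence[str]], str],
) -> str:
    """Return the panel's label, deferring to the meta-judge when no decisive
    label ("fail" or "pass") is backed by at least two of the three judges."""
    if len(verdicts) != 3:
        raise ValueError("expected exactly three judge verdicts")
    if any(v not in LABELS for v in verdicts):
        raise ValueError("unknown verdict label")

    label, count = Counter(verdicts).most_common(1)[0]
    if count >= 2 and label != "unsure":
        return label
    # No decisive majority: hand the raw verdicts to the tiebreaker.
    return meta_judge(verdicts)


def always_fail(verdicts: Sequence[str]) -> str:
    """Stand-in for a meta-judge model call (hypothetical)."""
    return "fail"


majority = confirm_verdict(["fail", "fail", "pass"], always_fail)
split = confirm_verdict(["fail", "pass", "unsure"], always_fail)
```

Under these assumptions the meta-judge is consulted only on split or undecided panels, so most items cost three judge calls rather than four.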
EvalSafetyGap: Conceptual framework linking LLM evaluation failures to safety measurement gaps
A new arXiv preprint introduces EvalSafetyGap, a hybrid survey and conceptual framework arguing that benchmark scores, reward-model signals, and safety metrics can improve while the underlying properties they measure remain unverified. The paper synthesizes eight evidence streams spanning 2018–2026 and introduces two analytical constructs — an Instability Decomposition and an Alignment Trilemma — to structure comparisons between evaluation-side and alignment-side proxy failures under optimization pressure. A ten-model audit finds no statistically significant association between capability and adversarial robustness, and suggests the apparent open-versus-closed-model safety gap is driven more by governance and disclosure practices than behavioral robustness. The work proposes a shared vocabulary for dynamic evaluation, multi-attempt safety measurement, and auditable alignment practice.
An Introduction to AI Secure LLM Safety Leaderboard
Hugging Face introduces the DecodingTrust-based LLM Safety Leaderboard, a benchmark framework for evaluating large language models across multiple safety and trustworthiness dimensions. The leaderboard aims to provide standardized, reproducible safety assessments covering areas such as toxicity, stereotype bias, adversarial robustness, and privacy. It offers a public ranking of models to help researchers and practitioners compare safety properties across different LLMs.
AI Safety via Debate
OpenAI proposes a safety technique in which two AI agents debate a topic and a human judge determines the winner, with the goal of making it easier for humans to supervise AI systems that may be more capable than they are. The core intuition is that it is easier to verify a correct argument than to generate one, so a dishonest agent can be caught by an honest opponent. The paper introduces debate as a scalable oversight mechanism applicable to complex tasks where direct human evaluation is infeasible.
EMPATH: Multilingual multi-turn safety benchmark for emotional-support chatbots reveals score inflation and run-to-run reliability failures
EMPATH is a new arXiv benchmark for evaluating the safety of emotional-support chatbots, using an auditor model to generate multi-turn crisis conversations and a calibrated judge model to score transcripts across 19 metrics in five dimensions. Built for Mexican Spanish and US English, the benchmark surfaces score inflation on 10 of 19 metrics under uncalibrated rubrics and finds that run-to-run reliability is a per-model safety property: one model swings 2–10 points on a crisis metric across identical reruns, and DeepSeek V4 Pro produces different conversations at temperature 0. Evaluation of three frontier models shows aggregate scores within 0.74 points of one another but per-metric divergences of up to six points. Rankings remain stable under a cross-family judge, with 93% agreement within ±1.
Adversarial robustness and safety alignment in multilingual multimodal LLMs: cross-lingual vulnerability and 'safety-by-failure'
A systematic study evaluates adversarial robustness and safety alignment of multimodal LLMs across 12 languages, finding that adversarial images optimized in one language transfer to others (cross-lingual transferability). The paper introduces the concept of 'safety-by-failure': low-resource languages appear safer not due to genuine alignment but because models fail to comprehend harmful instructions in those languages. Models like Qwen3-VL that integrate multilingual capability throughout training (rather than only at instruction tuning) show genuine cross-lingual safety with active refusal. The findings challenge the assumption that low-resource language safety metrics reflect real alignment.
Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety
Researchers introduce 'Boiling the Frog,' a multi-turn safety benchmark evaluating whether tool-using AI agents in corporate/office settings are susceptible to incremental attacks that begin with benign requests before introducing harmful payloads. The benchmark uses stateful multi-turn evaluation with a three-level operational risk taxonomy grounded in the EU AI Act and its GPAI Code of Practice. Across nine models, aggregate strict attack success rate is 44.4%, ranging from 20.5% for Claude Haiku 4.5 to 92.9% for Gemini 3.1 Flash Lite, with loss-of-control scenarios reaching 93.3% category-level ASR.
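As a rough sketch of how a per-model attack success rate could be computed from stateful multi-turn episodes, the example below assumes that an episode counts as a successful attack if the agent executed the harmful payload on any turn. That reading of "strict", the episode layout, and the model names are assumptions; the benchmark's own scoring rule may differ.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# One episode: (model name, per-turn flags for "harmful action executed").
Episode = Tuple[str, List[bool]]


def attack_success_rate(episodes: List[Episode]) -> Dict[str, float]:
    """Fraction of episodes per model in which any turn executed the payload."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for model, turns in episodes:
        totals[model] += 1
        if any(turns):  # a single harmful turn marks the whole episode
            hits[model] += 1
    return {model: hits[model] / totals[model] for model in totals}


# Hypothetical data: early turns are benign, the payload arrives later.
episodes: List[Episode] = [
    ("model_a", [False, False, False]),
    ("model_a", [False, False, True]),
    ("model_b", [False, True]),
    ("model_b", [False, False, True, False]),
]

rates = attack_success_rate(episodes)
aggregate = sum(any(turns) for _, turns in episodes) / len(episodes)
```

Scoring at the episode level rather than per turn is what lets an incremental attack register: a run of benign turns does not offset one executed payload.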


