5arXiv cs.CL (Computation and Language)·32h ago

Adversarial Pragmatics benchmark for AI safety evaluation under instruction conflict and ambiguity

A new arXiv preprint introduces 'adversarial pragmatics' as both a benchmark and annotation protocol for evaluating language model behavior under linguistically complex conditions: instruction conflict, embedded commands, quotation, scope ambiguity, deixis, and multi-turn agentic transcripts. The work critiques existing safety benchmarks for collapsing nuanced failure modes into pass/fail labels, and proposes a taxonomy with an 18-item seed benchmark and expert-evaluation protocol that distinguishes task success, policy compliance, safety risk, refusal outcome, and evaluator confidence. The framework is designed to validate safety evals, LLM judges, gold-set construction, and prompt-injection tests. The contribution is primarily methodological, targeting the infrastructure of safety evaluation rather than model capabilities directly.

Evaluation and Benchmarking AI Safety Research Agent and Tool Ecosystem adversarial pragmatics Adversarial Pragmatics for AI Safety Evaluation: A Benchmark for Instruction Conflict, Embedded Commands, and Policy Ambiguity

Related guides (3)

AI Safety ResearchTopic guide

AI Safety Research: From Lab Principles to Real-World Flashpoints

Read asBeginner In-depth

Evaluation and BenchmarkingTopic guide

AI Evaluation and Benchmarking: From Leaderboards to the Limits of Measurement

Read asBeginner In-depth

Agent and Tool EcosystemTopic guide

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer

Read asBeginner In-depth

Related events (8)

7arXiv · cs.CL·1mo ago·source ↗

Automated Benchmark Auditing for AI Agents and Large Language Models (ABA)

The paper introduces Auto Benchmark Audit (ABA), an agentic framework that systematically audits AI benchmark tasks for issues such as ambiguous specifications, environment conflicts, and incorrect ground truths. Applied to 168 benchmarks across nine domains including NeurIPS publications, ABA identifies critical issues in over 25.7% of evaluated tasks. The authors demonstrate that filtering out flawed tasks materially shifts model rankings and improves average performance on SWE-bench Verified and Terminal-Bench 2 by 9.9% and 9.6% respectively, indicating that current benchmark scores are significantly distorted by task quality problems. The agentic tool and annotations are released publicly.

Frontier Model Releases Evaluation and Benchmarking NeurIPS Auto Benchmark Audit (ABA)SWE-Bench Verified +2 more

5arXiv · cs.CL·9d ago·source ↗

AdversaBench: Automated LLM red-teaming pipeline with multi-judge confirmation and cross-model transferability

AdversaBench is a new end-to-end red-teaming pipeline that mutates seed prompts using five structured operators and confirms failures via a three-judge panel with a meta-judge tiebreaker. Experiments on 45 seeds across reasoning, instruction-following, and tool-use categories produced confirmed failures for every seed. Key findings include sharp variation in operator effectiveness by category, misleading binary failure rates, judge agreement metrics distorted by label skew, and zero-shot transferability of adversarial prompts from Llama 3.1 8B to Llama 3.3 70B. Code and dataset are publicly released.

Evaluation and Benchmarking AI Safety Research Llama 3.1 70B AdversaBench Meta +1 more

5arXiv · cs.AI·3d ago·source ↗

EvalSafetyGap: Conceptual framework linking LLM evaluation failures to safety measurement gaps

A new arXiv preprint introduces EvalSafetyGap, a hybrid survey and conceptual framework arguing that benchmark scores, reward-model signals, and safety metrics can improve while the underlying properties they measure remain unverified. The paper synthesizes eight evidence streams spanning 2018–2026 and introduces two analytical constructs — an Instability Decomposition and an Alignment Trilemma — to structure comparisons between evaluation-side and alignment-side proxy failures under optimization pressure. A ten-model audit finds no statistically significant association between capability and adversarial robustness, and suggests the apparent open-versus-closed-model safety gap is driven more by governance and disclosure practices than behavioral robustness. The work proposes a shared vocabulary for dynamic evaluation, multi-attempt safety measurement, and auditable alignment practice.

Evaluation and Benchmarking AI Safety Research Goodhart's Law EvalSafetyGap +1 more

5Hugging Face Blog·1mo ago·source ↗

An Introduction to AI Secure LLM Safety Leaderboard

Hugging Face introduces the DecodingTrust-based LLM Safety Leaderboard, a benchmark framework for evaluating large language models across multiple safety and trustworthiness dimensions. The leaderboard aims to provide standardized, reproducible safety assessments covering areas such as toxicity, stereotype bias, adversarial robustness, and privacy. It offers a public ranking of models to help researchers and practitioners compare safety properties across different LLMs.

Evaluation and Benchmarking AI Safety Research LLM Safety Leaderboard Hugging Face DecodingTrust

6Openai Blog·1mo ago·source ↗

AI Safety via Debate

OpenAI proposes a safety technique in which two AI agents debate a topic and a human judge determines the winner, with the goal of making it easier for humans to supervise AI systems that may be more capable than themselves. The core intuition is that it is easier to verify a correct argument than to generate one, so a dishonest agent can be caught by an honest opponent. The paper introduces debate as a scalable oversight mechanism applicable to complex tasks where direct human evaluation is infeasible.

Evaluation and Benchmarking AI Safety Research AI Safety via Debate Debate (AI safety technique)OpenAI +2 more

6arXiv · cs.AI·3d ago·source ↗

EMPATH: Multilingual multi-turn safety benchmark for emotional-support chatbots reveals score inflation and run-to-run reliability failures

EMPATH is a new arXiv benchmark for evaluating the safety of emotional-support chatbots, using an auditor model to generate multi-turn crisis conversations and a calibrated judge model to score transcripts across 19 metrics in five dimensions. Built for Mexican Spanish and US English, the benchmark surfaces score inflation on 10 of 19 metrics under uncalibrated rubrics and finds that run-to-run reliability is a per-model safety property: one model swings 2–10 points on a crisis metric across identical reruns, and DeepSeek V4 Pro produces different conversations at temperature 0. Evaluation of three frontier models shows aggregate scores within 0.74 points but per-metric divergences up to six points, with rankings stable across a cross-family judge at 93% within ±1.

Evaluation and Benchmarking AI Safety Research EMPATH DeepSeek V4

6arXiv · cs.CL·29d ago·source ↗

Adversarial robustness and safety alignment in multilingual multimodal LLMs: cross-lingual vulnerability and 'safety-by-failure'

A systematic study evaluates adversarial robustness and safety alignment of multimodal LLMs across 12 languages, finding that adversarial images optimized in one language transfer to others (cross-lingual transferability). The paper introduces the concept of 'safety-by-failure': low-resource languages appear safer not due to genuine alignment but because models fail to comprehend harmful instructions in those languages. Models like Qwen3-VL that integrate multilingual capability throughout training (rather than only at instruction tuning) show genuine cross-lingual safety with active refusal. The findings challenge the assumption that low-resource language safety metrics reflect real alignment.

Evaluation and Benchmarking AI Safety Research Qwen3-4B Exploring Adversarial Robustness and Safety Alignment in Multilingual Multi-Modal Large Language Models +1 more

7arXiv · cs.CL·1mo ago·source ↗

Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety

Researchers introduce 'Boiling the Frog,' a multi-turn safety benchmark evaluating whether tool-using AI agents in corporate/office settings are susceptible to incremental attacks that begin with benign requests before introducing harmful payloads. The benchmark uses stateful multi-turn evaluation with a three-level operational risk taxonomy grounded in the EU AI Act and its GPAI Code of Practice. Across nine models, aggregate strict attack success rate is 44.4%, ranging from 20.5% for Claude Haiku 4.5 to 92.9% for Gemini 3.1 Flash Lite, with loss-of-control scenarios reaching 93.3% category-level ASR.

Evaluation and Benchmarking AI Safety Research Seed 2.0 Lite Claude Haiku 4.5 EU AI Act +7 more