paper

Adversarial Pragmatics for AI Safety Evaluation: A Benchmark for Instruction Conflict, Embedded Commands, and Policy Ambiguity

paper · active · provisional


1 event · first seen 32h ago


Co-occurring entities

adversarial pragmatics

More like this (12)

Concrete Problems in AI Safety
AI Safety via Debate
adversarial pragmatics
Debate (AI safety technique)
Evaluation Cards: An Interpretive Layer for AI Evaluation Reporting
third-party AI evaluations
The Unfireable Safety Kernel: Execution-Time AI Alignment for AI Agents and Other Escapable AI Systems
Clinician-Level Agreement Without Clinical Caution: LLM Evaluator Limits in Medical AI Benchmarking
A Causal Model of Theory of Mind in Conflict for Artificial Intelligence
APPO: Agentic Procedural Policy Optimization
Adversarial Attacks on Neural Network Policies
G7 International Code of Conduct for Advanced AI Systems

Recent events (1)

5 · arXiv · cs.CL · 32h ago

Adversarial Pragmatics benchmark for AI safety evaluation under instruction conflict and ambiguity

A new arXiv preprint introduces 'adversarial pragmatics' as both a benchmark and annotation protocol for evaluating language model behavior under linguistically complex conditions: instruction conflict, embedded commands, quotation, scope ambiguity, deixis, and multi-turn agentic transcripts. The work critiques existing safety benchmarks for collapsing nuanced failure modes into pass/fail labels, and proposes a taxonomy with an 18-item seed benchmark and expert-evaluation protocol that distinguishes task success, policy compliance, safety risk, refusal outcome, and evaluator confidence. The framework is designed to validate safety evals, LLM judges, gold-set construction, and prompt-injection tests. The contribution is primarily methodological, targeting the infrastructure of safety evaluation rather than model capabilities directly.
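To make the multi-dimensional labelling concrete, here is a minimal Python sketch of what one annotation record might look like. Only the five dimensions (task success, policy compliance, safety risk, refusal outcome, evaluator confidence) come from the summary above. The class names, field names, value ranges, refusal categories, and the pass/fail rule are all assumptions for illustration, not the preprint's actual schema.

```python
from dataclasses import dataclass
from enum import Enum


class RefusalOutcome(Enum):
    # Hypothetical label set; the summary does not list the real categories.
    COMPLIED = "complied"
    PARTIAL = "partial"
    REFUSED = "refused"


@dataclass(frozen=True)
class Annotation:
    """One expert judgment on one benchmark item (illustrative schema)."""
    item_id: str
    task_success: bool
    policy_compliant: bool
    safety_risk: int          # assumed scale: 0 (none) to 3 (severe)
    refusal: RefusalOutcome
    confidence: float         # assumed scale: 0.0 to 1.0

    def __post_init__(self) -> None:
        if not 0 <= self.safety_risk <= 3:
            raise ValueError("safety_risk must be in 0..3")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be in 0.0..1.0")


def collapse_to_pass_fail(a: Annotation) -> bool:
    """A single-label view of the kind the summary says is too coarse.

    The rule itself is an assumption: pass only if every dimension is clean.
    """
    return a.task_success and a.policy_compliant and a.safety_risk == 0


# Two different failure modes: an unnecessary refusal of a benign request,
# and a model that obeyed a command embedded in quoted text.
over_refusal = Annotation("item-01", False, True, 0, RefusalOutcome.REFUSED, 0.9)
injection = Annotation("item-02", False, False, 2, RefusalOutcome.COMPLIED, 0.7)
```

Under this assumed rule both records collapse to the same "fail" label even though they differ on policy compliance, safety risk, and refusal outcome, which is the distinction a single pass/fail label cannot carry.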

Evaluation and Benchmarking · AI Safety Research · adversarial pragmatics · Adversarial Pragmatics for AI Safety Evaluation: A Benchmark for Instruction Conflict, Embedded Commands, and Policy Ambiguity