Almanac
paper

Adversarial Pragmatics for AI Safety Evaluation: A Benchmark for Instruction Conflict, Embedded Commands, and Policy Ambiguity

paper · active · provisional · adversarial-pragmatics-for-ai-safety-evaluation-a-benchmark-for-instruction-conflict-embedded-commands-and-policy-ambiguity-58188599 · 1 event · first seen 32h ago



Recent events (1)

arXiv · cs.CL · 32h ago

Adversarial Pragmatics benchmark for AI safety evaluation under instruction conflict and ambiguity

A new arXiv preprint introduces 'adversarial pragmatics' as both a benchmark and an annotation protocol for evaluating language model behavior under linguistically complex conditions: instruction conflict, embedded commands, quotation, scope ambiguity, deixis, and multi-turn agentic transcripts. The work criticizes existing safety benchmarks for collapsing nuanced failure modes into pass/fail labels. It proposes a taxonomy, an 18-item seed benchmark, and an expert-evaluation protocol that records five separate judgments: task success, policy compliance, safety risk, refusal outcome, and evaluator confidence. The framework is intended for validating safety evaluations, LLM judges, gold-set construction, and prompt-injection tests. The contribution is primarily methodological: it targets the infrastructure of safety evaluation rather than model capabilities directly.
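To make the difference between a multi-dimensional annotation and a pass/fail label concrete, here is a minimal Python sketch of what one annotation record might look like. It is not the preprint's schema: the class and field names, the 0-3 risk scale, the 0.0-1.0 confidence scale, and the `collapse_to_pass_fail` rule are all assumptions chosen for illustration. Only the six phenomenon categories and the five judgment dimensions come from the summary above.

```python
from dataclasses import dataclass
from enum import Enum


class Phenomenon(Enum):
    """Linguistic conditions named in the summary (identifiers are hypothetical)."""
    INSTRUCTION_CONFLICT = "instruction_conflict"
    EMBEDDED_COMMAND = "embedded_command"
    QUOTATION = "quotation"
    SCOPE_AMBIGUITY = "scope_ambiguity"
    DEIXIS = "deixis"
    MULTI_TURN_AGENTIC = "multi_turn_agentic"


@dataclass(frozen=True)
class Annotation:
    """One expert judgment on one benchmark item (hypothetical layout)."""
    item_id: str
    phenomenon: Phenomenon
    task_success: bool      # did the model do what the user legitimately asked?
    policy_compliant: bool  # did the response stay within policy?
    safety_risk: int        # assumed scale: 0 (none) to 3 (severe)
    refused: bool           # refusal outcome
    confidence: float       # assumed scale: 0.0 to 1.0

    def __post_init__(self) -> None:
        # Reject values outside the assumed scales.
        if not 0 <= self.safety_risk <= 3:
            raise ValueError("safety_risk must be between 0 and 3")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0.0 and 1.0")


def collapse_to_pass_fail(a: Annotation) -> bool:
    """A deliberately lossy binary label, of the kind the preprint criticizes."""
    return a.policy_compliant and a.safety_risk == 0


# Two invented annotations: a helpful, safe answer and an unnecessary refusal.
helpful = Annotation("item-01", Phenomenon.QUOTATION, True, True, 0, False, 0.9)
over_refusal = Annotation("item-01", Phenomenon.QUOTATION, False, True, 0, True, 0.9)

print(collapse_to_pass_fail(helpful), collapse_to_pass_fail(over_refusal))
print(helpful == over_refusal)
```

Both invented records receive the same binary label even though one completes the task and the other refuses it, which is the kind of distinction a pass/fail benchmark cannot report.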