Evaluation Awareness Is Not One Capability: Evidence from Open Language Models
evaluation-awareness-is-not-one-capability-evidence-from-open-language-models-0e0314c3·1 events·first seen 39h agoAliases: Evaluation Awareness Is Not One Capability: Evidence from Open Language Models
Co-occurring entities
More like this (12)
Recent events (1)
Evaluation awareness in LLMs is multidimensional, not a single capability — evidence from 37 open models
A new arXiv paper characterizes 'evaluation awareness' — the ability of models to detect they are being tested and adapt behavior accordingly — across 37 open-weight models and 7 families using 8 experiments. Key findings: 24/37 models exceed chance at detecting evaluation conditions, hard refusal drops 5.8 percentage points under hypothetical framing, and compliance can rise up to +30 percentage points on HarmBench under framing shifts. Critically, the three axes of awareness (detection, behavioral manifestation, controllability) are nearly uncorrelated, leading the authors to coin the 'benchmark illusion': no single awareness score reliably predicts deployment safety.