Entity · benchmark

PseudoBench

benchmark · active · pseudobench-64e2b587 · 1 event · first seen Jun 17, 2026

Aliases: PseudoBench

More like this (12)

SorryBench RepoBench TriggerBench SelectBench PhantomBench SupraBench Int-Bench ProgramBench CursorBench FoldBench FeatBench MemBench

Recent events (1)

7 · arXiv · cs.CL · Jun 17, 2026

PseudoBench: Benchmark reveals agentic AI research systems readily produce pseudoscientific outputs

PseudoBench is an adversarial benchmark that tests whether agentic auto-research systems can identify and resist pseudoscientific narratives. It contains 200 curated claim-evidence pairs across five domains. Across seven state-of-the-art agents, the authors report near-zero refusal rates and a maximum resistance rate of only 27.4%, meaning current systems readily generate persuasive pseudoscientific reports. Stronger agents also package pseudoscience in more sophisticated language, which increases its apparent credibility rather than reducing harm. The authors call for 'scientific alignment' as a prerequisite for deploying autonomous research agents.
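To make the two headline metrics concrete, here is a minimal sketch of how a refusal rate and a resistance rate could be computed from per-item outcome labels. The summary does not give the paper's exact definitions, so everything here is an assumption: the `Outcome` labels, the `summarize` helper, and the choice to count a refusal as a form of resistance are all hypothetical, and the toy labels are made up rather than taken from the benchmark.

```python
from collections import Counter
from enum import Enum


class Outcome(Enum):
    # Hypothetical labels; the paper's actual rubric may differ.
    REFUSED = "refused"      # agent declined the task outright
    RESISTED = "resisted"    # agent engaged but flagged or rebutted the claim
    COMPLIED = "complied"    # agent produced a report endorsing the claim


def summarize(outcomes):
    """Return refusal and resistance rates as fractions of all items."""
    if not outcomes:
        raise ValueError("no outcomes to summarize")
    counts = Counter(outcomes)
    total = len(outcomes)
    refusal_rate = counts[Outcome.REFUSED] / total
    # Assumption: a refusal also counts as resisting the narrative.
    resistance_rate = (counts[Outcome.REFUSED] + counts[Outcome.RESISTED]) / total
    return {"refusal_rate": refusal_rate, "resistance_rate": resistance_rate}


# Made-up labels for illustration only; not data from the paper.
toy = [Outcome.COMPLIED] * 7 + [Outcome.RESISTED] * 2 + [Outcome.REFUSED]
rates = summarize(toy)
print(rates)
```

Under these assumed definitions, the resistance rate is always at least the refusal rate, which fits the summary's pairing of near-zero refusal rates with a resistance ceiling of 27.4%.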

Evaluation and Benchmarking · AI Safety Research · PseudoBench