Entity · organization

ICLR

organization · active · iclr-e0f37890 · 1 event · first seen May 29, 2026

Aliases: ICLR

Co-occurring entities

optimism bias · SoundnessBench

More like this (12)

Pose-ICL ICML MLIR IFLLM CICIDS RSICD CM-LRS HCIG RMISC iCAD RELAI CLIP

Recent events (1)

6 · arXiv · cs.LG · May 29, 2026

SoundnessBench: Benchmarking LLMs as Evaluators of ML Research Proposal Viability

SoundnessBench is a new benchmark of 1,099 machine-learning research proposals derived from ICLR submissions and labeled with reviewer soundness scores. It is designed to test whether LLMs can reliably distinguish methodologically sound research ideas from unsound ones. Evaluated across 12 frontier LLMs, the benchmark reveals a pervasive optimism bias: under standard prompting, models systematically rate low-soundness proposals as sound, and aggressive prompting shifts errors from false positives to false negatives rather than eliminating them. Controls for data contamination, surface features, and human audit quality suggest the bias is not attributable to a single confounder. The authors conclude that current LLMs are not yet reliable as standalone first-gate evaluators of scientific rigor, a gap that remains a critical bottleneck for autonomous AI research agents.
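The trade-off between false positives and false negatives described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy labels and scores are invented, the function names `error_rates` and `verdicts` are hypothetical, and a stricter decision threshold is only a stand-in for more aggressive prompting. It is not the benchmark's actual evaluation code.

```python
def error_rates(truth, predicted):
    """Return (false_positive_rate, false_negative_rate).

    Both arguments are lists of booleans where True means "sound".
    """
    fp = sum(1 for t, p in zip(truth, predicted) if not t and p)
    fn = sum(1 for t, p in zip(truth, predicted) if t and not p)
    negatives = sum(1 for t in truth if not t)
    positives = sum(1 for t in truth if t)
    fpr = fp / negatives if negatives else 0.0
    fnr = fn / positives if positives else 0.0
    return fpr, fnr


def verdicts(scores, threshold):
    """Turn hypothetical model scores into sound/unsound verdicts."""
    return [s >= threshold for s in scores]


# Invented toy data: three sound and three unsound proposals.
truth = [True, True, True, False, False, False]
scores = [0.9, 0.8, 0.6, 0.7, 0.65, 0.3]

# A lenient cutoff mimics an optimistic evaluator (assumption).
lenient = error_rates(truth, verdicts(scores, 0.5))
# A strict cutoff mimics a more aggressive prompt (assumption).
strict = error_rates(truth, verdicts(scores, 0.75))

print("lenient (FPR, FNR):", lenient)
print("strict  (FPR, FNR):", strict)
```

In this toy setup the lenient cutoff accepts two of the three unsound proposals, while the strict cutoff rejects all unsound proposals but also rejects one sound proposal. The errors move from one side to the other rather than disappearing, which is the pattern the summary reports.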

Evaluation and Benchmarking AI Safety Research ICLR optimism bias SoundnessBench