NatureBench

benchmark · active · provisional · naturebench-0a401598 · 1 event · first seen 20h ago

Aliases: NatureBench

Recent events (1)

7 · arXiv · cs.CL · 20h ago

NatureBench: Coding agents surpass published SOTA on only 17.8% of real scientific tasks from Nature-family papers

NatureBench introduces a 90-task benchmark derived from peer-reviewed Nature-family publications to evaluate whether AI coding agents can move beyond reproduction toward genuine scientific discovery. It is built on NatureGym, an automated pipeline that creates a containerized environment for each task, which addresses the environment fragmentation that has undermined prior agent-on-research evaluations. Across ten frontier agent configurations evaluated with web search disabled, the strongest model exceeds published SOTA on only 17.8% of tasks. Failures are driven primarily by wrong method choice and insufficient compute rather than by misunderstanding the task. Agents succeed mainly through methodological translation (recasting scientific problems as supervised prediction) rather than genuine scientific invention.
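To make the headline metric concrete, here is a minimal sketch of how a "fraction of tasks exceeding published SOTA" figure could be computed. The record layout (`TaskResult`), the field names, and the strict-inequality rule are all assumptions for illustration, not the benchmark's actual scoring code. The data is synthetic; it only shows that 16 wins out of 90 tasks rounds to the reported 17.8%, assuming the percentage is taken over all 90 tasks.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """One agent run on one task (hypothetical record layout)."""
    task_id: str
    agent_score: float
    published_sota: float
    higher_is_better: bool = True  # assumption: metric direction is per task


def exceeds_sota(r: TaskResult) -> bool:
    """True only if the agent strictly beats the published result."""
    if r.higher_is_better:
        return r.agent_score > r.published_sota
    return r.agent_score < r.published_sota


def exceed_rate(results: list[TaskResult]) -> float:
    """Fraction of tasks on which the agent exceeds published SOTA."""
    if not results:
        raise ValueError("no results")
    return sum(exceeds_sota(r) for r in results) / len(results)


# Synthetic data only: 16 wins out of 90 tasks.
demo = [TaskResult(f"t{i}", 1.0 if i < 16 else 0.0, 0.5) for i in range(90)]
print(f"{exceed_rate(demo):.1%}")  # prints 17.8%
```

A tie with the published number counts as a failure here; whether the benchmark treats ties that way is not stated in the summary.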