NatureBench

benchmark · active · provisional · naturebench-0a401598 · 1 event · first seen 20h ago

Aliases: NatureBench

More like this (12)

LiveBench RoleBench StakeBench SupraBench WildBench AdversaBench DeliveryBench ATE-Bench PaperBench AdvBench TokenBench HeraBench

Recent events (1)

7 · arXiv · cs.CL · 20h ago

NatureBench: Coding agents surpass published SOTA on only 17.8% of real scientific tasks from Nature-family papers

NatureBench introduces a 90-task benchmark derived from peer-reviewed Nature-family publications to evaluate whether AI coding agents can advance beyond reproduction toward genuine scientific discovery. Built on NatureGym, an automated pipeline that creates containerized per-task environments, the benchmark addresses environment fragmentation that has undermined prior agent-on-research evaluations. Evaluating ten frontier agent configurations under a web-search-disabled protocol, the strongest model exceeds published SOTA on only 17.8% of tasks, with failures driven primarily by wrong method choice and insufficient compute rather than task misunderstanding. Agents succeed mainly through methodological translation—recasting scientific problems as supervised prediction—rather than genuine scientific invention.
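The headline figure is a per-task comparison against published results, so a small sketch may help show how such a rate is computed. This is a minimal illustration and not NatureBench's actual scoring code: the `TaskResult` type, its field names, the toy tasks, and the strict-inequality rule for "exceeds" are all assumptions made here. The final line notes that 17.8% of 90 tasks is consistent with 16 tasks, which is an inference from the summary's numbers rather than a count stated in it.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Hypothetical record for one benchmark task (names are illustrative)."""
    task_id: str
    agent_score: float
    published_sota: float
    higher_is_better: bool = True  # False for error-style metrics such as RMSE


def surpasses_sota(result: TaskResult) -> bool:
    """Assumed rule: the agent must strictly beat the published value."""
    if result.higher_is_better:
        return result.agent_score > result.published_sota
    return result.agent_score < result.published_sota


def surpass_rate(results: list[TaskResult]) -> float:
    """Fraction of tasks on which the agent strictly beats published SOTA."""
    if not results:
        raise ValueError("no results to score")
    return sum(surpasses_sota(r) for r in results) / len(results)


# Toy data, invented for illustration only.
results = [
    TaskResult("toy-accuracy", agent_score=0.91, published_sota=0.88),
    TaskResult("toy-rmse", agent_score=1.40, published_sota=1.25,
               higher_is_better=False),
    TaskResult("toy-auroc", agent_score=0.80, published_sota=0.80),  # tie
]

print(f"{surpass_rate(results):.1%}")  # 1 of 3 toy tasks beats SOTA

# Inference from the summary: 16 of 90 tasks rounds to the reported rate.
print(f"{16 / 90:.1%}")
```

Treating a tie as a failure to surpass is a design choice in this sketch; a benchmark could equally apply a tolerance or a significance test.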

Evaluation and Benchmarking · Agent and Tool Ecosystem · NatureGym · FrontisAI · NatureBench