NatureBench: Coding agents surpass published SOTA on only 17.8% of real scientific tasks from Nature-family papers
NatureBench introduces a 90-task benchmark derived from peer-reviewed Nature-family publications to evaluate whether AI coding agents can advance beyond reproduction toward genuine scientific discovery. Built on NatureGym, an automated pipeline that creates containerized per-task environments, the benchmark addresses environment fragmentation that has undermined prior agent-on-research evaluations. Evaluating ten frontier agent configurations under a web-search-disabled protocol, the strongest model exceeds published SOTA on only 17.8% of tasks, with failures driven primarily by wrong method choice and insufficient compute rather than task misunderstanding. Agents succeed mainly through methodological translation—recasting scientific problems as supervised prediction—rather than genuine scientific invention.
Related events (8)
TxBench-PP: New benchmark reveals AI agents struggle with preclinical pharmacology decisions
Researchers introduce TxBench-PP (TherapeuticsBench Preclinical Pharmacology), a 100-evaluation benchmark testing AI agents on realistic small-molecule drug discovery tasks including mechanism-of-action reasoning, compound-target engagement, and translational efficacy. Agents receive real workflow snapshots and are graded deterministically on structured answers. Across 16 model-harness configurations and 4,800 trajectories, no system reliably succeeded; the best performer, Claude Opus 4.8 with the Pi harness, passed only 59.3% of endpoint attempts. The results suggest current frontier models are not yet deployment-ready for autonomous preclinical pharmacology decision-making.
PseudoBench: Benchmark reveals agentic AI research systems readily produce pseudoscientific outputs
PseudoBench is a new adversarial benchmark evaluating whether agentic auto-research systems can identify and resist pseudoscientific narratives, containing 200 curated claim-evidence pairs across five domains. Testing seven state-of-the-art agents, the authors find near-zero refusal rates and a maximum resistance rate of only 27.4%, meaning current systems readily generate persuasive pseudoscientific reports. A notable finding is that stronger agents package pseudoscience in more sophisticated language, increasing its apparent credibility rather than reducing harm. The authors call for 'scientific alignment' as a prerequisite for deploying autonomous research agents.
PaperBench: OpenAI Benchmark for Evaluating AI Agents on Research Replication
OpenAI introduces PaperBench, a benchmark designed to evaluate AI agents' ability to replicate state-of-the-art AI research papers end-to-end. The benchmark targets a high-complexity capability: reproducing experimental results from frontier AI research, which requires code generation, experimental design, and scientific reasoning. This positions PaperBench as a tool for tracking progress toward autonomous AI research agents.
DeepSWE, ProgramBench, and ITBench-AA emerge as harder successors to SWE-bench for agent evaluation
Three new benchmarks, DeepSWE (by Datacurve), ProgramBench (Meta/Stanford/Harvard), and ITBench-AA (IBM/Artificial Analysis), are positioned as more rigorous replacements for the SWE-bench family, which models have largely saturated. DeepSWE tests feature implementation using private codebases and human-written problems; ProgramBench evaluates agents' ability to recreate functional programs from scratch; ITBench-AA measures root-cause diagnosis in real-world IT incident scenarios. Current top performers include GPT-5.5 on DeepSWE (70%) and Claude Opus 4.7 on both ITBench-AA (46.7%) and ProgramBench (3% at the 95% pass threshold), illustrating that even frontier models leave substantial headroom on all three.
Benchmark Agent: Autonomous system for end-to-end benchmark construction
Researchers introduce Benchmark Agent, a fully autonomous agentic system that orchestrates the complete benchmark construction pipeline — from query analysis and subtask design to data annotation and quality control. The system was used to produce 15 benchmarks spanning text understanding, multimodal understanding, and domain-specific reasoning, with evaluation via human judges, LLM-as-a-judge, and consistency checks. The work addresses two persistent problems in the field: the labor intensity of benchmark creation and rapid performance saturation after release. Code and a demo will be publicly released.
AARRI-Bench evaluates frontier LLMs and agents on granular research-intern-level tasks
Researchers introduce AARR (Act As a Real Researcher), a new benchmark series targeting whether AI agents can emulate the professionalism, thoroughness, and nuanced judgment of human researchers in granular research scenarios—not just macro-level task execution. The first benchmark, AARRI-Bench, tests frontier models and agentic harnesses, finding that even the best configuration (Mini-SWE-Agent with Claude Opus 4.7) achieves only 68.3% success, frequently missing subtle but critical details obvious to human researchers. The work argues that closing the gap requires deeper modeling of research behavior rather than more complex scaffolding.
SWE-Explore: New benchmark isolates repository exploration capability in coding agents
SWE-Explore is a new benchmark targeting repository exploration as a distinct, fine-grained capability of coding agents, separate from end-to-end task resolution. It covers 848 issues across 10 programming languages and 203 open-source repositories, with line-level ground truth derived from successful agent trajectories. Evaluation across retrieval methods, coding agents, and specialized localizers finds that agentic explorers outperform classical retrieval, and that line-level coverage and efficient ranking remain the key differentiators at the frontier. The benchmark addresses a gap in SWE-bench-style evaluations that treat task resolution as a binary outcome.
ABC-Bench: Agentic biosecurity benchmark finds LLM agents surpass median expert humans on dual-use biology tasks
Researchers introduce ABC-Bench, a benchmark evaluating LLM agents on biosecurity-relevant biology tasks including liquid-handling robot programming, DNA fragment design, and evasion of DNA synthesis screening. All tested agents outperformed the median expert human baseline across all three tasks. Wet-lab validation confirmed that OpenAI's o4-mini-high produced scripts that successfully assembled DNA on an OpenTrons robot. The results highlight a meaningful shift in the biosecurity risk landscape as AI agents acquire practical wet-lab-adjacent capabilities.

