benchmark
PaperBench
benchmark · active
paperbench-6aea629b · 1 event · first seen 28d ago
Aliases: PaperBench
Recent events (1)
PaperBench: OpenAI Benchmark for Evaluating AI Agents on Research Replication
OpenAI introduces PaperBench, a benchmark designed to evaluate AI agents' ability to replicate state-of-the-art AI research papers end-to-end. The benchmark targets a high-complexity capability: reproducing experimental results from frontier AI research, which requires code generation, experimental design, and scientific reasoning. PaperBench can therefore serve as a tool for tracking progress toward autonomous AI research agents.