PaperBench

benchmark · active · paperbench-6aea629b · 1 event · first seen 28d ago

Aliases: PaperBench

Recent events (1)

OpenAI Blog · 28d ago

PaperBench: OpenAI Benchmark for Evaluating AI Agents on Research Replication

OpenAI introduces PaperBench, a benchmark that evaluates how well AI agents can replicate state-of-the-art AI research papers end-to-end. The task is deliberately demanding: reproducing experimental results from frontier AI research requires code generation, experimental design, and scientific reasoning. This positions PaperBench as a tool for tracking progress toward autonomous AI research agents.