Entity · benchmark

PaperBench

benchmark · active · paperbench-6aea629b · 2 events · first seen May 20, 2026

Aliases: PaperBench

More like this (12)

WildBench PortBench SorryBench FinBench Terminal-Bench FoldBench FilBench IT-Bench SpecBench ChipBench JailbreakBench EdgeBench

Recent events (2)

4 · arXiv · cs.CL · Jul 15, 2026

Meta-evaluation of LLM-generated rubrics for paper reproduction benchmarks

A new arXiv preprint presents the first systematic meta-evaluation of LLM-generated rubrics for assessing paper reproduction tasks, addressing the scalability limits of expert-constructed rubrics in benchmarks such as PaperBench. The authors test four generation settings across two backbone models and evaluate rubric quality both intrinsically (semantic similarity) and extrinsically (score alignment with ground truth). Augmented generation settings can approach human-baseline alignment, but LLM-generated rubrics tend to be overly fine-grained, score-biased, and domain-insensitive.
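The extrinsic check mentioned above (score alignment with ground truth) can be illustrated with a minimal sketch. The summary does not say which alignment metrics the preprint uses, so the choices here (Pearson correlation, mean absolute error, and mean signed error as a rough indicator of score bias) are assumptions, as are the function names and the sample scores, which are hypothetical and not taken from the paper.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / (var_x * var_y) ** 0.5


def alignment_report(rubric_scores, reference_scores):
    """Compare scores from a generated rubric with reference scores.

    Assumed metrics (not confirmed by the source):
      pearson           - do the two score sets rank submissions alike?
      mean_abs_error    - how far apart are the scores on average?
      mean_signed_error - positive means the generated rubric scores higher.
    """
    if len(rubric_scores) != len(reference_scores):
        raise ValueError("score lists must have the same length")
    if len(rubric_scores) < 2:
        raise ValueError("need at least two submissions")
    diffs = [r - g for r, g in zip(rubric_scores, reference_scores)]
    return {
        "pearson": pearson(rubric_scores, reference_scores),
        "mean_abs_error": mean(abs(d) for d in diffs),
        "mean_signed_error": mean(diffs),
    }


if __name__ == "__main__":
    # Hypothetical scores for three reproduction attempts, in [0, 1].
    generated = [0.5, 0.7, 0.9]   # scored with an LLM-generated rubric
    reference = [0.4, 0.6, 0.8]   # scored with an expert rubric
    print(alignment_report(generated, reference))
```

A high correlation combined with a consistently positive signed error would correspond to a rubric that ranks submissions sensibly but inflates scores, which is one way to read the "score-biased" finding.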

Evaluation and Benchmarking · Agent and Tool Ecosystem · PaperBench

7 · OpenAI Blog · May 20, 2026

PaperBench: OpenAI Benchmark for Evaluating AI Agents on Research Replication

OpenAI introduces PaperBench, a benchmark designed to evaluate AI agents' ability to replicate state-of-the-art AI research papers end-to-end. The benchmark targets a high-complexity capability: reproducing experimental results from frontier AI research, which requires code generation, experimental design, and scientific reasoning. This positions PaperBench as a tool for tracking progress toward autonomous AI research agents.

Evaluation and Benchmarking · AI Safety Research · OpenAI · PaperBench