ReproRepo

benchmark · active · provisional · reprorepo-60b5f3e0 · 1 event · first seen 5h ago

Aliases: ReproRepo

Recent events (1)

arXiv · cs.LG · 5h ago

ReproRepo: Scalable LLM agent framework for reproducibility auditing using GitHub issues

ReproRepo is a new framework for evaluating LLM agents on reproducibility auditing of ML research. It uses naturally occurring GitHub issues as supervision signals instead of costly manual curation. The framework is instantiated on 1,149 recent ML papers from major conferences and is used to benchmark four frontier model-agent configurations. The best-performing agent (Codex with GPT-5.5) surfaces at least one semantically related human-reported reproduction blocker for ~90% of papers, though exact localization of issues remains a weakness. The work provides a reusable, scalable evaluation harness for this underexplored agentic task.
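
The "at least one related blocker" figure in the summary reads as a paper-level hit rate: a paper counts as covered if any agent finding relates to any human-reported issue. The following is a minimal sketch of how such a metric could be computed, not ReproRepo's actual harness. The names `paper_hit_rate`, `token_overlap`, and `is_related`, the 0.5 threshold, and the example data are all hypothetical. The word-overlap matcher is only a crude stand-in for whatever semantic matching the paper uses.

```python
from typing import Callable, Dict, List


def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap of lowercase word sets (a crude stand-in for a semantic matcher)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def is_related(finding: str, issue: str) -> bool:
    """Hypothetical relatedness rule: word overlap at or above an assumed 0.5 threshold."""
    return token_overlap(finding, issue) >= 0.5


def paper_hit_rate(
    findings: Dict[str, List[str]],
    issues: Dict[str, List[str]],
    related: Callable[[str, str], bool],
) -> float:
    """Fraction of papers where at least one agent finding is related
    to at least one human-reported issue."""
    if not issues:
        return 0.0
    hits = 0
    for paper_id, paper_issues in issues.items():
        paper_findings = findings.get(paper_id, [])
        if any(related(f, i) for f in paper_findings for i in paper_issues):
            hits += 1
    return hits / len(issues)


# Made-up example data: human-reported issues and agent findings per paper.
issues = {
    "paper-a": ["missing requirements.txt so install fails"],
    "paper-b": ["checkpoint download link is broken"],
}
findings = {
    "paper-a": ["install fails because requirements.txt is missing"],
    "paper-b": ["random seed is not fixed in train script"],
}

rate = paper_hit_rate(findings, issues, is_related)
print(f"paper-level hit rate: {rate:.2f}")
```

A paper-level rate like this only requires one related match per paper, so it can be high even when most individual issues are missed or imprecisely located. That is consistent with the summary's note that exact localization remains a weakness.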