4AI Snake Oil·1mo ago

Can AI automate computational reproducibility?

This commentary introduces a new benchmark aimed at measuring AI's ability to automate computational reproducibility in scientific research. The piece examines whether AI systems can reliably re-execute and validate scientific computations, a key bottleneck in research integrity. It frames reproducibility automation as a concrete, measurable capability for evaluating AI's impact on science.

Evaluation and Benchmarking Agent and Tool Ecosystem Normal Tech / AI Snake Oil AI Reproducibility Benchmark

Related guides (2)

Agent and Tool EcosystemTopic guide

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer

Read asBeginner In-depth

Evaluation and BenchmarkingTopic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Read asBeginner In-depth

Related events (8)

7Openai Blog·1mo ago·source ↗

PaperBench: OpenAI Benchmark for Evaluating AI Agents on Research Replication

OpenAI introduces PaperBench, a benchmark designed to evaluate AI agents' ability to replicate state-of-the-art AI research papers end-to-end. The benchmark targets a high-complexity capability: reproducing experimental results from frontier AI research, which requires code generation, experimental design, and scientific reasoning. This positions PaperBench as a tool for tracking progress toward autonomous AI research agents.

Evaluation and Benchmarking AI Safety Research OpenAI PaperBench +1 more

8Openai Blog·1mo ago·source ↗

Measuring AI's capability to accelerate biological research

OpenAI introduces a real-world evaluation framework designed to measure how AI systems can accelerate biological research in wet lab settings. The work uses GPT-5 to optimize a molecular cloning protocol as a concrete demonstration case. The framework explicitly addresses both the potential benefits and biosecurity risks of AI-assisted experimentation, positioning this as a dual-use capability assessment.

Frontier Model Releases Evaluation and Benchmarking wet lab biological research evaluation framework OpenAI molecular cloning +3 more

5arXiv · cs.AI·5d ago·source ↗

Decade-long analysis of 56,800 AI conference papers finds sixfold increase in code/data sharing

A new arXiv preprint analyzes documentation and reproducibility practices across 56,800 papers from five leading AI conferences between 2014 and 2024. Code and data sharing rose nearly sixfold from 11% to 64%, with estimated reproducibility increasing from 28% to 64% over the same period. Notably, improvements in documentation practices predate the introduction of formal reproducibility checklists, suggesting the shift reflects a broader open-science movement rather than compliance with venue requirements.

Evaluation and Benchmarking The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers

7arXiv · cs.CL·26d ago·source ↗

Automated Benchmark Auditing for AI Agents and Large Language Models (ABA)

The paper introduces Auto Benchmark Audit (ABA), an agentic framework that systematically audits AI benchmark tasks for issues such as ambiguous specifications, environment conflicts, and incorrect ground truths. Applied to 168 benchmarks across nine domains including NeurIPS publications, ABA identifies critical issues in over 25.7% of evaluated tasks. The authors demonstrate that filtering out flawed tasks materially shifts model rankings and improves average performance on SWE-bench Verified and Terminal-Bench 2 by 9.9% and 9.6% respectively, indicating that current benchmark scores are significantly distorted by task quality problems. The agentic tool and annotations are released publicly.

Frontier Model Releases Evaluation and Benchmarking NeurIPS Auto Benchmark Audit (ABA)SWE-Bench Verified +2 more

5Ai Snake Oil·1mo ago·source ↗

New Paper: Towards a Science of AI Agent Reliability

A new paper proposes a framework for quantifying the gap between AI agent capability and reliability, aiming to establish a more rigorous science of agent dependability. The work addresses the observation that agents may demonstrate high capability on benchmarks while failing unpredictably in deployment. The piece is published via the normaltech.ai newsletter, associated with the AI Snake Oil research commentary tradition.

Evaluation and Benchmarking AI Safety Research Towards a Science of AI Agent Reliability normaltech.ai AI Snake Oil +2 more

5Import Ai·1mo ago·source ↗

Import AI 455: AI systems are about to start building themselves

Import AI issue 455 covers the emerging trend of AI systems automating AI research, framing it as a first step toward recursive self-improvement. The commentary synthesizes recent developments suggesting AI is beginning to participate meaningfully in its own development pipeline. As a tier-2 newsletter, this represents curated analysis of frontier AI research directions rather than primary reporting.

Frontier Model Releases AI Safety Research Recursive Self-Improvement automated AI research Jack Clark +2 more

5arXiv · cs.LG·4d ago·source ↗

ReproRepo: Scalable LLM agent framework for reproducibility auditing using GitHub issues

ReproRepo is a new framework for evaluating LLM agents on reproducibility auditing of ML research, using naturally occurring GitHub issues as supervision signals rather than costly manual curation. The framework is instantiated on 1,149 recent ML papers from major conferences and benchmarks four frontier model-agent configurations. The best-performing agent (Codex with GPT-5.5) surfaces at least one semantically related human-reported reproduction blocker for ~90% of papers, though exact localization of issues remains a weakness. The work provides a reusable, scalable evaluation harness for this underexplored agentic task.

Evaluation and Benchmarking Agent and Tool Ecosystem OpenAI ReproRepo Codex +1 more

4One Useful Thing·1mo ago·source ↗

Giving your AI a Job Interview

This commentary piece argues that as AI-generated advice becomes more consequential, users need systematic methods to evaluate AI reliability and quality—analogous to a job interview process. The author proposes frameworks for assessing AI outputs before trusting them for important decisions. The piece addresses the practical challenge of calibrating trust in AI systems across different use cases.

Evaluation and Benchmarking Enterprise Deployment Patterns Ethan Mollick One Useful Thing