5 · arXiv cs.AI (Artificial Intelligence) · 4d ago

Decade-long analysis of 56,800 AI conference papers finds sixfold increase in code/data sharing

A new arXiv preprint analyzes documentation and reproducibility practices across 56,800 papers from five leading AI conferences between 2014 and 2024. Code and data sharing rose nearly sixfold from 11% to 64%, with estimated reproducibility increasing from 28% to 64% over the same period. Notably, improvements in documentation practices predate the introduction of formal reproducibility checklists, suggesting the shift reflects a broader open-science movement rather than compliance with venue requirements.

Evaluation and Benchmarking · The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers
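
The "nearly sixfold" figure in the summary above can be checked directly from the reported percentages. The sketch below is only illustrative arithmetic on the numbers quoted in the summary; the helper names `fold_change` and `point_change` are hypothetical and not from the preprint.

```python
def fold_change(start_pct: float, end_pct: float) -> float:
    """Ratio of the final share to the initial share."""
    return end_pct / start_pct


def point_change(start_pct: float, end_pct: float) -> float:
    """Absolute change in percentage points."""
    return end_pct - start_pct


# Figures as quoted in the summary (2014 vs. 2024).
sharing_fold = fold_change(11, 64)   # code and data sharing
repro_fold = fold_change(28, 64)     # estimated reproducibility

print(f"Code/data sharing: {sharing_fold:.1f}x, +{point_change(11, 64)} points")
print(f"Estimated reproducibility: {repro_fold:.1f}x, +{point_change(28, 64)} points")
```

On these figures, sharing grew by about 5.8x (53 percentage points), while estimated reproducibility grew by about 2.3x (36 percentage points), so the sixfold headline applies to sharing rather than to reproducibility.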

Related guides (1)

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

4 · AI Snake Oil · 1mo ago

Can AI automate computational reproducibility?

This commentary introduces a new benchmark aimed at measuring AI's ability to automate computational reproducibility in scientific research. The piece examines whether AI systems can reliably re-execute and validate scientific computations, a key bottleneck in research integrity. It frames reproducibility automation as a concrete, measurable capability for evaluating AI's impact on science.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Normal Tech / AI Snake Oil · AI Reproducibility Benchmark

5 · arXiv · cs.AI · 5d ago

Taxonomy and governance gap analysis for AI contributors in open-source software

An arXiv preprint analyzes how open-source organizations handle AI-generated and agent-driven contributions, comparing policies across six major projects and foundations (SymPy, LLVM, Matplotlib, OpenInfra, Apache Software Foundation, Linux Foundation). The authors develop a six-dimensional taxonomy covering disclosure, responsibility, human oversight, licensing, enforcement, and maintainer workload, and score each organization's policy maturity. The paper maps documented agent incidents onto governance gaps, identifies misalignments with emerging regulatory frameworks including the EU AI Act, NIST AI RMF, and ISO/IEC 42001, and proposes a harmonized tiered framework.

AI Safety Research · Regulatory Developments · LLVM · Linux Foundation · NIST AI RMF

7 · OpenAI Blog · 1mo ago

AI and Compute: OpenAI Analysis of Exponential Growth in Training Compute Since 2012

OpenAI published an analysis in May 2018 showing that compute used in the largest AI training runs has been doubling every 3.4 months since 2012, far outpacing Moore's Law's 2-year doubling period. Over the 2012–2018 period, this metric grew by more than 300,000x. The analysis frames compute scaling as a key driver of AI progress and argues for preparing for systems with capabilities well beyond those of the time.

Training Infrastructure · Frontier Model Releases · Moore's Law · OpenAI · AI and Compute
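
The two figures in the summary above (a 3.4-month doubling time and growth of more than 300,000x) can be related with a short calculation. This is a minimal sketch using only the numbers quoted in the summary; the function names are hypothetical, and the 24-month figure is the summary's "2-year" Moore's Law doubling period.

```python
import math

DOUBLING_MONTHS = 3.4         # doubling time quoted in the summary
TOTAL_GROWTH = 300_000        # growth factor quoted in the summary
MOORE_DOUBLING_MONTHS = 24.0  # 2-year doubling period, for comparison


def doublings_needed(growth_factor: float) -> float:
    """Number of doublings required to reach a given growth factor."""
    return math.log2(growth_factor)


def months_for_growth(growth_factor: float, doubling_months: float) -> float:
    """Time in months to reach a growth factor at a fixed doubling time."""
    return doublings_needed(growth_factor) * doubling_months


def growth_over(months: float, doubling_months: float) -> float:
    """Growth factor accumulated over a span at a fixed doubling time."""
    return 2 ** (months / doubling_months)


n_doublings = doublings_needed(TOTAL_GROWTH)
span_months = months_for_growth(TOTAL_GROWTH, DOUBLING_MONTHS)
moore_growth = growth_over(span_months, MOORE_DOUBLING_MONTHS)

print(f"Doublings needed: {n_doublings:.1f}")
print(f"Time at a 3.4-month doubling: {span_months / 12:.1f} years")
print(f"Growth over the same span at a 2-year doubling: {moore_growth:.1f}x")
```

Under these assumptions, 300,000x corresponds to roughly 18 doublings, or a little over five years at the quoted rate, while a 2-year doubling period over the same span yields only a single-digit multiple.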

5 · OpenAI Blog · 1mo ago

Improving Verifiability in AI Development: Multi-Stakeholder Report

OpenAI contributed to a multi-stakeholder report co-authored by 58 researchers across 30 organizations, including Mila, CSET, and the Schwartz Reisman Institute. The report identifies 10 mechanisms for improving the verifiability of claims about AI systems. These mechanisms are intended to help developers demonstrate safety, security, fairness, and privacy properties, while enabling policymakers and civil society to evaluate AI development processes.

Evaluation and Benchmarking · AI Safety Research · Centre for the Future of Intelligence · Center for Security and Emerging Technology · Mila

5 · arXiv · cs.LG · 3d ago

ReproRepo: Scalable LLM agent framework for reproducibility auditing using GitHub issues

ReproRepo is a new framework for evaluating LLM agents on reproducibility auditing of ML research, using naturally occurring GitHub issues as supervision signals rather than costly manual curation. The framework is instantiated on 1,149 recent ML papers from major conferences and benchmarks four frontier model-agent configurations. The best-performing agent (Codex with GPT-5.5) surfaces at least one semantically related human-reported reproduction blocker for ~90% of papers, though exact localization of issues remains a weakness. The work provides a reusable, scalable evaluation harness for this underexplored agentic task.

Evaluation and Benchmarking · Agent and Tool Ecosystem · OpenAI · ReproRepo · Codex

5 · arXiv · cs.AI · 1mo ago

Empirical Study of Quality and Security in AI-Generated Python Refactoring Pull Requests

Researchers conduct an empirical analysis of AI-agent-authored Python refactoring pull requests from the AIDev dataset, evaluating quality and security outcomes using PyQu, Pylint, and Bandit. Results show agentic commits improve a quality attribute in 22.5% of changes, while 24.17% of modified files introduce new Pylint issues and 4.7% introduce new Bandit security findings. Despite mixed quality outcomes, 73.5% of analyzed PRs are merged by developers. The study derives a taxonomy of 24 recurring change operations and argues for stronger tool-in-the-loop gating in AI-driven development workflows.

Evaluation and Benchmarking · AI Safety Research · PyQu · GitHub · Bandit

5 · arXiv · cs.CL · 1mo ago

AiraXiv: AI-Driven Open-Access Publishing Platform for Human and AI Scientists

AiraXiv is a proposed open-access academic publishing platform designed to accommodate both human and AI-generated research outputs, addressing scalability challenges in traditional peer review. The platform supports AI scientists via Model Context Protocol (MCP)-based interactions and human scientists through an interactive UI, with papers evolving through continuous feedback-driven iteration. It was validated through real-world deployment as the submission platform for ICAIS 2025. The work positions itself as infrastructure for a future where AI agents are first-class participants in the scientific publishing ecosystem.

Evaluation and Benchmarking · Agent and Tool Ecosystem · AiraXiv · ArXiv · ICAIS 2025

6 · The Batch · 8d ago

The Batch explains recursive self-improvement hype following Anthropic's coding productivity report

The Batch analyzes the surge of interest in recursive self-improvement (RSI) triggered by Anthropic's report that Claude now authors or co-authors 80% of the company's code, up from under 5% before Claude Code launched. The piece documents concrete productivity metrics: engineers contributed 8x more lines of code in Q2 2026 than in Q1 2023, and 800 API fixes shipped in April would have taken humans alone four years. It also surveys community reactions ranging from skeptical (Brundage, Mollick) to opportunistic (OpenAI, Sakana AI's new RSI Lab). The commentary frames RSI as theoretically distant, noting both the marketing dimension of Anthropic's framing and the gap between agentic coding assistance and true self-directed improvement.

Frontier Model Releases · AI Safety Research · Ethan Mollick · Sakana AI · Meta-Agent Challenge