Decade-long analysis of 56,800 AI conference papers finds nearly sixfold increase in code/data sharing
A new arXiv preprint analyzes documentation and reproducibility practices across 56,800 papers from five leading AI conferences between 2014 and 2024. Code and data sharing rose nearly sixfold from 11% to 64%, with estimated reproducibility increasing from 28% to 64% over the same period. Notably, improvements in documentation practices predate the introduction of formal reproducibility checklists, suggesting the shift reflects a broader open-science movement rather than compliance with venue requirements.
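As a quick check on the "nearly sixfold" figure, the arithmetic can be sketched in a few lines of Python. The percentages are the ones quoted in the summary above; the helper names `fold_change` and `point_change` are illustrative only and do not come from the preprint.

```python
# Percentages as quoted in the summary (2014 value -> 2024 value).
SHARING = (11, 64)          # code and data sharing, in percent
REPRODUCIBILITY = (28, 64)  # estimated reproducibility, in percent


def fold_change(start_pct, end_pct):
    """Ratio of the final share to the initial share."""
    return end_pct / start_pct


def point_change(start_pct, end_pct):
    """Absolute change in percentage points."""
    return end_pct - start_pct


sharing_fold = fold_change(*SHARING)
repro_fold = fold_change(*REPRODUCIBILITY)
sharing_points = point_change(*SHARING)
repro_points = point_change(*REPRODUCIBILITY)

print(f"sharing: {sharing_fold:.2f}x, +{sharing_points} points")
print(f"reproducibility: {repro_fold:.2f}x, +{repro_points} points")
```

Sharing works out to about 5.8x (a 53-point rise), which matches "nearly sixfold", while estimated reproducibility rises by about 2.3x (36 points).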
Related events (8)
Can AI automate computational reproducibility?
This commentary introduces a new benchmark aimed at measuring AI's ability to automate computational reproducibility in scientific research. The piece examines whether AI systems can reliably re-execute and validate scientific computations, a key bottleneck in research integrity. It frames reproducibility automation as a concrete, measurable capability for evaluating AI's impact on science.
Taxonomy and governance gap analysis for AI contributors in open-source software
An arXiv preprint analyzes how open-source organizations are handling AI-generated and agent-driven contributions, comparing policies across six major projects and foundations (SymPy, LLVM, matplotlib, OpenInfra, Apache Software Foundation, Linux Foundation). The authors develop a six-dimensional taxonomy covering disclosure, responsibility, human oversight, licensing, enforcement, and maintainer workload, and score each organization's policy maturity. The paper maps documented agent incidents onto governance gaps, identifies misalignments with emerging regulatory frameworks including the EU AI Act, NIST AI RMF, and ISO/IEC 42001, and proposes a harmonized tiered framework.
AI and Compute: OpenAI Analysis of Exponential Growth in Training Compute Since 2012
OpenAI published an analysis in May 2018 showing that the compute used in the largest AI training runs had been doubling every 3.4 months since 2012, far faster than the 2-year doubling period of Moore's Law. Over the 2012–2018 period, this metric grew by more than 300,000x. The analysis frames compute scaling as a key driver of AI progress and argues for preparing for systems with capabilities well beyond those of the time.
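The relationship between the 3.4-month doubling time and the 300,000x figure can be checked with a short calculation. The sketch below is a back-of-the-envelope illustration, not OpenAI's methodology. It assumes clean exponential growth at a constant doubling time, and the function names are made up for this example.

```python
import math

# Figures quoted in the summary above.
AI_DOUBLING_MONTHS = 3.4     # doubling time for the largest training runs
MOORE_DOUBLING_MONTHS = 24   # Moore's Law, taken as a 2-year doubling period
GROWTH_FACTOR = 300_000      # reported growth over 2012-2018


def doublings_needed(growth_factor):
    """Number of doublings required to reach a given growth factor."""
    return math.log2(growth_factor)


def months_for_growth(growth_factor, doubling_months):
    """Months needed to reach growth_factor at a constant doubling time."""
    return doublings_needed(growth_factor) * doubling_months


def growth_over(months, doubling_months):
    """Growth factor accumulated over a span at a constant doubling time."""
    return 2 ** (months / doubling_months)


doublings = doublings_needed(GROWTH_FACTOR)
span_months = months_for_growth(GROWTH_FACTOR, AI_DOUBLING_MONTHS)
moore_growth = growth_over(span_months, MOORE_DOUBLING_MONTHS)

print(f"doublings needed: {doublings:.1f}")
print(f"time at 3.4-month doubling: {span_months:.0f} months")
print(f"Moore's Law growth over the same span: {moore_growth:.1f}x")
```

Under these assumptions, 300,000x corresponds to roughly 18 doublings, or a little over five years at a 3.4-month doubling time, which is consistent with the 2012 to 2018 window. A 2-year doubling period over the same span would yield only about a 6x increase.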
Improving Verifiability in AI Development: Multi-Stakeholder Report
OpenAI contributed to a multi-stakeholder report co-authored by 58 researchers across 30 organizations, including Mila, CSET, and the Schwartz Reisman Institute. The report identifies 10 mechanisms for improving the verifiability of claims about AI systems. These tools are intended to help developers demonstrate safety, security, fairness, and privacy properties, while enabling policymakers and civil society to evaluate AI development processes.
ReproRepo: Scalable LLM agent framework for reproducibility auditing using GitHub issues
ReproRepo is a new framework for evaluating LLM agents on reproducibility auditing of ML research, using naturally occurring GitHub issues as supervision signals rather than costly manual curation. The framework is instantiated on 1,149 recent ML papers from major conferences and benchmarks four frontier model-agent configurations. The best-performing agent (Codex with GPT-5.5) surfaces at least one semantically related human-reported reproduction blocker for ~90% of papers, though exact localization of issues remains a weakness. The work provides a reusable, scalable evaluation harness for this underexplored agentic task.
Empirical Study of Quality and Security in AI-Generated Python Refactoring Pull Requests
Researchers conduct an empirical analysis of AI-agent-authored Python refactoring pull requests from the AIDev dataset, evaluating quality and security outcomes using PyQu, Pylint, and Bandit. Results show agentic commits improve a quality attribute in 22.5% of changes, while 24.17% of modified files introduce new Pylint issues and 4.7% introduce new Bandit security findings. Despite mixed quality outcomes, 73.5% of analyzed PRs are merged by developers. The study derives a taxonomy of 24 recurring change operations and argues for stronger tool-in-the-loop gating in AI-driven development workflows.
AiraXiv: AI-Driven Open-Access Publishing Platform for Human and AI Scientists
AiraXiv is a proposed open-access academic publishing platform designed to accommodate both human and AI-generated research outputs, addressing scalability challenges in traditional peer review. The platform supports AI scientists via Model Context Protocol (MCP)-based interactions and human scientists through an interactive UI, with papers evolving through continuous feedback-driven iteration. It was validated through real-world deployment as the submission platform for ICAIS 2025. The work positions itself as infrastructure for a future where AI agents are first-class participants in the scientific publishing ecosystem.
The Batch explains recursive self-improvement hype following Anthropic's coding productivity report
The Batch analyzes the surge of interest in recursive self-improvement (RSI) triggered by Anthropic's report that Claude now authors or co-authors 80% of the company's code, up from under 5% before Claude Code launched. The piece documents concrete productivity metrics (engineers contributing 8x more lines of code in Q2 2026 than in Q1 2023, and 800 API fixes shipped in April that would have taken humans alone four years) alongside a spectrum of community reactions ranging from skeptical (Brundage, Mollick) to opportunistic (OpenAI, Sakana AI's new RSI Lab). The commentary frames RSI as theoretically distant, and notes both the marketing dimension of Anthropic's framing and the gap between agentic coding assistance and true self-directed improvement.
