Pacific Northwest National Laboratory and OpenAI Partner to Accelerate Federal Permitting with DraftNEPABench
OpenAI and Pacific Northwest National Laboratory (PNNL) have introduced DraftNEPABench, a benchmark designed to evaluate AI coding agents on federal permitting tasks under the National Environmental Policy Act (NEPA). Results on the benchmark suggest that AI assistance could reduce NEPA document drafting time by up to 15%. The collaboration aims to modernize infrastructure review processes through AI-assisted automation.
Related events (8)
PaperBench: OpenAI Benchmark for Evaluating AI Agents on Research Replication
OpenAI introduces PaperBench, a benchmark designed to evaluate AI agents' ability to replicate state-of-the-art AI research papers end-to-end. The benchmark targets a high-complexity capability: reproducing experimental results from frontier AI research, which requires code generation, experimental design, and scientific reasoning. This positions PaperBench as a tool for tracking progress toward autonomous AI research agents.
OpenAI and Los Alamos National Laboratory Announce Research Partnership on Biosafety Evaluations
OpenAI and Los Alamos National Laboratory (LANL) have announced a research partnership focused on developing safety evaluations for frontier AI models. The collaboration specifically targets assessing and measuring biological capabilities and risks. LANL brings national-lab-level biosecurity expertise to the effort, which aligns with OpenAI's broader preparedness framework for catastrophic risk domains.
Introducing HealthBench
OpenAI has released HealthBench, a new evaluation benchmark designed to assess AI model performance and safety in healthcare settings. The benchmark was developed with input from over 250 physicians and targets realistic clinical scenarios. It aims to establish a shared standard for measuring how well AI models handle health-related tasks.
OpenAI proposes federal governance blueprint for frontier AI safety and national security
OpenAI published a policy blueprint calling for a U.S. federal framework to govern frontier AI, covering safety, resilience, and national security. The proposal outlines OpenAI's vision for democratic oversight of the most capable AI systems. As a primary-source statement from a leading lab, it is a significant public policy position that will likely influence regulatory discussions.
OpenAI introduces LifeSciBench, a life sciences AI evaluation benchmark
OpenAI has released LifeSciBench, a benchmark designed to evaluate AI systems on real-world life science research tasks and decisions. The benchmark is described as expert-authored and expert-reviewed, targeting domain-specific evaluation in biology and related fields. This addresses a gap in specialized scientific benchmarking for AI systems.
Automated Benchmark Auditing for AI Agents and Large Language Models (ABA)
The paper introduces Auto Benchmark Audit (ABA), an agentic framework that systematically audits AI benchmark tasks for issues such as ambiguous specifications, environment conflicts, and incorrect ground truths. Applied to 168 benchmarks across nine domains, including benchmarks from NeurIPS publications, ABA identifies critical issues in over 25.7% of evaluated tasks. The authors show that filtering out flawed tasks materially shifts model rankings and raises average performance on SWE-bench Verified and Terminal-Bench 2 by 9.9% and 9.6%, respectively, which indicates that current benchmark scores are significantly distorted by task quality problems. The agentic tool and annotations are publicly released.
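The effect of filtering flawed tasks on rankings can be illustrated with a small sketch. Everything below is hypothetical: the model names, task IDs, pass/fail outcomes, and the set of flawed tasks are invented for illustration and are not taken from the ABA paper or its released annotations. The sketch only shows how removing a subset of tasks can change both the average pass rate and the order of models.

```python
# Hypothetical data: 12 tasks, of which 4 are assumed to be flagged as flawed.
ALL_TASKS = {f"t{i}" for i in range(1, 13)}
FLAWED_TASKS = {"t9", "t10", "t11", "t12"}  # assumed audit output
CLEAN_TASKS = ALL_TASKS - FLAWED_TASKS

# Hypothetical sets of tasks each model solved.
SOLVED = {
    "model_a": {f"t{i}" for i in range(1, 9)},
    "model_b": {f"t{i}" for i in (1, 2, 3, 4, 5, 6, 7, 9, 10)},
}


def pass_rates(tasks):
    """Return each model's pass rate restricted to the given task set."""
    return {
        model: len(solved & tasks) / len(tasks)
        for model, solved in SOLVED.items()
    }


def ranking(scores):
    """Return model names ordered from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


full_scores = pass_rates(ALL_TASKS)
clean_scores = pass_rates(CLEAN_TASKS)

print("all tasks:  ", full_scores, ranking(full_scores))
print("clean tasks:", clean_scores, ranking(clean_scores))
```

In this invented example, model_b leads on the full task set, but model_a leads once the flagged tasks are excluded, and both pass rates rise. Real audits would of course depend on how flawed tasks are identified, which this sketch does not model.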
Strengthening America's AI Leadership with the U.S. National Laboratories
OpenAI has announced a partnership to deploy its latest reasoning models with U.S. National Laboratories, giving the nation's leading scientists access to frontier AI capabilities for scientific research. The collaboration positions OpenAI's reasoning model line as a tool for high-stakes government scientific work. This represents a significant enterprise and government deployment milestone for OpenAI's o-series reasoning models.
OpenAI Introduces FrontierScience Benchmark for Scientific Research Tasks
OpenAI has released FrontierScience, a new benchmark designed to evaluate AI reasoning capabilities across physics, chemistry, and biology. The benchmark is intended to measure progress toward AI systems capable of performing real scientific research tasks. This represents OpenAI's effort to establish a rigorous evaluation framework for frontier-level scientific reasoning, going beyond standard academic problem sets.



