6·OpenAI Blog·3d ago

OpenAI introduces LifeSciBench, a life sciences AI evaluation benchmark

OpenAI has released LifeSciBench, a benchmark designed to evaluate AI systems on real-world life sciences research tasks and decisions. The benchmark is described as expert-authored and expert-reviewed, and it targets domain-specific evaluation in biology and related fields. The release addresses a gap in specialized scientific benchmarking for AI systems.

Evaluation and Benchmarking LifeSciBench OpenAI

Related guides (2)

OpenAI

OpenAI: The Lab That Made AI a Household Word

Evaluation and Benchmarking·Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

6·OpenAI Blog·1mo ago

Introducing HealthBench

OpenAI has released HealthBench, a new evaluation benchmark designed to assess AI model performance and safety in healthcare settings. The benchmark was developed with input from over 250 physicians and targets realistic clinical scenarios. It aims to establish a shared standard for measuring how well AI models handle health-related tasks.

Evaluation and Benchmarking AI Safety Research HealthBench OpenAI +1 more

7·OpenAI Blog·1mo ago

OpenAI Introduces FrontierScience Benchmark for Scientific Research Tasks

OpenAI has released FrontierScience, a new benchmark designed to evaluate AI reasoning capabilities across physics, chemistry, and biology. The benchmark is intended to measure progress toward AI systems capable of performing real scientific research tasks. This represents OpenAI's effort to establish a rigorous evaluation framework for frontier-level scientific reasoning, going beyond standard academic problem sets.

Frontier Model Releases Evaluation and Benchmarking physics biology FrontierScience +3 more

7·OpenAI Blog·1mo ago

PaperBench: OpenAI Benchmark for Evaluating AI Agents on Research Replication

OpenAI introduces PaperBench, a benchmark designed to evaluate AI agents' ability to replicate state-of-the-art AI research papers end-to-end. The benchmark targets a high-complexity capability: reproducing experimental results from frontier AI research, which requires code generation, experimental design, and scientific reasoning. This positions PaperBench as a tool for tracking progress toward autonomous AI research agents.

Evaluation and Benchmarking AI Safety Research OpenAI PaperBench +1 more

8·OpenAI Blog·1mo ago

Measuring AI's capability to accelerate biological research

OpenAI introduces a real-world evaluation framework designed to measure how AI systems can accelerate biological research in wet lab settings. The work uses GPT-5 to optimize a molecular cloning protocol as a concrete demonstration case. The framework explicitly addresses both the potential benefits and biosecurity risks of AI-assisted experimentation, positioning this as a dual-use capability assessment.

Frontier Model Releases Evaluation and Benchmarking wet lab biological research evaluation framework OpenAI molecular cloning +3 more

6·OpenAI Blog·1mo ago

MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering

OpenAI introduces MLE-bench, a benchmark designed to measure AI agent performance on machine learning engineering tasks. The benchmark draws from Kaggle competitions to evaluate agents on realistic ML engineering workflows. Initial results show that current agents, including those powered by o1-preview, achieve competitive performance on a subset of tasks but fall well short of top human competitors. The benchmark is intended to track progress in agentic ML capabilities over time.

Frontier Model Releases Evaluation and Benchmarking Kaggle o1-preview MLE-bench +2 more

4·Hugging Face Blog·1mo ago

AssetOpsBench: Bridging the Gap Between AI Agent Benchmarks and Industrial Reality

IBM Research introduces AssetOpsBench, a benchmark designed to evaluate AI agents on industrial asset operations tasks, hosted on Hugging Face. The benchmark targets the gap between existing general-purpose agent benchmarks and real-world industrial deployment scenarios. It provides a playground environment for testing agent capabilities in enterprise/industrial contexts.

Evaluation and Benchmarking Enterprise Deployment Patterns IBM Research AssetOpsBench Hugging Face +1 more

6·Google DeepMind Blog·1mo ago

Rethinking how we measure AI intelligence

DeepMind has announced Game Arena, a new open-source evaluation platform designed for rigorous head-to-head comparison of frontier AI models. The platform uses environments with clear winning conditions to assess model capabilities. It is DeepMind's response to ongoing concerns that existing AI benchmarks are inadequate.

Frontier Model Releases Evaluation and Benchmarking Game Arena DeepMind

7·arXiv · cs.CL·25d ago

Automated Benchmark Auditing for AI Agents and Large Language Models (ABA)

The paper introduces Auto Benchmark Audit (ABA), an agentic framework that systematically audits AI benchmark tasks for issues such as ambiguous specifications, environment conflicts, and incorrect ground truths. Applied to 168 benchmarks across nine domains, including benchmarks from NeurIPS publications, ABA identifies critical issues in more than 25.7% of evaluated tasks. The authors show that filtering out flawed tasks materially shifts model rankings and improves average performance on SWE-bench Verified and Terminal-Bench 2 by 9.9% and 9.6%, respectively, which indicates that task quality problems significantly distort current benchmark scores. The agentic tool and annotations are publicly released.

Frontier Model Releases Evaluation and Benchmarking NeurIPS Auto Benchmark Audit (ABA) SWE-Bench Verified +2 more