7 · OpenAI Blog · 1mo ago

OpenAI Introduces GDPval: Evaluation of Model Performance on Economically Valuable Real-World Tasks

OpenAI has released GDPval, a new benchmark designed to measure AI model performance on real-world, economically valuable tasks spanning 44 occupations. The evaluation aims to move beyond traditional academic benchmarks by grounding model assessment in tasks with direct economic relevance. It reflects OpenAI's effort to better quantify the practical utility and labor-market impact of frontier models.

Evaluation and Benchmarking Enterprise Deployment Patterns GDPval OpenAI

Related guides (3)

OpenAI

OpenAI: The Lab That Made AI a Household Word

Enterprise Deployment Patterns · Topic guide

Enterprise Deployment Patterns: From AI Demo to Production Reality

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

5 · OpenAI Blog · 1mo ago

OpenAI Releases Economic Analysis of ChatGPT's Impact and Launches Labor Market Research Collaboration

OpenAI has published an economic analysis examining ChatGPT's impact on the broader economy. Alongside this, the company is launching a new research collaboration focused on studying AI's effects on labor markets and productivity. The initiative signals OpenAI's growing engagement with economic and workforce policy questions as scrutiny of AI's labor displacement effects intensifies.

Enterprise Deployment Patterns Regulatory Developments ChatGPT OpenAI

5 · OpenAI Blog · 12d ago

OpenAI launches Economic Research Exchange to study AI's labor and productivity impacts

OpenAI has announced the Economic Research Exchange, a program to fund and facilitate external research on AI's effects on jobs, productivity, and the broader economy. Applications are open for selected research projects. The initiative signals OpenAI's interest in shaping the empirical narrative around AI's economic consequences.

AI Safety Research Regulatory Developments OpenAI Economic Research Exchange OpenAI

6 · OpenAI Blog · 1mo ago

Introducing HealthBench

OpenAI has released HealthBench, a new evaluation benchmark designed to assess AI model performance and safety in healthcare settings. The benchmark was developed with input from over 250 physicians and targets realistic clinical scenarios. It aims to establish a shared standard for measuring how well AI models handle health-related tasks.

Evaluation and Benchmarking AI Safety Research HealthBench OpenAI +1 more

8 · OpenAI Blog · 1mo ago

Measuring AI's capability to accelerate biological research

OpenAI introduces a real-world evaluation framework designed to measure how AI systems can accelerate biological research in wet lab settings. The work uses GPT-5 to optimize a molecular cloning protocol as a concrete demonstration case. The framework explicitly addresses both the potential benefits and biosecurity risks of AI-assisted experimentation, positioning this as a dual-use capability assessment.

Frontier Model Releases Evaluation and Benchmarking wet lab biological research evaluation framework OpenAI molecular cloning +3 more

4 · OpenAI Blog · 1mo ago

OpenAI Releases GABRIEL: Open-Source Toolkit for AI-Assisted Social Science Research

OpenAI has released GABRIEL, an open-source toolkit that leverages GPT models to convert qualitative text and images into quantitative data for social science research. The tool is designed to help researchers analyze large-scale qualitative datasets that would otherwise be impractical to process manually. It represents an application of frontier LLMs to academic research methodology rather than a new model or capability announcement.

Enterprise Deployment Patterns Agent and Tool Ecosystem GABRIEL OpenAI GPT

5 · AI Snake Oil · 1mo ago

Open-world evaluations for measuring frontier AI capabilities: Introducing CRUX

This commentary introduces CRUX, a new evaluation project designed to assess frontier AI systems on long-horizon, open-ended, and messy real-world tasks. The piece argues that existing benchmarks are insufficient for capturing the full range of capabilities exhibited by frontier models in complex settings. CRUX aims to fill this gap by providing evaluations that better reflect deployment-relevant performance.

Frontier Model Releases Evaluation and Benchmarking Normal Tech CRUX AI Snake Oil +1 more

4 · Hugging Face Blog · 1mo ago

AssetOpsBench: Bridging the Gap Between AI Agent Benchmarks and Industrial Reality

IBM Research introduces AssetOpsBench, a benchmark designed to evaluate AI agents on industrial asset operations tasks, hosted on Hugging Face. The benchmark targets the gap between existing general-purpose agent benchmarks and real-world industrial deployment scenarios. It provides a playground environment for testing agent capabilities in enterprise/industrial contexts.

Evaluation and Benchmarking Enterprise Deployment Patterns IBM Research AssetOpsBench Hugging Face +1 more

6 · Google DeepMind Blog · 1mo ago

Rethinking how we measure AI intelligence

DeepMind has announced Game Arena, a new open-source evaluation platform designed for rigorous head-to-head comparison of frontier AI models. The platform uses environments with clear winning conditions to assess model capabilities. This represents DeepMind's contribution to addressing ongoing concerns about the adequacy of existing AI benchmarks.

Frontier Model Releases Evaluation and Benchmarking Game Arena DeepMind