OpenAI Introduces GDPval: Evaluation of Model Performance on Economically Valuable Real-World Tasks
OpenAI has released GDPval, a new benchmark designed to measure AI model performance on economically valuable real-world tasks spanning 44 occupations. The evaluation aims to move beyond traditional academic benchmarks by grounding model assessment in tasks with direct economic relevance. GDPval represents OpenAI's effort to better quantify the practical utility and labor-market impact of frontier models.
Related events (8)
OpenAI Releases Economic Analysis of ChatGPT's Impact and Launches Labor Market Research Collaboration
OpenAI has published an economic analysis examining ChatGPT's impact on the broader economy. Alongside this, the company is launching a new research collaboration focused on studying AI's effects on labor markets and productivity. The initiative signals OpenAI's growing engagement with economic and workforce policy questions as scrutiny of AI's labor displacement effects intensifies.
OpenAI launches Economic Research Exchange to study AI's labor and productivity impacts
OpenAI has announced the Economic Research Exchange, a program to fund and facilitate external research on AI's effects on jobs, productivity, and the broader economy. Applications are open for selected research projects. The initiative signals OpenAI's interest in shaping the empirical narrative around AI's economic consequences.
Introducing HealthBench
OpenAI has released HealthBench, a new evaluation benchmark designed to assess AI model performance and safety in healthcare settings. The benchmark was developed with input from over 250 physicians and targets realistic clinical scenarios. It aims to establish a shared standard for measuring how well AI models handle health-related tasks.
Measuring AI's capability to accelerate biological research
OpenAI introduces a real-world evaluation framework designed to measure how AI systems can accelerate biological research in wet lab settings. As a concrete demonstration case, the work uses GPT-5 to optimize a molecular cloning protocol. The framework explicitly addresses both the potential benefits and the biosecurity risks of AI-assisted experimentation, positioning the work as a dual-use capability assessment.
OpenAI Releases GABRIEL: Open-Source Toolkit for AI-Assisted Social Science Research
OpenAI has released GABRIEL, an open-source toolkit that leverages GPT models to convert qualitative text and images into quantitative data for social science research. The tool is designed to help researchers analyze large-scale qualitative datasets that would otherwise be impractical to process manually. It represents an application of frontier LLMs to academic research methodology rather than a new model or capability announcement.
Open-world evaluations for measuring frontier AI capabilities: Introducing CRUX
This commentary introduces CRUX, a new evaluation project designed to assess frontier AI systems on long-horizon, open-ended, and messy real-world tasks. The piece argues that existing benchmarks are insufficient for capturing the full range of capabilities exhibited by frontier models in complex settings. CRUX aims to fill this gap by providing evaluations that better reflect deployment-relevant performance.
AssetOpsBench: Bridging the Gap Between AI Agent Benchmarks and Industrial Reality
IBM Research introduces AssetOpsBench, a benchmark hosted on Hugging Face that evaluates AI agents on industrial asset operations tasks. The benchmark targets the gap between existing general-purpose agent benchmarks and real-world industrial deployment scenarios. It provides a playground environment for testing agent capabilities in enterprise and industrial contexts.
Rethinking how we measure AI intelligence
DeepMind has announced Game Arena, a new open-source evaluation platform designed for rigorous head-to-head comparison of frontier AI models. The platform uses environments with clear winning conditions to assess model capabilities. This represents DeepMind's contribution to addressing ongoing concerns about the adequacy of existing AI benchmarks.


