Entity · benchmark

GDPval

benchmark · active · gdpval-bf84da34 · 3 events · first seen May 20, 2026

Aliases: GDPval

More like this (12)

GDPval-AA · GPQA · DPG Benchmark · SGD · GPIC benchmark · tcGP · EG-VQA · GPQA Diamond · IVGT · QVTo · G-Eval · CVaR (Conditional Value at Risk)

Recent events (3)

7 · The Batch · Jun 3, 2026

Data Points: GPT-5.4 Pro, Luma Uni-1, Phi-4-reasoning-vision-15B, Yuan 3.0 Ultra, OpenAI hardware chief resignation

The Batch's weekly roundup covers several AI developments. OpenAI released GPT-5.4 and GPT-5.4 Pro with computer-use agent capabilities, a 1M-token context window, and strong benchmark gains on GDPval and OSWorld-Verified. Luma AI released Uni-1, a unified autoregressive model for visual understanding and generation. Microsoft released Phi-4-reasoning-vision-15B, an open-weights multimodal model trained on 200B tokens. Yuan Lab AI released Yuan 3.0 Ultra, a 1T-parameter MoE model with state-of-the-art results on document retrieval benchmarks. Separately, OpenAI hardware chief Caitlin Kalinowski resigned over the company's Pentagon deal, citing concerns about surveillance and autonomous weapons governance.

Frontier Model Releases · Open Weights Progress · Black Forest Labs · Layer-Adaptive Expert Pruning · Caitlin Kalinowski · +19 more

6 · The Batch · May 23, 2026

Agent Benchmarks Skew Toward Software Engineering, Missing Most Economically Valuable Labor

Researchers from Carnegie Mellon University and Stanford University mapped over 10,000 examples from 43 agent benchmarks to U.S. labor statistics using O*NET occupational taxonomies, finding that current benchmarks heavily over-represent software engineering relative to its share of employment and wages. Office and administrative support (18.2M workers, $869.8B wages) and management (11M workers, $1326.3B wages) are vastly under-represented compared to computer and mathematical occupations (5.2M workers, $563.6B wages). No single benchmark covered more than 50% of work activities, and all 43 benchmarks combined covered only 56.5% of work activities. The study identifies a systematic gap between where agentic AI is being evaluated and where the largest economic opportunity lies.
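To make the scale of the gap concrete, the short sketch below expresses each occupation group's workforce and wage totals as multiples of the computer and mathematical group. It uses only the rounded figures quoted in the summary above; the function and variable names are illustrative and are not taken from the study, so the results are rough ratios and not the researchers' own analysis.

```python
# Figures copied from the summary above:
# (workers in millions, total wages in billions of USD).
OCCUPATIONS = {
    "Office and administrative support": (18.2, 869.8),
    "Management": (11.0, 1326.3),
    "Computer and mathematical": (5.2, 563.6),
}


def relative_to(baseline: str, data: dict) -> dict:
    """Return each group's worker and wage totals as multiples of the baseline group."""
    base_workers, base_wages = data[baseline]
    return {
        name: (workers / base_workers, wages / base_wages)
        for name, (workers, wages) in data.items()
    }


# Hypothetical helper output: ratios against the group that benchmarks over-represent.
ratios = relative_to("Computer and mathematical", OCCUPATIONS)

if __name__ == "__main__":
    for name, (worker_ratio, wage_ratio) in ratios.items():
        print(f"{name}: {worker_ratio:.2f}x workers, {wage_ratio:.2f}x wages")
```

On these figures, office and administrative support employs about 3.5 times as many workers as computer and mathematical occupations, and management accounts for roughly 2.35 times the wages.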

Evaluation and Benchmarking · Enterprise Deployment Patterns · Carnegie Mellon University · GDPval · Stanford University · +7 more

7 · OpenAI Blog · May 20, 2026

OpenAI Introduces GDPval: Evaluation of Model Performance on Economically Valuable Real-World Tasks

OpenAI has released GDPval, a new benchmark designed to measure AI model performance on real-world economically valuable tasks spanning 44 occupations. The evaluation aims to move beyond traditional academic benchmarks by grounding model assessment in tasks with direct economic relevance. This represents OpenAI's effort to better quantify the practical utility and labor-market impact of frontier models.

Evaluation and Benchmarking · Enterprise Deployment Patterns · GDPval · OpenAI