Almanac
benchmark

WebArena

benchmarkactivewebarena-0c77f55c·1 events·first seen 24d ago

Aliases: WebArena

Co-occurring entities

More like this (12)

Recent events (1)

6The Batch·24d ago·source ↗

Agent Benchmarks Skew Toward Software Engineering, Missing Most Economically Valuable Labor

Researchers from Carnegie Mellon University and Stanford University mapped over 10,000 examples from 43 agent benchmarks to U.S. labor statistics using O*NET occupational taxonomies, finding that current benchmarks heavily over-represent software engineering relative to its share of employment and wages. Office and administrative support (18.2M workers, $869.8B wages) and management (11M workers, $1326.3B wages) are vastly under-represented compared to computer and mathematical occupations (5.2M workers, $563.6B wages). No single benchmark covered more than 50% of work activities, and all 43 benchmarks combined covered only 56.5% of work activities. The study identifies a systematic gap between where agentic AI is being evaluated and where the largest economic opportunity lies.