benchmark
CRUX
benchmark·active
crux-86a82331·1 event·first seen 1mo ago
Aliases: CRUX
Recent events (1)
Open-world evaluations for measuring frontier AI capabilities: Introducing CRUX
This commentary introduces CRUX, a new evaluation project designed to assess frontier AI systems on long-horizon, open-ended, and messy real-world tasks. The piece argues that existing benchmarks are insufficient for capturing the full range of capabilities exhibited by frontier models in complex settings. CRUX aims to fill this gap by providing evaluations that better reflect deployment-relevant performance.