Almanac
benchmark

AutoLab

benchmark · active · provisional · autolab-31e8ec0f · 1 event · first seen 13d ago

Aliases: AutoLab

Recent events (1)

7 · arXiv · cs.AI · 13d ago

AutoLab benchmark evaluates frontier models on ultra-long-horizon iterative research and engineering tasks

AutoLab is a new benchmark of 36 expert-curated tasks spanning system optimization, puzzle-solving, model development, and CUDA kernel optimization. It is designed to test agents on sustained closed-loop improvement under wall-clock budgets rather than in single-turn or short-horizon settings. An evaluation of 17 frontier models finds that persistence in iterative benchmarking and feedback incorporation, rather than the quality of the initial attempt, is the dominant predictor of success. Claude Opus 4.6 is the strongest performer, while most models, including proprietary ones, either terminate early or exhaust their budgets with minimal progress. The benchmark, harness, and task artifacts are open-sourced.