Entity · benchmark

AutoLab

benchmark · active · autolab-31e8ec0f · 1 event · first seen Jun 4, 2026

Aliases: AutoLab

More like this (12)

LabBench, AutoMem, Analytics-Everywhere-Lab, AutomationBench-AA, FedLAB, TraceLab, iLearn-Lab, Lambda Labs, aiming-lab, AutoJourn, OpenDataLab, Yuan Lab AI

Recent events (1)

7 · arXiv · cs.AI · Jun 4, 2026

AutoLab benchmark evaluates frontier models on ultra long-horizon iterative research and engineering tasks

AutoLab is a new benchmark of 36 expert-curated tasks spanning system optimization, puzzle-solving, model development, and CUDA kernel optimization. It is designed to test agents on sustained closed-loop improvement under wall-clock budgets rather than in single-turn or short-horizon settings. An evaluation of 17 frontier models finds that persistence in iterative benchmarking and feedback incorporation, rather than the quality of the initial attempt, is the dominant predictor of success. Claude Opus 4.6 stands out as the strongest performer, while most models, including proprietary ones, either terminate early or exhaust their budgets with minimal progress. The benchmark, harness, and task artifacts are open-sourced.

Frontier Model Releases · Evaluation and Benchmarking · Claude Opus 4.6 · AutoLab · Anthropic