Almanac

AARRI-Bench

benchmark · active · provisional · aarri-bench-4d4e5a32 · 1 event · first seen 9d ago

Aliases: AARRI-Bench

Recent events (1)

6 · arXiv · cs.AI · 9d ago

AARRI-Bench evaluates frontier LLMs and agents on granular research-intern-level tasks

Researchers introduce AARR (Act As a Real Researcher), a new benchmark series that tests whether AI agents can emulate the professionalism, thoroughness, and nuanced judgment of human researchers in granular research scenarios, rather than only macro-level task execution. The first benchmark in the series, AARRI-Bench, evaluates frontier models and agentic harnesses and finds that even the best configuration (Mini-SWE-Agent with Claude Opus 4.7) achieves only 68.3% success, frequently missing subtle but critical details that are obvious to human researchers. The authors argue that closing the gap requires deeper modeling of research behavior rather than more complex scaffolding.