Entity · benchmark

AARRI-Bench

benchmark · active · aarri-bench-4d4e5a32 · 1 event · first seen Jun 8, 2026

Aliases: AARRI-Bench

Co-occurring entities

Claude Opus 4.6 · SWE-bench · Mini-SWE-Agent · Anthropic

More like this (12)

EARBench · ATE-Bench · RRBench · APS-Bench · RIO-Bench · AdvBench · RepoBench · AdversaBench · VR-Bench · SorryBench · ALE-Bench · Int-Bench

Recent events (1)

6 · arXiv · cs.AI · Jun 8, 2026

AARRI-Bench evaluates frontier LLMs and agents on granular research-intern-level tasks

Researchers introduce AARR (Act As a Real Researcher), a new benchmark series that tests whether AI agents can emulate the professionalism, thoroughness, and nuanced judgment of human researchers in granular research scenarios, not just execute macro-level tasks. The first benchmark, AARRI-Bench, evaluates frontier models and agentic harnesses and finds that even the best configuration (Mini-SWE-Agent with Claude Opus 4.6) achieves only a 68.3% success rate, frequently missing subtle but critical details that are obvious to human researchers. The work argues that closing the gap requires deeper modeling of research behavior rather than more complex scaffolding.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Claude Opus 4.6 · SWE-bench · AARRI-Bench