benchmark
AGORA
benchmark · active · provisional
agora-090180b7 · 1 event · first seen 4d ago
Aliases: AGORA
Recent events (1)
AGORA benchmark tests agentic document reasoning over large authentic workplace archives
Researchers introduce AGORA, a benchmark that pairs 362 questions with 9,664 authentic workplace documents (372M tokens across eight domain collections) to evaluate archive-grounded agentic reasoning. The archives are built to far exceed any model's context window, so a model must explore them deliberately rather than scan them exhaustively. Of the eight models evaluated, the best reaches only 59.4% accuracy, which indicates the task is far from solved. AGORA targets a gap in existing evaluations, which do not jointly stress archive-groundedness, agentic exploration, and cross-domain coverage.
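To put the scale claim in concrete terms, the back-of-the-envelope sketch below works only from the figures in the summary. The context window size is an assumption chosen for illustration (the source names no specific window), and the function name `archive_scale` is hypothetical. Because the 372M token count is a rounded figure, the results are rough averages, not reported statistics.

```python
# Figures taken from the summary above (372M is a rounded value).
TOTAL_TOKENS = 372_000_000
NUM_DOCUMENTS = 9_664
NUM_COLLECTIONS = 8

# Assumption: an illustrative context window size, not a figure from the source.
ASSUMED_CONTEXT_WINDOW = 1_000_000


def archive_scale(total_tokens, num_documents, num_collections, context_window):
    """Return rough averages and the number of windows needed to cover the archive."""
    return {
        "avg_tokens_per_document": total_tokens / num_documents,
        "avg_tokens_per_collection": total_tokens / num_collections,
        "windows_to_cover_archive": total_tokens / context_window,
    }


scale = archive_scale(
    TOTAL_TOKENS, NUM_DOCUMENTS, NUM_COLLECTIONS, ASSUMED_CONTEXT_WINDOW
)
for name, value in scale.items():
    print(f"{name}: {value:,.0f}")
```

Under these assumptions, an average document runs to roughly 38,000 tokens and an average collection to about 46.5M tokens, so even a single collection would not fit in the assumed one-million-token window, let alone the full archive.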