AGORA

benchmark · active · provisional · agora-090180b7 · 1 event · first seen 4d ago

Aliases: AGORA

Recent events (1)

6 · arXiv · cs.CL · 4d ago

AGORA benchmark tests agentic document reasoning over large authentic workplace archives

Researchers introduce AGORA, a benchmark that pairs 362 questions with 9,664 authentic workplace documents (372M tokens across eight domain collections) to evaluate archive-grounded agentic reasoning. The archive is designed to far exceed any model's context window, so a model must explore it deliberately rather than scan it exhaustively. Of the eight models evaluated, the best reaches only 59.4% accuracy, indicating the task is far from solved. The benchmark addresses a gap in existing evaluations, which do not jointly stress archive-groundedness, agentic exploration, and cross-domain coverage.
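
To give a rough sense of that scale, the sketch below works through the arithmetic implied by the reported figures. The counts come from the summary above ("372M" is treated as a rounded 372,000,000). The 1,000,000-token context window is a hypothetical value chosen only for illustration, not a figure from the source, and the function name `archive_scale` is likewise made up for this example.

```python
import math

# Figures reported in the summary above ("372M tokens" is a rounded value).
TOTAL_TOKENS = 372_000_000
NUM_DOCUMENTS = 9_664

# Assumption for illustration only: a hypothetical 1M-token context window.
ASSUMED_CONTEXT_WINDOW = 1_000_000


def archive_scale(total_tokens: int, num_documents: int, context_window: int):
    """Return (average tokens per document, context windows needed to read everything)."""
    avg_tokens_per_doc = total_tokens / num_documents
    windows_needed = math.ceil(total_tokens / context_window)
    return avg_tokens_per_doc, windows_needed


avg_tokens_per_doc, windows_needed = archive_scale(
    TOTAL_TOKENS, NUM_DOCUMENTS, ASSUMED_CONTEXT_WINDOW
)

print(f"Average tokens per document: {avg_tokens_per_doc:,.0f}")
print(f"Full context windows needed to read the whole archive: {windows_needed}")
```

Under these assumptions, a model would need hundreds of full context windows to read every document once, which is consistent with the summary's point that selective exploration is required.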