Almanac
paper

Benchmark Everything Everywhere All at Once

paper · active · provisional · benchmark-everything-everywhere-all-at-once-d51c8bea · 1 event · first seen 12d ago

Aliases: Benchmark Everything Everywhere All at Once

Recent events (1)

6 · arXiv · cs.AI · 12d ago

Benchmark Agent: Autonomous system for end-to-end benchmark construction

Researchers introduce Benchmark Agent, a fully autonomous agentic system that orchestrates the complete benchmark construction pipeline — from query analysis and subtask design to data annotation and quality control. The system was used to produce 15 benchmarks spanning text understanding, multimodal understanding, and domain-specific reasoning, with evaluation via human judges, LLM-as-a-judge, and consistency checks. The work addresses two persistent problems in the field: the labor intensity of benchmark creation and rapid performance saturation after release. Code and a demo will be publicly released.