T1-Bench

benchmark · active · provisional · t1-bench-5fc3bcd7 · 1 event · first seen 7d ago

Aliases: T1-Bench

Recent events (1)

4 · arXiv · cs.CL · 7d ago

T1-Bench: Multi-scenario agent benchmark across 25 real-world domains

T1-Bench is a new benchmark for evaluating agentic LLM systems in realistic customer-facing, multi-domain environments, covering 25 domains of varying difficulty with interleaved multi-turn scenarios. The authors evaluate 12 proprietary and open-weight models and combine automatic evaluation with human judgments. The benchmark targets gaps in existing agent evaluations around task complexity, domain diversity, and compositional reasoning across multi-step interactions.