T1-Bench
benchmark · active · provisional
t1-bench-5fc3bcd7 · 1 event · first seen 7d ago
Aliases: T1-Bench
Recent events (1)
T1-Bench: Multi-scenario agent benchmark across 25 real-world domains
T1-Bench is a new benchmark for evaluating agentic LLM systems in realistic, customer-facing environments. It covers 25 domains of varying difficulty, with interleaved multi-turn scenarios. The authors evaluate 12 proprietary and open-weight models, combining automatic evaluation with human judgments. The benchmark targets gaps in existing agent evaluations: task complexity, domain diversity, and compositional reasoning across multi-step interactions.