Entity · benchmark

T1-Bench

benchmark · active · t1-bench-5fc3bcd7 · 1 event · first seen Jun 10, 2026

Aliases: T1-Bench

More like this (12)

T3Bench T2I-CompBench TriggerBench ITBench-AA ATE-Bench SelectBench IT-Bench SorryBench τ²-Bench MTBench Int-Bench TAU-bench

Recent events (1)

4 · arXiv · cs.CL · Jun 10, 2026

T1-Bench: Multi-scenario agent benchmark across 25 real-world domains

T1-Bench is a new benchmark for evaluating agentic LLM systems in realistic customer-facing, multi-domain environments, covering 25 domains of varying difficulty with interleaved multi-turn scenarios. The authors evaluate 12 proprietary and open-weight models and combine automatic evaluation with human judgments. The benchmark targets gaps in existing agent evals around task complexity, domain diversity, and compositional reasoning across multi-step interactions.

Evaluation and Benchmarking · Agent and Tool Ecosystem · T1-Bench