benchmark
PlanBench-XL
benchmark · active · provisional
planbench-xl-f91ee107 · 1 event · first seen 8h ago
Aliases: PlanBench-XL
Recent events (1)
PlanBench-XL: Benchmark for LLM Agent Planning in Large-Scale Tool Ecosystems
Researchers introduce PlanBench-XL, an interactive benchmark of 327 retail tasks spanning 1,665 tools, designed to evaluate LLM agents on long-horizon planning when only a retrieved subset of tools is visible at any time. A blocking mechanism simulates real-world disruptions such as missing or failing tools, forcing agents to detect and recover from broken execution paths. Experiments on ten leading LLMs show severe performance degradation: GPT-5.4 drops from 51.90% accuracy in unblocked settings to 11.36% under the most severe blocking condition, a loss of about 40.5 percentage points, which highlights how fragile adaptive planning remains in large, imperfect tool environments.
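
The two ideas in the summary, retrieval-limited tool visibility and blocked tools, can be illustrated with a toy example. The sketch below is not PlanBench-XL's code or API. The class `ToyToolEnv`, the function `run_with_recovery`, the tool names, and the word-overlap retrieval are all hypothetical, chosen only to show what "detect and recover from a broken execution path" might look like in miniature.

```python
from dataclasses import dataclass, field


@dataclass
class ToyToolEnv:
    """Hypothetical environment: top-k tool retrieval plus a set of blocked tools."""
    tools: dict                      # tool name -> zero-argument callable
    blocked: set = field(default_factory=set)
    k: int = 2                       # how many tools the agent may see per query

    def retrieve(self, query: str) -> list:
        # Naive retrieval (an assumption): rank tool names by how many
        # words they share with the query, breaking ties alphabetically.
        words = set(query.lower().split())
        ranked = sorted(
            self.tools,
            key=lambda name: (-len(words & set(name.split("_"))), name),
        )
        return ranked[: self.k]

    def call(self, name: str):
        # A blocked tool fails at call time, so the agent has to notice.
        if name in self.blocked:
            raise RuntimeError(f"tool '{name}' is unavailable")
        return self.tools[name]()


def run_with_recovery(env: ToyToolEnv, query: str):
    """Try each visible tool in ranked order, falling back when one fails."""
    for name in env.retrieve(query):
        try:
            return name, env.call(name)
        except RuntimeError:
            continue  # broken path detected; try the next visible tool
    return None, None  # every visible tool was blocked


env = ToyToolEnv(
    tools={
        "refund_order": lambda: "refunded",
        "refund_order_manual": lambda: "refunded manually",
        "track_shipment": lambda: "in transit",
    },
    blocked={"refund_order"},
)
print(run_with_recovery(env, "refund order 1234"))
```

In this toy setup the agent sees only two of the three tools, finds the top-ranked one blocked, and falls back to the alternative. If every visible tool is blocked, the function returns `(None, None)`, a small-scale version of the failure that the reported accuracy drop reflects.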