benchmark

PlanBench-XL

benchmark · active · provisional · planbench-xl-f91ee107 · 1 event · first seen 8h ago

Aliases: PlanBench-XL

More like this (12)

SupraBench, LiveBench, ProgramBench, BixBench, LabBench, DeliveryBench, RoleBench, LongBench v2, PaperBench, AdvBench, SPBench, RepoBench

Recent events (1)

6 · arXiv · cs.CL · 8h ago

PlanBench-XL: Benchmark for LLM Agent Planning in Large-Scale Tool Ecosystems

Researchers introduce PlanBench-XL, an interactive benchmark of 327 retail tasks spanning 1,665 tools designed to evaluate LLM agents on long-horizon planning under retrieval-limited tool visibility. The benchmark includes a blocking mechanism simulating real-world disruptions such as missing or failing tools, forcing agents to detect and recover from broken execution paths. Experiments on ten leading LLMs reveal severe performance degradation: GPT-5.4 drops from 51.90% accuracy in unblocked settings to 11.36% under the most severe blocking condition, highlighting fragility in adaptive planning for large, imperfect tool environments.

Evaluation and Benchmarking · Agent and Tool Ecosystem · OpenAI · PlanBench-XL · GPT-5.5
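
The summary above mentions two mechanisms: retrieval-limited tool visibility and blocking of tools. The following is a minimal toy sketch of how such an environment might behave. It is not the PlanBench-XL implementation. The class `ToyToolEnv`, the function `run_with_recovery`, the keyword-based retrieval, and the example tool names are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyToolEnv:
    """Hypothetical toy environment; NOT the PlanBench-XL implementation."""
    tools: dict                                  # tool name -> callable
    blocked: set = field(default_factory=set)    # names of tools that fail when called
    top_k: int = 2                               # agent sees at most k tools per query

    def retrieve(self, query: str) -> list:
        # Naive keyword retrieval (an assumption): rank tools by how many
        # query words appear in the tool name, then keep only the top k.
        words = query.lower().split()
        ranked = sorted(self.tools, key=lambda name: -sum(w in name for w in words))
        return ranked[: self.top_k]

    def call(self, name: str, *args):
        # A blocked tool simulates a missing or failing tool.
        if name in self.blocked:
            raise RuntimeError(f"tool unavailable: {name}")
        return self.tools[name](*args)


def run_with_recovery(env: ToyToolEnv, query: str, *args):
    """Try the retrieved tools in rank order, falling back when one is blocked."""
    for name in env.retrieve(query):
        try:
            return name, env.call(name, *args)
        except RuntimeError:
            continue  # broken path detected; try the next visible tool
    return None, None  # no visible tool could complete the step


if __name__ == "__main__":
    env = ToyToolEnv(
        tools={
            "refund_order": lambda oid: f"refunded {oid}",
            "refund_order_v2": lambda oid: f"refunded {oid} (v2)",
            "track_shipment": lambda oid: "in transit",
        },
        blocked={"refund_order"},
    )
    print(run_with_recovery(env, "refund order", "A1"))
```

In this sketch the agent only ever sees `top_k` tools, so a blocked tool can only be worked around if an alternative is also among the retrieved tools. This is one plausible reading of why blocking combined with limited visibility is hard, not a claim about how the benchmark scores agents.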