ToolBench-X
toolbench-x-2b2c80e0 · 1 event · first seen 9h ago · Aliases: ToolBench-X
Recent events (1)
ToolBench-X benchmarks LLM agents under tool-environment unreliability
A new arXiv preprint introduces ToolBench-X, a benchmark for evaluating LLM agents under five structured hazard types: Specification Drift, Invocation Error, Execution Failure, Output Drift, and Cross-source Conflict. Each injected hazard remains solvable through a recovery path such as retry, fallback, or cross-checking, so the benchmark measures agent resilience rather than function-call accuracy alone. Experiments reveal a substantial reliability gap: agents that perform well in clean environments frequently fail under recoverable hazards, and those failures stem from poor hazard diagnosis rather than from insufficient tool-use volume or inference budget. The findings argue for shifting tool-use evaluation toward task completion under realistic, unreliable conditions.
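To make the idea of a recoverable hazard concrete, the following minimal Python sketch simulates an injected Execution Failure that an agent can overcome by retrying and then falling back to a second tool. It is an illustration only, not the benchmark's code or API: `ToolExecutionError`, `make_flaky_tool`, `fallback_tool`, and `call_with_recovery` are hypothetical names invented for this example.

```python
from typing import Callable


class ToolExecutionError(Exception):
    """Hypothetical error standing in for an injected Execution Failure."""


def make_flaky_tool(fail_times: int) -> Callable[[str], str]:
    """Build a toy tool that fails `fail_times` times, then succeeds."""
    state = {"calls": 0}

    def tool(query: str) -> str:
        state["calls"] += 1
        if state["calls"] <= fail_times:
            raise ToolExecutionError("injected transient failure")
        return f"primary:{query}"

    return tool


def fallback_tool(query: str) -> str:
    """Toy alternative tool that always succeeds."""
    return f"fallback:{query}"


def call_with_recovery(
    primary: Callable[[str], str],
    fallback: Callable[[str], str],
    query: str,
    max_retries: int = 2,
) -> str:
    """Try the primary tool up to 1 + max_retries times, then fall back."""
    for _ in range(1 + max_retries):
        try:
            return primary(query)
        except ToolExecutionError:
            continue  # recovery path 1: retry
    return fallback(query)  # recovery path 2: fallback


# One injected failure is absorbed by a retry.
recovered = call_with_recovery(make_flaky_tool(1), fallback_tool, "weather")
# Five injected failures exhaust the three attempts, so the fallback is used.
degraded = call_with_recovery(make_flaky_tool(5), fallback_tool, "weather")
```

In this toy setup the task stays solvable in both cases, which mirrors the summary's point that the hazards are recoverable. The sketch covers only retry and fallback; cross-checking and hazard diagnosis, which the preprint identifies as the main source of failures, are not modeled here.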