Almanac
benchmark

ToolBench-X

benchmark · active · provisional · toolbench-x-2b2c80e0 · 1 event · first seen 9h ago

Aliases: ToolBench-X

Recent events (1)

6 · arXiv · cs.CL · 9h ago

ToolBench-X benchmarks LLM agents under tool-environment unreliability

A new arXiv preprint introduces ToolBench-X, a benchmark that evaluates LLM agents under five structured hazard types: Specification Drift, Invocation Error, Execution Failure, Output Drift, and Cross-source Conflict. Every injected hazard remains solvable through a recovery path such as retrying, falling back, or cross-checking, so the benchmark measures agent resilience rather than function-call accuracy alone. Experiments reveal a substantial reliability gap: agents that perform well in clean environments frequently fail under recoverable hazards, and those failures stem from poor hazard diagnosis rather than from insufficient tool-use volume or inference budget. The preprint argues for shifting tool-use evaluation toward task completion under realistic, unreliable conditions.
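
To make the idea of a recoverable hazard concrete, here is a minimal sketch of an injected Execution Failure and two of the recovery paths mentioned above (retrying and falling back). It is an illustration only and is not taken from the preprint: `ToolExecutionError`, `make_flaky`, `call_with_recovery`, and the toy `lookup` and `backup` tools are all hypothetical names, and the sketch does not cover cross-checking or the other four hazard types.

```python
from typing import Callable, Tuple


class ToolExecutionError(Exception):
    """Hypothetical error type for a tool call that fails at execution time."""


def make_flaky(tool: Callable[[str], str], failures: int) -> Callable[[str], str]:
    """Wrap a tool so its first `failures` calls raise, then it works normally.

    This stands in for an injected, recoverable hazard (an assumption of this
    sketch, not the benchmark's actual injection mechanism).
    """
    remaining = {"n": failures}

    def wrapped(query: str) -> str:
        if remaining["n"] > 0:
            remaining["n"] -= 1
            raise ToolExecutionError("injected execution failure")
        return tool(query)

    return wrapped


def call_with_recovery(
    primary: Callable[[str], str],
    fallback: Callable[[str], str],
    query: str,
    max_retries: int = 2,
) -> Tuple[str, str]:
    """Call `primary`, retrying on failure, then switch to `fallback`.

    Returns the result together with the recovery path that produced it:
    "direct", "retry", or "fallback".
    """
    for attempt in range(max_retries + 1):
        try:
            result = primary(query)
        except ToolExecutionError:
            continue  # recoverable hazard: try the same tool again
        return result, "direct" if attempt == 0 else "retry"
    # Retries exhausted: use the alternative tool.
    return fallback(query), "fallback"


def lookup(query: str) -> str:
    """Toy primary tool (hypothetical)."""
    return f"primary:{query}"


def backup(query: str) -> str:
    """Toy fallback tool (hypothetical)."""
    return f"backup:{query}"


clean = call_with_recovery(lookup, backup, "q")
transient = call_with_recovery(make_flaky(lookup, failures=1), backup, "q")
persistent = call_with_recovery(make_flaky(lookup, failures=5), backup, "q")
print(clean, transient, persistent)
```

In this sketch the task stays solvable in all three cases; what changes is the path the caller has to take. Recording that path alongside the result is one simple way to separate "the agent completed the task" from "the agent only succeeds when nothing goes wrong".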