Beyond Function Calling: Benchmarking Tool-Using Agents under Tool-Environment Unreliability
Recent events (1)
ToolBench-X benchmarks LLM agents under tool-environment unreliability
A new arXiv preprint introduces ToolBench-X, a benchmark for evaluating LLM agents under five structured hazard types: Specification Drift, Invocation Error, Execution Failure, Output Drift, and Cross-source Conflict. Each injected hazard remains solvable through a recovery path such as retrying, falling back, or cross-checking, so the benchmark measures agent resilience rather than just function-call accuracy. Experiments reveal a substantial reliability gap: agents that perform well in clean environments frequently fail under recoverable hazards, and the failures are driven by poor hazard diagnosis rather than by insufficient tool-use volume or inference budget. The findings argue for shifting tool-use evaluation toward task completion under realistic, unreliable conditions.
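To make the three recovery paths concrete, here is a minimal Python sketch of an injected, recoverable hazard and of retrying, falling back, and cross-checking. It is not ToolBench-X's code or API. Every name in it (`TransientToolError`, `make_flaky`, `call_with_recovery`, `cross_check`) is hypothetical, and the majority-vote rule for resolving conflicting sources is an assumption chosen for illustration.

```python
from collections import Counter
from typing import Callable, Optional, Sequence

Tool = Callable[[str], str]


class TransientToolError(Exception):
    """Hypothetical stand-in for an injected, recoverable execution failure."""


def make_flaky(tool: Tool, failures: int) -> Tool:
    """Wrap `tool` so its first `failures` calls raise, then it behaves normally."""
    remaining = failures

    def wrapped(query: str) -> str:
        nonlocal remaining
        if remaining > 0:
            remaining -= 1
            raise TransientToolError("injected failure")
        return tool(query)

    return wrapped


def call_with_recovery(primary: Tool, fallback: Tool, query: str,
                       max_retries: int = 2) -> str:
    """Try the primary tool up to max_retries + 1 times, then use the fallback."""
    for _ in range(max_retries + 1):
        try:
            return primary(query)
        except TransientToolError:
            continue  # recoverable hazard: try again
    return fallback(query)


def cross_check(sources: Sequence[Tool], query: str) -> Optional[str]:
    """Return the strict-majority answer across sources, or None if there is none."""
    if not sources:
        return None
    answers = [source(query) for source in sources]
    value, count = Counter(answers).most_common(1)[0]
    return value if count > len(answers) / 2 else None
```

With a primary tool that fails twice and the default `max_retries=2`, the third attempt succeeds and the fallback is never called. With five injected failures, all three attempts fail and the fallback's answer is returned. `cross_check` returns `None` when no answer has a strict majority, which leaves the decision about how to resolve the conflict to the caller.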