paper

Beyond Function Calling: Benchmarking Tool-Using Agents under Tool-Environment Unreliability

paper · active · provisional · beyond-function-calling-benchmarking-tool-using-agents-under-tool-environment-unreliability-29e815d5 · 1 event · first seen 10h ago


Co-occurring entities

ToolBench-X

More like this (12)

Super-Agent benchmark
tool-calling agents
MemoryAgentBench
multi-turn agent benchmarks
Benchmark Agent
Towards a Science of AI Agent Reliability
AgentBeats: Agentifying Agent Assessment for Openness, Standardization, and Reproducibility
Auto Benchmark Audit (ABA)
AI Reproducibility Benchmark
temporally grounded QA benchmark
Procgen Benchmark
multi-level agent evaluation

Recent events (1)

6 · arXiv · cs.CL · 10h ago

ToolBench-X benchmarks LLM agents under tool-environment unreliability

A new arXiv preprint introduces ToolBench-X, a benchmark that evaluates LLM agents under five structured hazard types: Specification Drift, Invocation Error, Execution Failure, Output Drift, and Cross-source Conflict. Every injected hazard remains solvable through a recovery path such as retrying, falling back, or cross-checking, so the benchmark measures agent resilience rather than function-call accuracy alone. Experiments reveal a substantial reliability gap: agents that perform well in clean environments frequently fail under recoverable hazards, and the failures stem from poor hazard diagnosis rather than from insufficient tool use or inference budget. The findings argue for shifting tool-use evaluation toward task completion under realistic, unreliable conditions.
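To make the idea of a recoverable hazard concrete, here is a minimal sketch of an injected Execution Failure that remains solvable by retrying or by falling back to a second tool. It is an illustration only, not ToolBench-X's harness or API: `FlakyTool`, `ToolExecutionError`, and `call_with_recovery` are hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


class ToolExecutionError(Exception):
    """Hypothetical error type standing in for an injected execution failure."""


@dataclass
class FlakyTool:
    """Hypothetical hazard injector: fails the first `fail_times` calls, then succeeds."""
    fn: Callable[[str], str]
    fail_times: int
    calls: int = 0

    def __call__(self, query: str) -> str:
        self.calls += 1
        if self.calls <= self.fail_times:
            raise ToolExecutionError(f"injected failure on call {self.calls}")
        return self.fn(query)


def call_with_recovery(
    tools: List[Callable[[str], str]], query: str, max_retries: int = 1
) -> Tuple[str, List[str]]:
    """Try each tool in order; retry a failing tool up to `max_retries` times,
    then fall back to the next tool. Returns the result and a trace of attempts."""
    trace: List[str] = []
    for index, tool in enumerate(tools):
        for _ in range(max_retries + 1):
            try:
                result = tool(query)
                trace.append(f"tool{index}:ok")
                return result, trace
            except ToolExecutionError:
                trace.append(f"tool{index}:fail")
    raise RuntimeError("all recovery paths exhausted")


# Recovery by retry: the primary tool fails once, then succeeds.
primary = FlakyTool(lambda q: f"primary:{q}", fail_times=1)
retry_result, retry_trace = call_with_recovery([primary], "weather")

# Recovery by fallback: the primary keeps failing, so the backup tool answers.
broken = FlakyTool(lambda q: f"primary:{q}", fail_times=5)
backup = FlakyTool(lambda q: f"backup:{q}", fail_times=0)
fallback_result, fallback_trace = call_with_recovery([broken, backup], "weather")
```

A clean-environment evaluation would only score the final call. Keeping the trace of attempts lets a harness distinguish an agent that recovered from the hazard from one that never encountered it, which is the kind of resilience signal the summary describes.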

Evaluation and Benchmarking · Agent and Tool Ecosystem · ToolBench-X · Beyond Function Calling: Benchmarking Tool-Using Agents under Tool-Environment Unreliability