PhysTool-Bench

benchmark · active · phystool-bench-a9f23fc8 · 1 event · first seen Jun 10, 2026

Aliases: PhysTool-Bench

More like this (12)

ToolBench-X · FoldBench · Workbench · TailorBench · MTBench · PostTrainBench · SorryBench · PinchBench · ProgramBench · SelectBench · Phun-Bench · PhysicalSafetyBench-1K

Recent events (1)

6 · arXiv · cs.CL · Jun 10, 2026

PhysTool-Bench reveals severe gaps in MLLM physical tool use and embodied planning

Researchers introduce PhysTool-Bench, described as the first benchmark evaluating multimodal LLMs (MLLMs) on physical tool use. It covers 2,510 queries and 2,678 real-world tools spanning manufacturing, electrical work, agriculture, and healthcare. An evaluation of 13 leading MLLMs shows that even the best model (Gemini-3.1-Pro) identifies only 58.7% of the tools in a scene and completes just 21.0% of queries end-to-end. The results expose a two-level deficit: poor tool perception in realistic scenes, and a much larger drop at the planning stage, which indicates a lack of functional commonsense for mapping tools to task semantics. The planning-stage drop is presented as a critical bottleneck for embodied AI development.
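To make the size of the reported gap concrete, here is a minimal arithmetic sketch using only the two figures quoted above for Gemini-3.1-Pro. The helper name `gap_in_points` is hypothetical. Note the assumption behind the comparison: the two rates are measured over different units (tools in a scene versus queries), so the difference is only a rough indication of the drop between perception and end-to-end completion, not a conditional success rate.

```python
# Figures for the best model (Gemini-3.1-Pro), as quoted in the summary above.
TOOL_IDENTIFICATION_PCT = 58.7  # share of tools in a scene that are identified
END_TO_END_PCT = 21.0           # share of queries completed end-to-end


def gap_in_points(perception_pct: float, end_to_end_pct: float) -> float:
    """Hypothetical helper: percentage-point gap between two reported rates.

    The rates use different denominators (tools vs. queries), so the result
    is a rough comparison only.
    """
    return round(perception_pct - end_to_end_pct, 1)


gap = gap_in_points(TOOL_IDENTIFICATION_PCT, END_TO_END_PCT)
print(f"Gap: {gap} percentage points")  # prints "Gap: 37.7 percentage points"
```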

Evaluation and Benchmarking · Agent and Tool Ecosystem · Google · PhysTool-Bench · Gemini-3.1-Pro