Entity · benchmark

ESI-Bench

benchmark · active · esi-bench-fa707aab · 1 event · first seen May 19, 2026

Aliases: ESI-Bench

Co-occurring entities

Multimodal Large Language Models · OmniGibson · Spelke Core Knowledge Systems

More like this (12)

IT-Bench IVEBench EdgeBench EntityBench SpecBench FinBench SWE-bench SpatialBench HealthBench SelectBench Int-Bench SorryBench

Recent events (1)

arXiv · cs.LG · May 19, 2026

ESI-Bench: A Benchmark for Embodied Spatial Intelligence Closing the Perception-Action Loop

ESI-Bench is a new benchmark for embodied spatial intelligence that spans 10 task categories and 29 subcategories. It is built on OmniGibson and grounded in Spelke's core knowledge systems. Rather than passively processing oracle observations, agents must actively deploy perception, locomotion, and manipulation to accumulate task-relevant evidence. Experiments on state-of-the-art multimodal large language models (MLLMs) show that active exploration outperforms passive baselines. Most failures, however, stem from two sources: 'action blindness', in which poor action choices lead to cascading errors, and a metacognitive gap, in which models commit prematurely with high confidence regardless of evidence quality. In human studies, participants seek falsifying viewpoints and revise their beliefs under contradiction, a capability current models lack.
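
The contrast between committing prematurely and accumulating evidence before answering can be illustrated with a toy belief-update loop. This is a minimal sketch under stated assumptions, not the benchmark's API or the authors' method: the `Observation` type, the `explore_then_commit` function, the likelihood values, and the 0.9 confidence threshold are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Observation:
    """Hypothetical evidence gathered from one viewpoint.

    p_if_true / p_if_false are the assumed likelihoods of seeing this
    evidence when the hypothesis H is true or false, respectively.
    """
    viewpoint: str
    p_if_true: float
    p_if_false: float


def bayes_update(prior: float, obs: Observation) -> float:
    """Return the posterior P(H | obs) from the prior P(H) via Bayes' rule."""
    numerator = prior * obs.p_if_true
    denominator = numerator + (1.0 - prior) * obs.p_if_false
    return numerator / denominator


def explore_then_commit(
    observations: List[Observation],
    prior: float = 0.5,
    threshold: float = 0.9,
    max_steps: int = 5,
) -> Tuple[bool, float, int]:
    """Gather evidence until the belief is confident either way, or the
    step budget runs out, then commit to an answer.

    Returns (answer, final_belief, steps_taken).
    """
    belief = prior
    steps = 0
    for obs in observations[:max_steps]:
        belief = bayes_update(belief, obs)
        steps += 1
        # Stop only when the evidence supports a confident answer.
        if belief >= threshold or belief <= 1.0 - threshold:
            break
    return belief >= 0.5, belief, steps


# Toy scenario (invented numbers): the first view weakly supports H,
# while the later views contradict it.
views = [
    Observation("front", p_if_true=0.6, p_if_false=0.4),
    Observation("side", p_if_true=0.2, p_if_false=0.8),
    Observation("top", p_if_true=0.1, p_if_false=0.9),
]

# An agent limited to a single look commits to "True" at belief 0.6.
premature = explore_then_commit(views, max_steps=1)

# An agent that keeps exploring revises its belief and answers "False".
patient = explore_then_commit(views)
```

In this toy scenario the single-look agent answers from weak evidence, while the exploring agent reaches the opposite answer after seeing the contradicting views. That mirrors, in a very simplified form, the belief revision under contradiction that the summary attributes to human participants.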

Evaluation and Benchmarking · Agent and Tool Ecosystem · ESI-Bench · Multimodal Large Language Models · OmniGibson