5 · arXiv cs.CL (Computation and Language) · 2d ago

MECoBench: Benchmark for Multimodal Agent Collaboration in Embodied Environments

Researchers introduce MECoBench, a benchmark and evaluation platform for assessing multimodal LLM collaboration in visually grounded embodied environments. The benchmark spans diverse real-world tasks, two cooperation structures, and three collaboration modes. Three findings stand out: collaboration generally improves task completion, though the net benefit depends on whether the gains outweigh the added coordination complexity; communication is essential to realizing those gains; and collaboration improves robustness under noisy conditions.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Multimodal Progress · MECoBench

Related guides (3)

Evaluation and Benchmarking · Topic guide

AI Evaluation and Benchmarking: From Leaderboards to the Limits of Measurement

Multimodal Progress · Topic guide

Multimodal Progress: How AI Learned to See, Hear, and Act

Agent and Tool Ecosystem · Topic guide

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer

Related events (8)

5 · arXiv · cs.CL · Jun 5, 2026

CollabSim: CSCW-grounded framework for evaluating collaborative competence in LLM multi-agent systems

Researchers introduce CollabSim, a configurable simulation framework for systematically evaluating collaborative competence in LLM-based multi-agent systems (MAS). The framework draws on Computer-Supported Cooperative Work (CSCW) theory to define collaborative capabilities beyond task outcomes, including common ground establishment, shared task understanding, and misalignment repair. Experiments across four LLMs demonstrate the framework can distinguish model performance patterns and reveal task-dependent effects of agent design choices. The work addresses a gap in MAS evaluation, which has historically focused on individual task-solving rather than coordination quality.

Evaluation and Benchmarking · Agent and Tool Ecosystem · CollabSim

5 · arXiv · cs.CL · Jun 8, 2026

M³Exam: Benchmark for Multimodal Memory in Realistic User-Agent Interactions

Researchers introduce M³Exam, a query-centric multimodal conversational memory benchmark designed to evaluate language agents on realistic user-agent interactions, including cross-modal grounding and implicit information inference. The authors critique existing benchmarks for assuming sparse visuals and human-human interaction formats. The paper also proposes M³Proctor, a companion memory method that detects query modality bias and retrieves raw visual sources on demand, achieving a 13% accuracy improvement while reducing index-construction time and retrieved tokens by over 70%.

Evaluation and Benchmarking · Agent and Tool Ecosystem · M³Exam · M³Proctor

6 · arXiv · cs.LG · May 19, 2026

ESI-Bench: A Benchmark for Embodied Spatial Intelligence Closing the Perception-Action Loop

ESI-Bench is a new benchmark for embodied spatial intelligence spanning 10 task categories and 29 subcategories, built on OmniGibson and grounded in Spelke's core knowledge systems. It evaluates agents that must actively deploy perception, locomotion, and manipulation to accumulate task-relevant evidence, rather than passively processing oracle observations. Experiments on state-of-the-art MLLMs reveal that active exploration outperforms passive baselines, but most failures stem from two sources: 'action blindness' (poor action choices leading to cascading errors) and a metacognitive gap in which models commit prematurely with high confidence regardless of evidence quality. Human studies show that humans seek falsifying viewpoints and revise beliefs under contradiction, a capability current models lack.

Evaluation and Benchmarking · Agent and Tool Ecosystem · ESI-Bench · Multimodal Large Language Models · OmniGibson

5 · arXiv · cs.CL · Jun 23, 2026

EnterpriseClawBench: A benchmark for enterprise agents derived from real workplace sessions

Researchers introduce EnterpriseClawBench, an enterprise agent benchmark constructed from proprietary real-world workplace sessions, yielding 852 reproducible tasks with fixtures, prompts, role classes, skill subclasses, and semantic rubrics. Because the sessions contain internal enterprise content, the benchmark data is not publicly released, but the construction and evaluation protocol is the reusable contribution. The best evaluated configuration (Codex with GPT-5.5) achieves only 0.663, indicating substantial headroom. The paper argues enterprise agent evaluation must report harness-model combinations, artifact delivery, visual quality, cost, runtime, and skill-transfer behavior rather than collapsing to a single score.

Evaluation and Benchmarking · Enterprise Deployment Patterns · FrontisAI · EnterpriseClawBench · Codex

5 · arXiv · cs.AI · 4d ago

LLawCo framework teaches embodied multi-agent LLMs to derive and follow cooperation laws

Researchers from MERL propose LLawCo (Learning Laws of Cooperation), a framework that enables embodied LLM-based agents to autonomously align with partners and task objectives in decentralized, partially observable environments. Agents reflect on past failures to extract misaligned behavioral patterns and derive high-level behavioral laws (e.g., 'Talk when necessary', 'Wait for partner'), which are incorporated into reasoning via supervised fine-tuning. The authors also introduce PARTNR-Dialog, a new large-scale multi-agent communicative planning benchmark, and report average success rate improvements of 4.5% on PARTNR-Dialog and 6.8% on TDW-MAT over state-of-the-art open-source communicative agent frameworks across four backbone LLMs.

Evaluation and Benchmarking · Agent and Tool Ecosystem · LLawCo · MERL · PARTNR

6 · arXiv · cs.AI · Jun 5, 2026

Benchmark Agent: Autonomous system for end-to-end benchmark construction

Researchers introduce Benchmark Agent, a fully autonomous agentic system that orchestrates the complete benchmark construction pipeline, from query analysis and subtask design to data annotation and quality control. The system was used to produce 15 benchmarks spanning text understanding, multimodal understanding, and domain-specific reasoning, with evaluation via human judges, LLM-as-a-judge, and consistency checks. The work addresses two persistent problems in the field: the labor intensity of benchmark creation and rapid performance saturation after release. Code and a demo will be publicly released.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Benchmark Everything Everywhere All at Once · Benchmark Agent

6 · arXiv · cs.CL · Jun 10, 2026

PhysTool-Bench reveals severe gaps in MLLM physical tool use and embodied planning

Researchers introduce PhysTool-Bench, the first benchmark evaluating multimodal LLMs on physical tool use across 2,510 queries and 2,678 real-world tools spanning manufacturing, electrical work, agriculture, and healthcare. Evaluation of 13 leading MLLMs shows even the best model (Gemini-3.1-Pro) identifies only 58.7% of tools in a scene and completes just 21.0% of queries end-to-end. The results expose a two-level deficit: poor tool perception in realistic scenes and a much larger drop at the planning stage, indicating a lack of functional commonsense for mapping tools to task semantics. This pinpoints a critical bottleneck for embodied AI development.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Google · PhysTool-Bench · Gemini-3.1-Pro

4 · arXiv · cs.CL · Jun 10, 2026

T1-Bench: Multi-scenario agent benchmark across 25 real-world domains

T1-Bench is a new benchmark for evaluating agentic LLM systems in realistic customer-facing, multi-domain environments, covering 25 domains of varying difficulty with interleaved multi-turn scenarios. The authors evaluate 12 proprietary and open-weight models and combine automatic evaluation with human judgments. The benchmark targets gaps in existing agent evals around task complexity, domain diversity, and compositional reasoning across multi-step interactions.

Evaluation and Benchmarking · Agent and Tool Ecosystem · T1-Bench
