5arXiv cs.AI (Artificial Intelligence)·1mo ago

EntityBench: Benchmark for Entity-Consistent Long-Range Multi-Shot Video Generation

EntityBench is a new benchmark comprising 140 episodes (2,491 shots) derived from real narrative media, designed to evaluate entity consistency—characters, objects, and locations—across long multi-shot video generation sequences. It introduces tiered difficulty up to 50 shots and recurrence gaps of up to 48 shots, paired with a three-pillar evaluation suite covering intra-shot quality, prompt alignment, and cross-shot consistency. The authors also propose EntityMem, a memory-augmented baseline that stores verified per-entity visual references in a persistent memory bank, achieving the highest character fidelity (Cohen's d = +2.33) among evaluated methods. Results show that cross-shot entity consistency degrades sharply with recurrence distance in existing approaches.

Evaluation and Benchmarking Multimodal Progress Cohen's d EntityMem Catherine R. He EntityBench

Related guides (2)

Multimodal ProgressTopic guide

Multimodal Progress: How AI Learned to See, Hear, and Act

Read asBeginner In-depth

Evaluation and BenchmarkingTopic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Read asBeginner In-depth

Related events (8)

7arXiv · cs.AI·18d ago·source ↗

Moment-Video: Benchmark Diagnosing Temporal Fidelity of Video MLLMs on Momentary Visual Events

Moment-Video is a new benchmark of 1,000 human-verified video-QA pairs designed to evaluate how well video multimodal large language models (MLLMs) handle brief, localized visual events that may span only a few frames. The benchmark covers 7 domains and 25 subcategories across four task types: Temporal Occurrence, Temporal Counting, Action Description, and Temporal Reasoning. Evaluation of 33 proprietary and open-source models reveals severe deficiencies: the best model (Seed-2.0-Pro) achieves only 39.6% accuracy, while most open-source models score below 25%. Diagnostic analyses show that denser frame sampling helps but does not resolve the bottleneck, pointing to fundamental limitations in how current video MLLMs represent and preserve transient visual evidence.

Long Context Evolution Evaluation and Benchmarking Multimodal Large Language Models Moment-Video Seed-2.0-Pro +4 more

5arXiv · cs.CL·11d ago·source ↗

Benchmark for view-level visual evidence identification in multi-view MLLMs for autonomous driving

A new arXiv preprint introduces a multi-view visual question answering benchmark targeting evidence-source identification in autonomous driving scenarios. Given six synchronized NuScenes camera views and a question, models must identify which camera view supports the answer — not just produce a correct answer. The 122-pair benchmark spans causality, counterfactual reasoning, and intent prediction, and exposes grounding failures that answer-only evaluation misses. The work addresses a meaningful gap between answer accuracy and correct visual grounding in safety-critical multimodal systems.

Evaluation and Benchmarking Multimodal Progress NuScenes Where Does the Answer Come From? Benchmarking View-Level Visual Evidence Identification in Multi-View MLLMs for Autonomous Driving

6arXiv · cs.LG·1mo ago·source ↗

ESI-Bench: A Benchmark for Embodied Spatial Intelligence Closing the Perception-Action Loop

ESI-Bench is a new benchmark for embodied spatial intelligence spanning 10 task categories and 29 subcategories, built on OmniGibson and grounded in Spelke's core knowledge systems. It evaluates agents that must actively deploy perception, locomotion, and manipulation to accumulate task-relevant evidence, rather than passively processing oracle observations. Experiments on state-of-the-art MLLMs reveal that active exploration outperforms passive baselines, but most failures stem from 'action blindness'—poor action choices leading to cascading errors—and a metacognitive gap where models commit prematurely with high confidence regardless of evidence quality. Human studies show humans seek falsifying viewpoints and revise beliefs under contradiction, a capability current models lack.

Evaluation and Benchmarking Agent and Tool Ecosystem ESI-Bench Multimodal Large Language Models OmniGibson +2 more

5arXiv · cs.CL·8d ago·source ↗

EvoArena benchmark and EvoMem memory paradigm for LLM agents in dynamic environments

Researchers introduce EvoArena, a benchmark suite that evaluates LLM agents in dynamic environments by modeling changes as progressive update sequences across terminal, software, and social domains. Alongside it, they propose EvoMem, a patch-based memory paradigm that records memory evolution as structured update histories to help agents reason about environmental change. Current agents score only 39.6% average accuracy on EvoArena, while EvoMem yields consistent gains on EvoArena and also improves performance on GAIA and LoCoMo benchmarks. The work highlights a significant gap between static-benchmark performance and real-world dynamic deployment requirements.

Evaluation and Benchmarking Agent and Tool Ecosystem EvoArena GAIA LoCoMo +1 more

5arXiv · cs.LG·11d ago·source ↗

Echo-Memory: Controlled study isolates memory mechanisms in action-conditioned world models

Echo-Memory is a controlled benchmark study comparing memory mechanisms in action-conditioned video world models, fixing all other variables (backbone, optimizer, evaluation) to isolate how history storage and retrieval affect scene consistency across camera departures and returns. The study compares raw context, compression-based memory, spatial summaries, and state-space recurrence under a shared video diffusion backbone. Key findings: raw context is a strong baseline for open-domain return; aggressive compression loses salient evidence; and block-wise state-space recurrence is the strongest mechanism for remembering world state across long horizons. The three-branch evaluation protocol reveals that replay fidelity is not a reliable proxy for true world memory.

Evaluation and Benchmarking Multimodal Progress Echo-Memory

6arXiv · cs.CL·23d ago·source ↗

VisualMem: Personal Visual Memory Benchmark and Architecture for Personalized AI Agents

This paper introduces a benchmark and hybrid architecture (VisualMem) for personal visual memory in long-term AI agent memory systems. The work addresses a gap in existing text-centric memory systems by capturing both explicit evidence (recurring user-associated entities) and implicit evidence (latent user facts from visual/multimodal cues) from images. VisualMem augments a text-memory backend with a structured personal visual memory module that uses conversational context to resolve identity, ownership, and durable user facts. Experiments show VisualMem substantially outperforms prior memory systems on the new benchmark while remaining competitive on standard text-memory benchmarks.

Long Context Evolution Evaluation and Benchmarking VisualMem long-term memory Personal Visual Memory Benchmark +3 more

5Berkeley Ai Research (Bair) Blog·1mo ago·source ↗

PEVA: Whole-Body Conditioned Egocentric Video Prediction for Embodied World Models

Researchers from BAIR introduce PEVA (Predicting Ego-centric Video from human Actions), a model that generates first-person video frames conditioned on 48-dimensional whole-body kinematic pose trajectories. The model uses an autoregressive conditional diffusion transformer trained on the Nymeria dataset, which pairs real-world egocentric video with body pose capture. PEVA can generate atomic action videos, simulate counterfactuals, and support long video generation, representing a step toward world models grounded in physically embodied human agents.

Agent and Tool Ecosystem Multimodal Progress PEVA Conditional Diffusion Transformer Berkeley AI Research (BAIR)+2 more

6arXiv · cs.AI·15d ago·source ↗

Benchmark Agent: Autonomous system for end-to-end benchmark construction

Researchers introduce Benchmark Agent, a fully autonomous agentic system that orchestrates the complete benchmark construction pipeline — from query analysis and subtask design to data annotation and quality control. The system was used to produce 15 benchmarks spanning text understanding, multimodal understanding, and domain-specific reasoning, with evaluation via human judges, LLM-as-a-judge, and consistency checks. The work addresses two persistent problems in the field: the labor intensity of benchmark creation and rapid performance saturation after release. Code and a demo will be publicly released.

Evaluation and Benchmarking Agent and Tool Ecosystem Benchmark Everything Everywhere All at Once Benchmark Agent