VideoFDB: First Benchmark for Full-Duplex Audio-Visual Conversational Agent Evaluation
VideoFDB is introduced as the first benchmark targeting full-duplex audio-visual-to-audio-visual (AV2AV) conversational agents, filling a gap left by existing full-duplex benchmarks, which evaluate only speech. It provides 237 dyadic video-call clips covering 11 nonverbal conversational dynamics, a perception/generation taxonomy, and an LM-as-judge rubric framework. Evaluation across open- and closed-source vision-speech agents reveals systematic failure modes, including captioning collapse and visual-stream ignorance, and shows that current systems cannot perform the streaming joint audiovisual grounding required for natural conversation. Cascaded speech-to-avatar architectures are found to be inherently incapable of producing full-duplex nonverbal cues.
Related events (8)
OmniAgent: POMDP-based active perception agent for long video understanding with test-time scaling
Researchers introduce OmniAgent, a multimodal agent that reformulates long video understanding as a POMDP-based iterative Observation-Thought-Action cycle, selectively distilling audio-visual cues into persistent textual memory rather than processing all frames uniformly. The system uses Agentic Supervised Fine-Tuning and a novel reinforcement learning method (TAURA) with turn-level entropy for credit assignment. OmniAgent demonstrates positive test-time scaling and achieves state-of-the-art open-source results across ten benchmarks, with its 7B model outperforming Qwen2.5-VL-72B on LVBench (50.5% vs. 47.3%).
BayLing-Duplex: Native full-duplex speech dialogue using a single autoregressive LLM
Researchers introduce BayLing-Duplex, a speech language model that achieves native full-duplex interaction (simultaneous listening and speaking) using a single autoregressive LLM with no auxiliary VAD or turn-taking module. Built by fine-tuning GLM-4-Voice on 400K samples plus a lightweight DPO stage, it reaches 92% turn-taking success and 100% interruption success on InstructS2S-Eval, and improves speech-response quality substantially over Moshi. The approach adds only special tokens to the standard vocabulary, making it portable across LLMs without architectural changes.
M³Exam: Benchmark for Multimodal Memory in Realistic User-Agent Interactions
Researchers introduce M³Exam, a query-centric multimodal conversational memory benchmark designed to evaluate language agents on realistic user-agent interactions, including cross-modal grounding and implicit information inference. Existing benchmarks are critiqued for assuming sparse visuals and human-human interaction formats. The paper also proposes M³Proctor, a companion memory method that detects query modality bias and retrieves raw visual sources on demand, achieving a 13% accuracy improvement while reducing index-construction time and retrieved tokens by over 70%.
Moment-Video: Benchmark Diagnosing Temporal Fidelity of Video MLLMs on Momentary Visual Events
Moment-Video is a new benchmark of 1,000 human-verified video-QA pairs designed to evaluate how well video multimodal large language models (MLLMs) handle brief, localized visual events that may span only a few frames. The benchmark covers 7 domains and 25 subcategories across four task types: Temporal Occurrence, Temporal Counting, Action Description, and Temporal Reasoning. Evaluation of 33 proprietary and open-source models reveals severe deficiencies: the best model (Seed-2.0-Pro) achieves only 39.6% accuracy, while most open-source models score below 25%. Diagnostic analyses show that denser frame sampling helps but does not resolve the bottleneck, pointing to fundamental limitations in how current video MLLMs represent and preserve transient visual evidence.
DRFLOW: Benchmark for Evaluating Agent Workflow Prediction from Heterogeneous Sources
Researchers introduce DRFLOW, a benchmark targeting a gap in deep research (DR) agent evaluation: predicting concrete, personalized action-step workflows rather than generating summaries or reports. The benchmark contains 100 tasks across five domains, grounded in over 3,900 sources, with seven diagnostic metrics covering factual grounding, step recovery, structural ordering, and personalization. A reference agent (DRFA) is also presented, improving over strong baselines by up to 10% average F1 but leaving substantial headroom, indicating workflow prediction remains a hard open problem for DR agents.
VisualMem: Personal Visual Memory Benchmark and Architecture for Personalized AI Agents
This paper introduces a benchmark and hybrid architecture (VisualMem) for personal visual memory in long-term AI agent memory systems. The work addresses a gap in existing text-centric memory systems by capturing both explicit evidence (recurring user-associated entities) and implicit evidence (latent user facts from visual/multimodal cues) from images. VisualMem augments a text-memory backend with a structured personal visual memory module that uses conversational context to resolve identity, ownership, and durable user facts. Experiments show VisualMem substantially outperforms prior memory systems on the new benchmark while remaining competitive on standard text-memory benchmarks.
VISTA: Hybrid user simulation toolkit for interactive agent evaluation
Researchers introduce VISTA, a user simulation framework designed to address limitations in current agent evaluation methods, which rely on static benchmarks that miss dynamic, multi-step failure modes. VISTA provides six metrics for measuring realism, capability coverage, and interaction effectiveness, and combines UI-based and API-based interactions in a hybrid simulator. The toolkit is evaluated in e-commerce and education customer service settings, showing more realistic and comprehensive coverage than existing approaches.
Evaluating Audio Reasoning with Big Bench Audio
Hugging Face introduces Big Bench Audio, a new benchmark designed to evaluate audio reasoning capabilities in AI models. The benchmark appears to extend the Big Bench evaluation framework into the audio domain, targeting multimodal models that process and reason over audio inputs. This release addresses a gap in evaluation tooling for audio-capable language models.



