EG-VQA benchmark exposes gap between answer correctness and evidence grounding in Video-LLMs
Researchers introduce EG-VQA, a benchmark of 2,067 videos and 11,838 QA pairs designed to evaluate Video Large Language Models on both answer correctness and temporal evidence grounding simultaneously. A new metric, Evidence-Grounded F1 (EG-F1), jointly measures temporal alignment and semantic consistency against annotated ground-truth evidence. Experiments show that even strong proprietary models fail to reliably localize supporting evidence, revealing a fundamental gap between surface-level accuracy and faithful reasoning. The authors also propose EG-Reasoner, an evidence-supervised open-source model that achieves competitive results against proprietary systems.
Related events (8)
Benchmark for view-level visual evidence identification in multi-view MLLMs for autonomous driving
A new arXiv preprint introduces a multi-view visual question answering benchmark targeting evidence-source identification in autonomous driving scenarios. Given six synchronized NuScenes camera views and a question, models must identify which camera view supports the answer, not merely produce a correct answer. The 122-pair benchmark spans causality, counterfactual reasoning, and intent prediction, and exposes grounding failures that answer-only evaluation misses. The work addresses a meaningful gap between answer accuracy and correct visual grounding in safety-critical multimodal systems.
Act2Answer: Benchmarking commonsense and world knowledge retention in Vision-Language-Action models
Researchers introduce Act2Answer, a protocol for evaluating how much commonsense and factual knowledge VLA models retain after fine-tuning on robotics data. The approach converts knowledge benchmark questions into tabletop object-placement episodes, yielding action-grounded success rates that reduce confounds from low-level control failures. A large-scale study of 7 VLA models and 9 VLM baselines finds that VLAs retain solid performance on simple concepts but show larger gaps on richer semantic categories compared to their source VLMs, and that VQA co-training is associated with better knowledge retention.
Moment-Video: Benchmark Diagnosing Temporal Fidelity of Video MLLMs on Momentary Visual Events
Moment-Video is a new benchmark of 1,000 human-verified video-QA pairs designed to evaluate how well video multimodal large language models (MLLMs) handle brief, localized visual events that may span only a few frames. The benchmark covers 7 domains and 25 subcategories across four task types: Temporal Occurrence, Temporal Counting, Action Description, and Temporal Reasoning. Evaluation of 33 proprietary and open-source models reveals severe deficiencies: the best model (Seed-2.0-Pro) achieves only 39.6% accuracy, while most open-source models score below 25%. Diagnostic analyses show that denser frame sampling helps but does not resolve the bottleneck, pointing to fundamental limitations in how current video MLLMs represent and preserve transient visual evidence.
The Abstraction Gap in Vision-Language Causal Reasoning
Researchers introduce a dual-probe methodology and the CAGE benchmark (49,500 questions across 5,500 images) to distinguish linguistic plausibility from faithful causal reasoning in vision-language models. An Abstraction Gap (AG) metric quantifies the normalized performance difference between text-only and chain-of-reasoning probes. Of the eight VLMs evaluated, seven exhibit an AG exceeding 0.50, generating fluent causal text but failing structured causal chain tasks, while one model achieves near-zero AG, suggesting that architectural and pretraining choices are decisive. Fine-tuning on 45,000 chain-annotated examples fails to close the gap, pointing to a fundamental capability distinction.
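As a rough illustration of a normalized gap of this kind, the sketch below divides the difference between the two probe scores by the text-only score. The summary does not give the paper's formula, so this normalization, the function name `abstraction_gap`, and the example numbers are all assumptions made for illustration only.

```python
def abstraction_gap(text_only_acc: float, chain_acc: float) -> float:
    """Hypothetical normalized gap between two probe accuracies.

    Assumption (not taken from the paper): the gap is normalized by the
    text-only accuracy, so 0.0 means the chain probe matches the text-only
    probe and 1.0 means the chain probe fails completely.
    """
    if text_only_acc <= 0:
        raise ValueError("text_only_acc must be positive to normalize the gap")
    return (text_only_acc - chain_acc) / text_only_acc


# Made-up accuracies, purely illustrative.
gap = abstraction_gap(text_only_acc=0.80, chain_acc=0.32)
print(f"AG = {gap:.2f}")
```

Under this assumed definition, a model that scores equally on both probes has a gap of zero, which matches the summary's description of the one model with near-zero AG.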
TempGlitch: Benchmark for Evaluating VLMs on Temporal Glitch Detection in Gameplay Videos
TempGlitch is a new benchmark designed to evaluate vision-language models on temporal glitch detection in gameplay videos, distinguishing temporal anomalies (visible only across ordered frames) from spatial ones (visible in a single frame). The benchmark covers five temporal glitch types with paired glitch-free videos for binary evaluation, and tests 12 proprietary and open-weight VLMs across multiple frame-sampling settings. Results show current VLMs perform near chance on temporal glitches, with neither denser frame sampling nor larger model size reliably improving detection. The work highlights a systematic gap in VLM temporal reasoning capabilities relevant to automated video quality assurance.
WikiVQABench: A Knowledge-Grounded Visual Question Answering Benchmark from Wikipedia and Wikidata
WikiVQABench is a new human-curated VQA benchmark that requires external knowledge beyond visual perception, constructed by combining Wikipedia images, captions, and Wikidata structured knowledge with LLM-generated question candidates reviewed by human annotators. The benchmark evaluates knowledge-intensive reasoning in vision-language models, covering 15 VLMs ranging from 256M to 90B parameters. Accuracy spans 24.7% to 75.6%, indicating meaningful discrimination across model scales. The dataset and code are publicly released.
LAVE: Zero-shot VQA Evaluation on Docmatix with LLMs - Do We Still Need Fine-Tuning?
This Hugging Face blog post introduces LAVE (LLM-Assisted Visual Evaluation), a zero-shot VQA evaluation methodology applied to the Docmatix dataset. The post investigates whether large vision-language models can perform document visual question answering without task-specific fine-tuning by leveraging LLM-based evaluation metrics. The analysis probes the gap between zero-shot and fine-tuned performance on document understanding tasks, raising questions about the continued necessity of supervised adaptation for VQA.
ParaEval framework reduces MCQA benchmark sensitivity to answer phrasing
A new arXiv preprint identifies a systematic flaw in multiple-choice QA benchmarks: log-likelihood scoring conflates surface-form familiarity with actual capability, producing false performance gaps exceeding 2 points between models trained on identical knowledge. The authors propose ParaEval, which queries models with multiple paraphrases per answer option and scores on the most favorable phrasing, reducing the false gap to below 1 point. The effect is confirmed on frontier 70B and 120B open-source models, suggesting widespread benchmark inflation in standard MCQA evaluations.
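The scoring rule described above (several paraphrases per answer option, each option scored by its most favorable phrasing) can be sketched as follows. This is a minimal sketch, not the authors' implementation: `paraeval_choice`, the `score` callable standing in for a model's log-likelihood, and the toy scores are hypothetical, and details such as length normalization are not covered by the summary and are left out.

```python
from typing import Callable, Dict, List


def paraeval_choice(
    option_paraphrases: Dict[str, List[str]],
    score: Callable[[str], float],
) -> str:
    """Return the option label whose best-scoring paraphrase scores highest.

    `score` is a hypothetical stand-in for a model's log-likelihood of an
    answer string given the question.
    """
    best_per_option = {
        label: max(score(text) for text in paraphrases)
        for label, paraphrases in option_paraphrases.items()
    }
    return max(best_per_option, key=best_per_option.get)


# Toy log-likelihoods (made up): the model "knows" option A is correct but
# finds its canonical phrasing unfamiliar.
toy_scores = {
    "Paris": -2.5,
    "the city of Paris": -0.8,
    "Lyon": -2.0,
    "the city of Lyon": -1.9,
}

options = {
    "A": ["Paris", "the city of Paris"],
    "B": ["Lyon", "the city of Lyon"],
}

# Single-phrasing scoring uses only the first phrasing of each option.
single = max(options, key=lambda label: toy_scores[options[label][0]])
multi = paraeval_choice(options, toy_scores.__getitem__)
print(single, multi)
```

In this toy case, scoring only the canonical phrasing selects option B, while taking the best phrasing per option selects option A, which is the kind of surface-form effect the preprint attributes to log-likelihood scoring.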

