EG-VQA

benchmark · active · provisional · eg-vqa-00e7df66 · 1 event · first seen 23h ago

Aliases: EG-VQA

Recent events (1)

5 · arXiv · cs.AI · 23h ago

EG-VQA benchmark exposes gap between answer correctness and evidence grounding in Video-LLMs

Researchers introduce EG-VQA, a benchmark of 2,067 videos and 11,838 QA pairs designed to evaluate Video Large Language Models on both answer correctness and temporal evidence grounding simultaneously. A new metric, Evidence-Grounded F1 (EG-F1), jointly measures temporal alignment and semantic consistency against annotated ground-truth evidence. Experiments show that even strong proprietary models fail to reliably localize supporting evidence, revealing a fundamental gap between surface-level accuracy and faithful reasoning. The authors also propose EG-Reasoner, an evidence-supervised open-source model that achieves competitive results against proprietary systems.
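
The summary above does not give the formula for EG-F1, so the following is only a hypothetical illustration of how a score could jointly reflect temporal alignment and semantic consistency. It assumes temporal alignment is measured as intersection-over-union of (start, end) spans in seconds, that semantic consistency is approximated by token-overlap F1, and that the two are combined by a harmonic mean. All function names and these three choices are assumptions made for illustration, not the authors' definition.

```python
from collections import Counter


def temporal_iou(pred, gold):
    """Intersection-over-union of two (start, end) intervals in seconds."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union > 0 else 0.0


def token_f1(pred_text, gold_text):
    """Token-overlap F1: a crude stand-in for semantic consistency (assumption)."""
    pred_tokens = pred_text.lower().split()
    gold_tokens = gold_text.lower().split()
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def joint_evidence_score(pred_span, gold_span, pred_text, gold_text):
    """Harmonic mean of temporal IoU and token F1 (hypothetical combination)."""
    t = temporal_iou(pred_span, gold_span)
    s = token_f1(pred_text, gold_text)
    return 2 * t * s / (t + s) if (t + s) > 0 else 0.0


if __name__ == "__main__":
    # Made-up example: predicted evidence overlaps half of the annotated span.
    print(joint_evidence_score((0.0, 10.0), (5.0, 15.0),
                               "a man opens the door", "a man opens a door"))
```

Under a harmonic mean, a prediction whose span misses the annotated evidence scores zero however well its text matches, which mirrors the gap the summary describes between a correct-looking answer and correctly localized evidence.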