
Evidence-Grounded F1

technique · active · provisional · evidence-grounded-f1-2b68d306 · 1 event · first seen 22h ago

Aliases: Evidence-Grounded F1

Recent events (1)

5 · arXiv · cs.AI · 22h ago

EG-VQA benchmark exposes gap between answer correctness and evidence grounding in Video-LLMs

Researchers introduce EG-VQA, a benchmark of 2,067 videos and 11,838 QA pairs designed to evaluate Video Large Language Models jointly on answer correctness and temporal evidence grounding. A new metric, Evidence-Grounded F1 (EG-F1), measures both temporal alignment and semantic consistency against annotated ground-truth evidence. Experiments show that even strong proprietary models fail to reliably localize supporting evidence, revealing a gap between surface-level accuracy and faithful reasoning. The authors also propose EG-Reasoner, an evidence-supervised open-source model that achieves results competitive with proprietary systems.
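
The summary does not give the formula for EG-F1, so the following is only an illustrative sketch of how a metric of this kind could combine temporal alignment with semantic consistency. It is not the authors' definition. Everything in it is an assumption made for illustration: the greedy one-to-one matching, the two thresholds (iou_thresh, sim_thresh), the function names, and the word-overlap similarity, which stands in for whatever semantic scorer the paper actually uses.

```python
from typing import Callable, List, Tuple

Segment = Tuple[float, float]   # (start_sec, end_sec)
Evidence = Tuple[Segment, str]  # time span plus a textual description


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection-over-union of two time intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def token_jaccard(x: str, y: str) -> float:
    """Toy stand-in for a semantic similarity model (assumption)."""
    xs, ys = set(x.lower().split()), set(y.lower().split())
    return len(xs & ys) / len(xs | ys) if xs | ys else 0.0


def evidence_f1(
    pred: List[Evidence],
    gold: List[Evidence],
    iou_thresh: float = 0.5,   # hypothetical threshold
    sim_thresh: float = 0.5,   # hypothetical threshold
    sim: Callable[[str, str], float] = token_jaccard,
) -> float:
    """F1 over evidence items; a prediction counts as correct only if it
    matches an unused gold item both temporally and semantically."""
    matched_gold = set()
    true_pos = 0
    for p_seg, p_text in pred:
        for j, (g_seg, g_text) in enumerate(gold):
            if j in matched_gold:
                continue
            if (temporal_iou(p_seg, g_seg) >= iou_thresh
                    and sim(p_text, g_text) >= sim_thresh):
                matched_gold.add(j)
                true_pos += 1
                break
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold = [((0.0, 10.0), "man opens door"), ((20.0, 30.0), "dog runs outside")]
pred = [((1.0, 10.0), "man opens door"), ((50.0, 60.0), "dog runs outside")]
score = evidence_f1(pred, gold)
```

In this toy case the second prediction describes the right event but points at the wrong time span, so it is not credited. That is the kind of failure a joint metric is meant to expose: correct-sounding evidence that is not actually localized in the video.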