paper
Where Does the Answer Come From? Benchmarking View-Level Visual Evidence Identification in Multi-View MLLMs for Autonomous Driving
paper · active · provisional
where-does-the-answer-come-from-benchmarking-view-level-visual-evidence-identification-in-multi-view-mllms-for-autonomous-driving-48d46da1 · 1 event · first seen 8d ago
Aliases: Where Does the Answer Come From? Benchmarking View-Level Visual Evidence Identification in Multi-View MLLMs for Autonomous Driving
More like this (12)
Watch, Remember, Reason: Human-View Video Understanding with MLLMs
Visual Verification Enables Inference-time Steering and Autonomous Policy Improvement
Visual Question Answering
Bias Benchmark for Question Answering
Document Visual Question Answering
Human-Vehicle Interaction Benchmark
Multiview Counting
Long-context Reasoning Benchmarks
Expert-Aware Causal Tracing of Factual Recall in Sparse MoE Language Models
Multiple Instance Learning (MIL)
Multi-head Latent Attention (MLA)
Planning-aligned Token Compression for Long-Context Autonomous Driving
Recent events (1)
Benchmark for view-level visual evidence identification in multi-view MLLMs for autonomous driving
A new arXiv preprint introduces a multi-view visual question answering benchmark for evidence-source identification in autonomous driving scenarios. Given six synchronized nuScenes camera views and a question, a model must identify which camera view supports the answer, not just produce a correct answer. The 122-pair benchmark spans causality, counterfactual reasoning, and intent prediction, and exposes grounding failures that answer-only evaluation misses. The work addresses the gap between answer accuracy and correct visual grounding in safety-critical multimodal systems.
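
To make the difference between answer-only and evidence-aware evaluation concrete, here is a minimal scoring sketch in Python. It is not the paper's evaluation code. The `Prediction` record, the `score` function, the metric names, the exact-match comparison of answers, and the assumption of a single gold view per question are all hypothetical choices made for illustration. Only the six camera channel names come from nuScenes.

```python
from dataclasses import dataclass

# The six camera channels in nuScenes.
CAMERA_VIEWS = (
    "CAM_FRONT", "CAM_FRONT_LEFT", "CAM_FRONT_RIGHT",
    "CAM_BACK", "CAM_BACK_LEFT", "CAM_BACK_RIGHT",
)


@dataclass(frozen=True)
class Prediction:
    """One scored QA pair (hypothetical record layout, not the paper's)."""
    pred_answer: str
    gold_answer: str
    pred_view: str
    gold_view: str  # assumption: exactly one supporting view per question


def score(preds):
    """Return answer-only, view-only, and joint accuracy over all pairs."""
    if not preds:
        raise ValueError("need at least one prediction")
    for p in preds:
        if p.pred_view not in CAMERA_VIEWS or p.gold_view not in CAMERA_VIEWS:
            raise ValueError(f"unknown camera view in {p}")

    n = len(preds)
    # Assumption: answers are compared by case-insensitive exact match.
    answer_ok = [
        p.pred_answer.strip().lower() == p.gold_answer.strip().lower()
        for p in preds
    ]
    view_ok = [p.pred_view == p.gold_view for p in preds]

    return {
        "answer_acc": sum(answer_ok) / n,
        "view_acc": sum(view_ok) / n,
        # Correct answer grounded in the correct view.
        "joint_acc": sum(a and v for a, v in zip(answer_ok, view_ok)) / n,
        # Correct answer attributed to the wrong view: the failure mode
        # that answer-only evaluation cannot see.
        "right_answer_wrong_view": sum(
            a and not v for a, v in zip(answer_ok, view_ok)
        ) / n,
    }


example = [
    Prediction("yes", "yes", "CAM_FRONT", "CAM_FRONT"),
    Prediction("yes", "Yes", "CAM_BACK", "CAM_FRONT_LEFT"),
    Prediction("no", "yes", "CAM_BACK_RIGHT", "CAM_BACK_RIGHT"),
    Prediction("stop", "stop", "CAM_FRONT", "CAM_FRONT_RIGHT"),
]
result = score(example)
```

In this toy example, three of the four answers are correct, but only one of those is attributed to the correct camera, so `answer_acc` overstates how well the model is grounded compared with `joint_acc`.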