SHOVIR benchmark exposes vision shortcut learning failures in radiology report generation VLMs
Researchers introduce SHOVIR, a benchmark for detecting 'vision shortcut' behavior in Vision-Language Models applied to Radiology Report Generation (RRG), where models achieve high scores by exploiting learned priors rather than actual image evidence. The benchmark extends MIMIC-CXR and PadChest-GR with per-box CheXpert labels and uses localized occlusion experiments to isolate two failure modes: direct shortcuts (findings persist after visual evidence is removed) and contextual shortcuts (detection degrades when co-occurring pathologies are occluded). Evaluating eight state-of-the-art VLMs, the authors find that high report quality does not correlate with strong spatial grounding, revealing a systematic blind spot in current RRG evaluation protocols.
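The two failure modes can be illustrated with a minimal occlusion probe. This is a sketch under stated assumptions, not SHOVIR's actual protocol or metrics: the names `occlude`, `probe_shortcuts`, the box convention, and the three toy predictors are all hypothetical, and a real evaluation would call a VLM and extract CheXpert-style labels from its generated report.

```python
from typing import Callable, Dict, List, Set, Tuple

Image = List[List[float]]
Box = Tuple[int, int, int, int]  # (row0, col0, row1, col1), end-exclusive; assumed convention


def occlude(image: Image, box: Box, fill: float = 0.0) -> Image:
    """Return a copy of the image with the box region replaced by a constant fill."""
    r0, c0, r1, c1 = box
    return [
        [fill if (r0 <= r < r1 and c0 <= c < c1) else px for c, px in enumerate(row)]
        for r, row in enumerate(image)
    ]


def probe_shortcuts(
    predict: Callable[[Image], Set[str]],
    image: Image,
    target: str,
    target_box: Box,
    context_box: Box,
) -> Dict[str, bool]:
    """Flag the two shortcut types for one finding that the model reports at baseline.

    direct:     the finding is still reported after its own evidence is occluded.
    contextual: the finding disappears when a co-occurring pathology is occluded.
    """
    if target not in predict(image):
        # Nothing to probe if the finding is not reported on the intact image.
        return {"direct": False, "contextual": False}
    direct = target in predict(occlude(image, target_box))
    contextual = target not in predict(occlude(image, context_box))
    return {"direct": direct, "contextual": contextual}


def region_mean(image: Image, box: Box) -> float:
    r0, c0, r1, c1 = box
    values = [image[r][c] for r in range(r0, r1) for c in range(c0, c1)]
    return sum(values) / len(values)


# Toy 4x4 "image": bright top-left patch (target finding), bright bottom-right patch (context).
TARGET_BOX: Box = (0, 0, 2, 2)
CONTEXT_BOX: Box = (2, 2, 4, 4)
TOY_IMAGE: Image = [
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
]


def grounded_model(image: Image) -> Set[str]:
    """Hypothetical model that reports the finding only when its own region is bright."""
    return {"effusion"} if region_mean(image, TARGET_BOX) > 0.5 else set()


def prior_driven_model(image: Image) -> Set[str]:
    """Hypothetical model that always reports the finding, ignoring the image."""
    return {"effusion"}


def context_driven_model(image: Image) -> Set[str]:
    """Hypothetical model that infers the finding from the co-occurring region."""
    return {"effusion"} if region_mean(image, CONTEXT_BOX) > 0.5 else set()
```

Calling `probe_shortcuts(model, TOY_IMAGE, "effusion", TARGET_BOX, CONTEXT_BOX)` on each toy predictor separates the cases: the grounded model triggers neither flag, the prior-driven model triggers only the direct flag, and the context-driven model triggers both.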
Related events (8)
OCR-Robust benchmark evaluates VLM robustness to visual perturbations on OCR-reasoning tasks
Researchers introduce OCR-Robust, a benchmark of 812 samples designed to evaluate how vision-language models handle OCR-reasoning tasks under controlled visual degradation. The benchmark covers documents, scene text, charts, geometry, and tables, applying 5 perturbation types at 3 severity levels each, and evaluates 18 models using metrics including Relative Corruption Retention and a composite Corruption Robustness Index. Key findings show that higher clean accuracy does not guarantee robustness, and that chart and table inputs are substantially more fragile under perturbation than document-like inputs.
MedFocus: Causal Visual Attribution Framework for Chest X-ray Reasoning in Large Vision-Language Models
This paper addresses the faithfulness of visual attribution methods in Large Vision-Language Models (LVLMs) applied to chest X-ray (CXR) analysis. The authors develop a causal evaluation framework using counterfactual editing to verify whether expert-annotated regions are causally responsible for model predictions, testing 11 attribution methods across six open-source LVLMs. Finding that existing attribution methods frequently fail to identify the actual visual evidence used by models, they propose MedFocus, a concept-based attribution method using unbalanced optimal transport to localize anatomical regions and measure their causal effect on outputs. MedFocus substantially outperforms prior methods and provides spatial, concept-level, and token-level attributions.
TriViewBench: Controlled benchmark reveals fundamental multi-view spatial reasoning failures in MLLMs
Researchers introduce TriViewBench, a synthetic 3D benchmark of 1,923 scenes and 14K+ QA pairs designed to probe multi-view structural reasoning in MLLMs under controlled complexity scaling. Evaluating 18 open- and closed-source models, the study finds a universal capability hierarchy (Local Decision > Object Counting > Global Recovery) with severe performance collapse on Global Recovery tasks (80% relative drop at highest complexity). Chain-of-Thought prompting provides near-zero benefit, suggesting the bottleneck is cross-view spatial representation rather than reasoning strategy. The work identifies two mechanistically distinct failure modes in object counting: occlusion blindness causing undercounting in single-view tasks and cross-view identity confusion causing overcounting in multi-view tasks.
Chartographer: Counterfactual Chart Generation for Evaluating Vision-Language Models
Chartographer is a framework for generating counterfactual chart variants to rigorously evaluate visual reasoning in vision-language models (VLMs), addressing the problem of shortcut-taking and prior knowledge exploitation in chart QA benchmarks. The system reverse-engineers charts into executable code, generates seed-controlled variants, and derives new ground-truth answers via executable QA logic. Evaluation of proprietary and open-source VLMs reveals that models frequently fail to generalize to counterfactual charts even after correctly answering the original, with failures most common when novel visual reasoning pathways are required.
RefRad2D dataset and RadGrounder model enable spatially grounded radiology VLMs without manual annotations
Researchers introduce RefRad2D, a 1.2M-pair bilingual (German/English) CT and MR image-text dataset generated automatically via LLM curation and automated segmentation, requiring no manual spatial annotations. The accompanying RadGrounder model jointly performs report generation, VQA, and spatial grounding via bounding-box or segmentation outputs. On the external benchmarks Slake and VQA-RAD, RadGrounder matches specialized medical VLMs, and its added grounding supervision does not degrade language quality. The work demonstrates that large-scale automatically curated clinical data can transfer to downstream medical VQA tasks.
Act2Answer: Benchmarking commonsense and world knowledge retention in Vision-Language-Action models
Researchers introduce Act2Answer, a protocol for evaluating how much commonsense and factual knowledge VLA models retain after fine-tuning on robotics data. The approach converts knowledge benchmark questions into tabletop object-placement episodes, yielding action-grounded success rates that reduce confounds from low-level control failures. A large-scale study of 7 VLA models and 9 VLM baselines finds that VLAs retain solid performance on simple concepts but show larger gaps on richer semantic categories compared to their source VLMs, and that VQA co-training is associated with better knowledge retention.
Reroute: Training-free recoverable visual token routing for vision-language models
A new arXiv preprint proposes Reroute, a training-free plug-in that replaces the standard rank-and-remove visual token pruning paradigm in VLMs with a recoverable routing mechanism. Instead of permanently discarding low-ranked tokens, Reroute defers them to re-enter the candidate pool at later decoder stages, addressing the problem that token importance shifts across decoder depth. Evaluated on LLaVA-1.5 and Qwen backbones augmented with FastV, PDrop, and Nüwa pruning methods, Reroute improves grounding performance under aggressive token reduction without sacrificing general VQA accuracy. The approach preserves the theoretical compute and KV-cache budget of the underlying pruning method.
VSR models outperform humans on lipreading benchmarks but rely on language cues, not visual perception
A new arXiv paper compares three visual speech recognition (VSR) systems against human lipreaders on the MaFI dataset using word, character, phoneme, and viseme-level metrics. Despite higher overall accuracy, VSR models succeed and fail on different words than humans, and their errors are better explained by training word frequency than visual informativeness. A text-only n-gram baseline given minimal phoneme input rivals human performance, suggesting VSR systems primarily exploit language priors rather than genuine visual speech perception. The findings raise questions about whether benchmark-beating performance reflects the capability it purports to measure.

