6 · arXiv cs.CL (Computation and Language) · 17h ago

Causal circuit analysis reveals how vision-language models resolve perception-knowledge conflicts

A new arXiv preprint uses activation patching and ablation studies to identify the mechanistic basis of perception-knowledge conflict across three vision-language model (VLM) families. The authors find that visual grounding is the default behavior, while knowledge-grounded responses depend on a small set of attention heads (2.5–4.8% of all heads) concentrated in the second half of the network. Ablating these heads flips knowledge-grounded predictions to visually grounded ones in 68–96% of cases while barely affecting visually grounded predictions, which reveals an asymmetric causal structure. The identified heads decompose into routing heads and writing heads, and the circuit is consistent across model families and scales.

Evaluation and Benchmarking · AI Safety Research · Multimodal Progress · Vision-Default, Prior-Override: Causal Mechanisms of Perception-Knowledge Conflict in Vision-Language Models
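
The head-ablation result summarized above can be illustrated with a toy sketch. This is not the authors' code: the single attention layer, the random weights, the choice of zero-ablation (the preprint may use a different ablation scheme), and the helper names `attention_layer` and `flip_rate` are all assumptions made for illustration.

```python
import numpy as np

def attention_layer(x, Wq, Wk, Wv, Wo, head_mask=None):
    """Toy multi-head self-attention with optional zero-ablation of heads.

    x: (seq, d_model); Wq/Wk/Wv: (n_heads, d_model, d_head);
    Wo: (n_heads, d_head, d_model). head_mask[h] == 0 removes head h's output.
    """
    n_heads = Wq.shape[0]
    mask = np.ones(n_heads) if head_mask is None else np.asarray(head_mask, dtype=float)
    out = np.zeros((x.shape[0], Wo.shape[-1]))
    for h in range(n_heads):
        q, k, v = x @ Wq[h], x @ Wk[h], x @ Wv[h]
        scores = q @ k.T / np.sqrt(q.shape[-1])
        scores = scores - scores.max(axis=-1, keepdims=True)  # numerically stable softmax
        attn = np.exp(scores)
        attn = attn / attn.sum(axis=-1, keepdims=True)
        out += mask[h] * (attn @ v @ Wo[h])  # each head adds its own term to the output
    return out

def flip_rate(before, after, source="knowledge", target="visual"):
    """Fraction of `source`-labelled predictions that become `target` after ablation."""
    before, after = np.asarray(before), np.asarray(after)
    eligible = before == source
    if not eligible.any():
        return 0.0
    return float(np.mean(after[eligible] == target))

# Hypothetical toy setup: 4 heads, random weights, head 2 ablated.
rng = np.random.default_rng(0)
n_heads, d_model, d_head, seq = 4, 8, 2, 5
x = rng.normal(size=(seq, d_model))
Wq, Wk, Wv = (rng.normal(size=(n_heads, d_model, d_head)) for _ in range(3))
Wo = rng.normal(size=(n_heads, d_head, d_model))

clean = attention_layer(x, Wq, Wk, Wv, Wo)
ablated = attention_layer(x, Wq, Wk, Wv, Wo, head_mask=[1, 1, 0, 1])
only_head_2 = attention_layer(x, Wq, Wk, Wv, Wo, head_mask=[0, 0, 1, 0])
```

Because each head's output is summed into the layer output, `clean - ablated` equals the contribution of the ablated head alone. The asymmetry described in the summary would show up as a high `flip_rate` from "knowledge" to "visual" and a low one in the opposite direction.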

Related guides (3)

AI Safety Research · Topic guide

AI Safety Research: From Lab Policies to Real-World Flashpoints

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Multimodal Progress · Topic guide

Multimodal Progress: How AI Learned to See, Hear, and Act

Related events (8)

6 · arXiv · cs.CL · 1mo ago

The Abstraction Gap in Vision-Language Causal Reasoning

Researchers introduce a dual-probe methodology and the CAGE benchmark (49,500 questions across 5,500 images) to distinguish linguistic plausibility from faithful causal reasoning in vision-language models. An Abstraction Gap (AG) metric quantifies the normalized performance difference between text-only and chain-of-reasoning probes. Of the eight VLMs evaluated, seven exhibit AG exceeding 0.50: they generate fluent causal text but fail structured causal chain tasks. One model achieves near-zero AG, suggesting that architectural and pretraining choices are decisive. Fine-tuning on 45,000 chain-annotated examples fails to close the gap, pointing to a fundamental capability distinction.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Pearl's Causal Hierarchy · CAGE · Text-Only Probe
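
To make the "normalized performance difference" concrete, here is a minimal sketch under a stated assumption: the summary does not give the paper's exact normalization, so the code below divides the gap by the text-only score, which is only one plausible choice. The function name `abstraction_gap` and the example accuracies are hypothetical.

```python
def abstraction_gap(text_only_acc: float, chain_acc: float) -> float:
    """Assumed form of AG: (text-only - chain) / text-only.

    NOTE: this normalization is an illustrative assumption, not the
    definition from the CAGE paper, which is not stated in the summary.
    """
    if text_only_acc <= 0:
        raise ValueError("text_only_acc must be positive to normalize by it")
    return (text_only_acc - chain_acc) / text_only_acc

# Hypothetical accuracies, chosen only to show the two regimes.
large_gap = abstraction_gap(0.8, 0.4)  # fluent text, weak chain reasoning
no_gap = abstraction_gap(0.6, 0.6)     # both probes perform equally
```

Under this assumed form, a value near 0 means the model does as well on structured chains as on text-only questions, and a value above 0.50 means more than half of the text-only performance disappears on the chain probe.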

3 · Hugging Face Blog · 1mo ago

Vision Language Models Explained

A Hugging Face blog post providing a technical overview of vision language models (VLMs), covering their architecture, training approaches, and capabilities. The post serves as an educational resource explaining how VLMs combine visual and language understanding. As a tier-2 commentary piece, it synthesizes existing knowledge rather than presenting new research findings.

Multimodal Progress · Vision-Language Models · Hugging Face

5 · arXiv · cs.LG · 11d ago

Act2Answer: Benchmarking commonsense and world knowledge retention in Vision-Language-Action models

Researchers introduce Act2Answer, a protocol for evaluating how much commonsense and factual knowledge VLA models retain after fine-tuning on robotics data. The approach converts knowledge benchmark questions into tabletop object-placement episodes, yielding action-grounded success rates that reduce confounds from low-level control failures. A large-scale study of 7 VLA models and 9 VLM baselines finds that VLAs retain solid performance on simple concepts but show larger gaps on richer semantic categories compared to their source VLMs, and that VQA co-training is associated with better knowledge retention.

Evaluation and Benchmarking · Multimodal Progress · Does VLA Even Know the Basics? Measuring Commonsense and World Knowledge Retention in Vision-Language-Action Models · Act2Answer

6 · arXiv · cs.CL · 14d ago

Gaze Heads: Attention heads in VLMs that track and control image region description

Researchers identify a small set of attention heads in vision-language model backbones, called 'gaze heads', whose attention patterns track the image region currently being described. Using comic strips as a controlled testbed, they show that intervening on the top-100 gaze heads (fewer than 9% of all heads) can steer the model to describe any chosen region at 83.1% accuracy, without retraining. The mechanism generalizes across model sizes from 2B to 32B parameters and to natural images (COCO), establishing a practical inference-time control lever for multimodal models via mechanistic analysis.

Multimodal Progress · Gaze Heads: How VLMs Look at What They Describe · baulab

6 · arXiv · cs.AI · 1mo ago

MedFocus: Causal Visual Attribution Framework for Chest X-ray Reasoning in Large Vision-Language Models

This paper addresses the faithfulness of visual attribution methods in Large Vision-Language Models (LVLMs) applied to chest X-ray (CXR) analysis. The authors develop a causal evaluation framework using counterfactual editing to verify whether expert-annotated regions are causally responsible for model predictions, testing 11 attribution methods across six open-source LVLMs. Finding that existing attribution methods frequently fail to identify the actual visual evidence used by models, they propose MedFocus, a concept-based attribution method using unbalanced optimal transport to localize anatomical regions and measure their causal effect on outputs. MedFocus substantially outperforms prior methods and provides spatial, concept-level, and token-level attributions.

Evaluation and Benchmarking · AI Safety Research · Unbalanced Optimal Transport · MedFocus · Chest X-ray Reasoning

6 · arXiv · cs.AI · 1mo ago

Lost in Fog: Sensor Perturbations Expose Reasoning Fragility in Driving VLAs

This paper presents a controlled robustness study of Vision-Language-Action (VLA) models in autonomous driving, evaluating Alpamayo R1 (10B parameters) across ~18,000 inference trials under eight sensor perturbation types including noise, lighting extremes, and fog. The key finding is that Chain-of-Causation (CoC) reasoning consistency is a high-fidelity proxy for trajectory reliability: when CoC explanations change post-perturbation, trajectory deviation spikes 5.3× (r=0.99 across attack types). Enabling CoC generation is associated with 11.8% average improvement in trajectory accuracy, and degradation under noise is approximately linear (R²=0.957), while standard preprocessing defenses offer only marginal benefit.

Evaluation and Benchmarking · AI Safety Research · Vision-Language-Action model · Chain-of-Causation · autonomous driving

5 · arXiv · cs.CL · 1mo ago

Real Images, Worse Judgments: Evaluating VLMs on Concreteness and Imagery

This paper evaluates whether vision-language models (VLMs) benefit from real image context when making lexical judgments about word concreteness and imagery. The authors find that real-image contexts frequently hurt alignment with human ratings, especially when visual evidence is least relevant to the word being judged. Probing and canonical correlation analysis reveal that real images cause representational shifts and increased sensitivity to spurious visual cues. Instructing models to focus on text-only content at inference time partially mitigates this degradation.

Evaluation and Benchmarking · Multimodal Progress · concreteness ratings · canonical correlation analysis · Vision-Language Models

6 · arXiv · cs.CL · 28d ago

Vision-Language Models Suppress Female Representations Under Ambiguous Input

This paper investigates gender bias in vision-language models (VLMs) when inputs are ambiguous (e.g., workers in full gear or seen from behind), finding that models default to male outputs even for strongly female-stereotyped occupations. The authors introduce LALS (Latent Association Leaning Score), a zero-shot metric that probes internal visual-token activations to measure concept associations across layers. Across 15 occupations, 800+ ambiguous images, and four VLMs, they find a systematic decoupling: models internally encode female associations but suppress them before generation, with male signals amplifying end-to-end while female signals peak mid-network and are filtered out. Cultural visual cues like clothing color further modulate these internal associations.

Evaluation and Benchmarking · AI Safety Research · gender bias in VLMs · Vision-Language Models · visual-token activation probing