Entity · paper

Gaze Heads: How VLMs Look at What They Describe

paper · active · gaze-heads-how-vlms-look-at-what-they-describe-f3ab77a7 · 1 event · first seen Jun 15, 2026

Aliases: Gaze Heads: How VLMs Look at What They Describe


More like this (12)

VisualMem
3D-Aware VLMs with Implicit and Explicit Geometries
LabVLA: Grounding Vision-Language-Action Models in Scientific Laboratories
Does VLA Even Know the Basics? Measuring Commonsense and World Knowledge Retention in Vision-Language-Action Models
Vision-Language Models
gender bias in VLMs
visual language model
FakeVLM
Social Gaze Consistency
The Lipreading Gap: Do VSR Models Perceive Visual Speech Like Human Lipreaders?
GLM-4-Voice
Gazer

Recent events (1)

6 · arXiv · cs.CL · Jun 15, 2026

Gaze Heads: Attention heads in VLMs that track and control image region description

Researchers identify a small set of attention heads in vision-language model (VLM) backbones, which they call 'gaze heads', whose attention patterns track the image region currently being described. Using comic strips as a controlled testbed, they show that intervening on the top 100 gaze heads (fewer than 9% of all heads) steers the model to describe any chosen region with 83.1% accuracy, without retraining. The mechanism generalizes across model sizes from 2B to 32B parameters and to natural images (COCO), which makes it a practical inference-time control lever for multimodal models, derived from mechanistic analysis.
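The summary does not say how gaze heads are scored or how the intervention is applied, so the following is only a toy sketch of the general idea, not the paper's method. It assumes a head's "gaze score" is the share of its attention mass on a region's image tokens, and it assumes the intervention is a bias added to the pre-softmax logits of the target region. The function names, the `bias` value, and the three toy heads are all hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def gaze_score(attn, region):
    """ASSUMED metric: share of one head's attention mass on a region's image tokens."""
    return sum(attn[i] for i in region)

def top_k_heads(head_attn, region, k):
    """Return the k head names with the highest gaze score for the region."""
    scores = {name: gaze_score(attn, region) for name, attn in head_attn.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

def steer(logits, target_region, bias=5.0):
    """ASSUMED intervention: add a bias to target-token logits, then renormalize."""
    biased = [x + bias if i in target_region else x for i, x in enumerate(logits)]
    return softmax(biased)

# Toy setup: six image tokens split into two comic panels (hypothetical data).
panel_a, panel_b = {0, 1, 2}, {3, 4, 5}
head_logits = {
    "h0": [2.0, 2.0, 2.0, 0.0, 0.0, 0.0],  # attends mostly to panel A
    "h1": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],  # uniform, no preference
    "h2": [0.0, 0.0, 0.0, 1.0, 1.0, 1.0],  # leans toward panel B
}
head_attn = {name: softmax(logits) for name, logits in head_logits.items()}

# While panel A is being described, pick the head that tracks it best.
selected = top_k_heads(head_attn, panel_a, k=1)

# Redirect that head toward panel B and measure where its attention now lands.
steered = steer(head_logits[selected[0]], panel_b)
print(selected, round(gaze_score(steered, panel_b), 3))
```

In a real model the same two steps would run over every layer and head, with attention weights read from the model rather than hand-written, and the steering applied only to the selected heads during generation.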

Multimodal Progress · baulab · Gaze Heads: How VLMs Look at What They Describe · COCO