Entity · paper

The Lipreading Gap: Do VSR Models Perceive Visual Speech Like Human Lipreaders?

paper · active · the-lipreading-gap-do-vsr-models-perceive-visual-speech-like-human-lipreaders--f846addf · 1 event · first seen Jun 8, 2026

Aliases: The Lipreading Gap: Do VSR Models Perceive Visual Speech Like Human Lipreaders?

Co-occurring entities

MaFI

More like this (12)

Vision-Language Models
Seeing Is Not Sharing: Some Vision-Language Models Overestimate Common Ground in Asymmetric Dialogue
Small Vision-Language Models Know When They Are Wrong But Cannot Say So
Vision-Default, Prior-Override: Causal Mechanisms of Perception-Knowledge Conflict in Vision-Language Models
Cross-Modal Masking for Robust Silent Speech Synthesis Using sEMG and Lipreading
ENTRAP-VL: A Taxonomic Probe for Dual Contextual Entrainment in Vision-Language Models
LabVLA: Grounding Vision-Language-Action Models in Scientific Laboratories
Does VLA Even Know the Basics? Measuring Commonsense and World Knowledge Retention in Vision-Language-Action Models
Gaze Heads: How VLMs Look at What They Describe
visual language model
Interleaved Speech Language Models Latently Work In Text
Evolution of Accuracy and Visual-Cognitive Errors in a Decade of Vision-Language AI Models

Recent events (1)

5 · arXiv · cs.CL · Jun 8, 2026

VSR models outperform humans on lipreading benchmarks but rely on language cues, not visual perception

A new arXiv paper compares three visual speech recognition (VSR) systems against human lipreaders on the MaFI dataset using word-, character-, phoneme-, and viseme-level metrics. Although the VSR models achieve higher overall accuracy, they succeed and fail on different words than humans do, and their errors are better explained by word frequency in the training data than by visual informativeness. A text-only n-gram baseline given minimal phoneme input rivals human performance, suggesting that VSR systems primarily exploit language priors rather than genuine visual speech perception. The findings raise the question of whether benchmark-beating performance reflects the capability the benchmark purports to measure.
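To make the difference between phoneme-level and viseme-level scoring concrete, here is a minimal sketch in Python. It is not the paper's evaluation code: the phoneme-to-viseme table, the function names, and the example words are all assumptions chosen for illustration. The idea is that phonemes which look alike on the lips collapse into one viseme class, so a guess can be wrong at the phoneme level yet visually indistinguishable from the target.

```python
from typing import Dict, List, Sequence

# Hypothetical, deliberately tiny phoneme-to-viseme table (illustrative only;
# NOT the mapping used in the paper). Look-alike phonemes share a class.
PHONEME_TO_VISEME: Dict[str, str] = {
    "p": "BILABIAL", "b": "BILABIAL", "m": "BILABIAL",
    "f": "LABIODENTAL", "v": "LABIODENTAL",
    "t": "ALVEOLAR", "d": "ALVEOLAR", "n": "ALVEOLAR",
    "ae": "OPEN_VOWEL", "aa": "OPEN_VOWEL",
}


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (free if tokens match)
            ))
        prev = curr
    return prev[-1]


def error_rate(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """Edit distance normalised by the reference length."""
    if not ref:
        raise ValueError("reference sequence must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def to_visemes(phonemes: Sequence[str]) -> List[str]:
    """Map phonemes to viseme classes; unmapped phonemes pass through as-is."""
    return [PHONEME_TO_VISEME.get(p, p) for p in phonemes]


# Target word "bat", guessed as "mad" (simplified phoneme transcriptions).
target = ["b", "ae", "t"]
guess = ["m", "ae", "d"]

phoneme_er = error_rate(target, guess)
viseme_er = error_rate(to_visemes(target), to_visemes(guess))
print(f"phoneme error rate: {phoneme_er:.2f}")
print(f"viseme error rate:  {viseme_er:.2f}")
```

Under this toy mapping, the guess gets two of three phonemes wrong but matches the target exactly at the viseme level, which is the kind of distinction a multi-level comparison between models and human lipreaders can surface.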

Evaluation and Benchmarking · Multimodal Progress · MaFI · The Lipreading Gap: Do VSR Models Perceive Visual Speech Like Human Lipreaders?