The Lipreading Gap: Do VSR Models Perceive Visual Speech Like Human Lipreaders?
VSR models outperform humans on lipreading benchmarks but rely on language cues, not visual perception
A new arXiv paper compares three visual speech recognition (VSR) systems against human lipreaders on the MaFI dataset, using word-, character-, phoneme-, and viseme-level metrics. Despite their higher overall accuracy, the VSR models succeed and fail on different words than humans do, and their errors are better explained by word frequency in the training data than by visual informativeness. A text-only n-gram baseline given minimal phoneme input rivals human performance, suggesting that VSR systems rely primarily on language priors rather than on genuine visual speech perception. The findings raise the question of whether benchmark-beating performance reflects the capability the benchmark purports to measure.
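
To make the difference between phoneme-level and viseme-level scoring concrete, here is a minimal sketch in Python. It is not the paper's evaluation code: the phoneme-to-viseme table, the function names, and the example transcriptions are all hypothetical, chosen only to illustrate how phonemes that look alike on the lips can collapse into one viseme class and so change the error rate.

```python
from typing import Sequence

# Hypothetical, heavily simplified phoneme-to-viseme table (an assumption for
# illustration only; it is NOT the mapping used in the paper).
PHONEME_TO_VISEME = {
    "p": "bilabial", "b": "bilabial", "m": "bilabial",
    "f": "labiodental", "v": "labiodental",
    "t": "alveolar", "d": "alveolar", "n": "alveolar",
    "ae": "open_vowel",
}


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences of comparable tokens."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (free if tokens match)
            ))
        prev = curr
    return prev[-1]


def error_rate(ref: Sequence, hyp: Sequence) -> float:
    """Edit distance normalized by the reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def to_visemes(phonemes: Sequence[str]) -> list:
    """Map each phoneme to its (hypothetical) viseme class."""
    return [PHONEME_TO_VISEME[p] for p in phonemes]


# Made-up example: the reference word is "bat", the prediction is "pad".
reference = ["b", "ae", "t"]
hypothesis = ["p", "ae", "d"]

phoneme_error = error_rate(reference, hypothesis)
viseme_error = error_rate(to_visemes(reference), to_visemes(hypothesis))

print(f"phoneme error rate: {phoneme_error:.2f}")
print(f"viseme error rate:  {viseme_error:.2f}")
```

Under this toy mapping, the prediction gets two of three phonemes wrong yet matches the reference exactly at the viseme level, because /b/ and /p/, and /t/ and /d/, fall into the same visual class. Scoring at several levels in this way can help separate errors that are visually plausible from errors that are not.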