MaFI
benchmark · active · provisional
mafi-ceea98c3 · 1 event · first seen 9d ago
Aliases: MaFI
Recent events (1)
VSR models outperform humans on lipreading benchmarks but rely on language cues, not visual perception
A new arXiv paper compares three visual speech recognition (VSR) systems against human lipreaders on the MaFI dataset using word-, character-, phoneme-, and viseme-level metrics. Although the VSR models achieve higher overall accuracy, they succeed and fail on different words than humans do, and their errors are better explained by training word frequency than by visual informativeness. A text-only n-gram baseline given minimal phoneme input rivals human performance, suggesting that VSR systems primarily exploit language priors rather than genuine visual speech perception. The findings raise the question of whether benchmark-beating performance reflects the capability the benchmark purports to measure.
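To make the difference between phoneme-level and viseme-level scoring concrete, here is a minimal sketch of how the two error rates might be computed from an edit distance. The phoneme-to-viseme table and the helper names (`PHONEME_TO_VISEME`, `error_rate`, `to_visemes`) are illustrative assumptions, not the mapping or code used in the paper.

```python
from typing import Dict, List, Sequence

# Hypothetical, heavily simplified phoneme-to-viseme classes (an assumption
# for illustration only; the paper's actual mapping is not given here).
PHONEME_TO_VISEME: Dict[str, str] = {
    "P": "bilabial", "B": "bilabial", "M": "bilabial",
    "F": "labiodental", "V": "labiodental",
    "T": "alveolar", "D": "alveolar", "N": "alveolar",
    "AE": "open_vowel", "AA": "open_vowel",
}


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """Edit distance normalized by the reference length."""
    if not ref:
        raise ValueError("reference sequence must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def to_visemes(phonemes: Sequence[str]) -> List[str]:
    """Map phonemes to viseme classes; unmapped phonemes pass through."""
    return [PHONEME_TO_VISEME.get(p, p) for p in phonemes]


# "bat" misread as "pat": one phoneme differs, but /b/ and /p/ look the
# same on the lips, so the viseme sequences are identical.
ref = ["B", "AE", "T"]
hyp = ["P", "AE", "T"]
phoneme_er = error_rate(ref, hyp)
viseme_er = error_rate(to_visemes(ref), to_visemes(hyp))
```

Under this toy mapping, confusing "bat" with "pat" counts as a phoneme error but not a viseme error. This is why a viseme-level metric can separate mistakes that are visually unavoidable from mistakes that a language prior could have resolved.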