Cross-Modal Masking for Robust Silent Speech Synthesis Using sEMG and Lipreading
paper · active · provisional
cross-modal-masking-for-robust-silent-speech-synthesis-using-semg-and-lipreading-e11b3e7d · 1 event · first seen 8d ago
Aliases: Cross-Modal Masking for Robust Silent Speech Synthesis Using sEMG and Lipreading
More like this (12)
Leveraging Audio-LLMs to Filter Speech-to-Speech Training Data
The Lipreading Gap: Do VSR Models Perceive Visual Speech Like Human Lipreaders?
Speaker Group Encoding in Self-supervised Speech Recognition Models
From Self-Supervised Speech Models to Mixture-of-Experts for Robust Anti-Spoofing
Acoustic Cue Alignment in Audio Language Models for Speech Emotion Recognition
Beyond Fully Random Masking: Attention-Guided Denoising and Optimization for Diffusion Language Models
Multi-Faceted Interactivity Alignment in Full-Duplex Speech Models
Speech-to-Speech
Latent World Recovery for Multimodal Learning with Missing Modalities
Beyond task performance: Decoding bioacoustic embeddings with speech features
IWSLT 2026 Cross-Lingual Voice Cloning
Modeling Complex Behaviors: Multi-Personality Composition and Dynamic Switching in Vision-Language Models
Recent events (1)
Cross-modal masking framework improves silent speech synthesis from sEMG and lipreading
Researchers propose a masked multimodal speech synthesis framework that jointly trains on surface electromyography (sEMG) and video-based lipreading signals using modality masking to improve robustness to sensor failure or degradation. In multispeaker settings, the approach reduces word error rate by up to 14 absolute percentage points over the strongest unimodal baseline. Masking strategies outperform degradation-specific data augmentation for handling missing modalities, with phone-level analysis revealing complementary contributions across vowels and consonant groups.
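The summary above does not describe the paper's masking scheme in detail, so the following is only a minimal sketch of one common form of modality masking, assuming that each training example carries an sEMG feature sequence and a video feature sequence, and that at most one of them is zeroed out per example. The names `mask_modalities`, `zero_like`, and `p_mask`, and the 50/50 choice between modalities, are hypothetical and not taken from the paper.

```python
import random
from typing import List, Optional, Tuple

# A feature sequence: time steps x feature dimensions (illustrative layout).
Frames = List[List[float]]


def zero_like(frames: Frames) -> Frames:
    """Return an all-zero sequence with the same shape as `frames`."""
    return [[0.0] * len(frame) for frame in frames]


def mask_modalities(
    semg: Frames,
    video: Frames,
    p_mask: float = 0.3,
    rng: Optional[random.Random] = None,
) -> Tuple[Frames, Frames, str]:
    """Randomly hide one modality for a single training example.

    With probability `p_mask`, exactly one of the two inputs is replaced
    by zeros; otherwise both are returned unchanged. The two inputs are
    never masked together, so the model always sees some signal.
    The third return value names the masked modality ("none", "semg",
    or "video").
    """
    rng = rng or random.Random()
    if rng.random() >= p_mask:
        return semg, video, "none"
    # Assumption: both modalities are equally likely to be dropped.
    if rng.random() < 0.5:
        return zero_like(semg), video, "semg"
    return semg, zero_like(video), "video"


if __name__ == "__main__":
    semg_example = [[0.1, 0.2], [0.3, 0.4]]
    video_example = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
    _, _, masked = mask_modalities(semg_example, video_example, p_mask=0.5)
    print("masked modality:", masked)
```

Applied independently to every example in a batch, a scheme of this kind exposes the model to sEMG-only, video-only, and combined inputs during training, which is the general idea behind robustness to a missing or failed sensor.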