Entity · paper

Cross-Modal Masking for Robust Silent Speech Synthesis Using sEMG and Lipreading

paper · active · cross-modal-masking-for-robust-silent-speech-synthesis-using-semg-and-lipreading-e11b3e7d · 1 event · first seen Jun 9, 2026

Aliases: Cross-Modal Masking for Robust Silent Speech Synthesis Using sEMG and Lipreading

More like this (12)

Leveraging Audio-LLMs to Filter Speech-to-Speech Training Data
Interleaved Speech Language Models Latently Work In Text
The Lipreading Gap: Do VSR Models Perceive Visual Speech Like Human Lipreaders?
Multimodal Voice Activity Projection
Hierarchical Acoustic-Semantic Modeling: Modality Separation and Semantic Coherence for Full-Duplex SLMs
Speaker Group Encoding in Self-supervised Speech Recognition Models
From Self-Supervised Speech Models to Mixture-of-Experts for Robust Anti-Spoofing
Acoustic Cue Alignment in Audio Language Models for Speech Emotion Recognition
Masked Image Modeling
Beyond Fully Random Masking: Attention-Guided Denoising and Optimization for Diffusion Language Models
CapSpeech-TTS
Multi-Faceted Interactivity Alignment in Full-Duplex Speech Models

Recent events (1)

4 · arXiv · cs.CL · Jun 9, 2026

Cross-modal masking framework improves silent speech synthesis from sEMG and lipreading

Researchers propose a masked multimodal speech synthesis framework that jointly trains on surface electromyography (sEMG) and video-based lipreading signals using modality masking to improve robustness to sensor failure or degradation. In multispeaker settings, the approach reduces word error rate by up to 14 absolute percentage points over the strongest unimodal baseline. Masking strategies outperform degradation-specific data augmentation for handling missing modalities, with phone-level analysis revealing complementary contributions across vowels and consonant groups.
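The summary above does not say how the modality masking is implemented, so the following is only a minimal sketch of one common way to do it: during training, occasionally zero out one input stream so the model learns to cope when a sensor is missing. The function name `mask_modalities`, the probability `p_mask`, and the choice of zero-filling are all assumptions made for illustration and are not taken from the paper.

```python
import random


def mask_modalities(emg, video, p_mask=0.3, rng=random):
    """Randomly zero out at most one modality for a single training example.

    emg, video: lists of feature frames (each frame is a list of floats).
    p_mask:     hypothetical probability that one modality is masked.
    rng:        any object with random() and choice(), e.g. random.Random(seed).

    Returns (emg_out, video_out, present), where `present` records which
    modalities were left intact.
    """
    present = {"emg": True, "video": True}
    if rng.random() < p_mask:
        # Drop exactly one stream so the model always receives some input.
        dropped = rng.choice(["emg", "video"])
        present[dropped] = False

    # Assumption: a masked modality is replaced by all-zero frames of the
    # same shape; a learned mask token would be another option.
    emg_out = emg if present["emg"] else [[0.0] * len(f) for f in emg]
    video_out = video if present["video"] else [[0.0] * len(f) for f in video]
    return emg_out, video_out, present
```

Applied per example inside a training loop, a scheme like this exposes the model to sEMG-only, video-only, and combined inputs, which is the kind of exposure the robustness claim relies on. The actual masking granularity and schedule used in the paper may differ.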

Multimodal · Progress · Cross-Modal Masking for Robust Silent Speech Synthesis Using sEMG and Lipreading