Cross-modal masking framework improves silent speech synthesis from sEMG and lipreading
Researchers propose a masked multimodal speech synthesis framework that jointly trains on surface electromyography (sEMG) and video-based lipreading signals using modality masking to improve robustness to sensor failure or degradation. In multispeaker settings, the approach reduces word error rate by up to 14 absolute percentage points over the strongest unimodal baseline. Masking strategies outperform degradation-specific data augmentation for handling missing modalities, with phone-level analysis revealing complementary contributions across vowels and consonant groups.
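The modality masking idea can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `mask_modalities`, the masking probability `p_mask`, zero-filling as the masking operation, and the flat feature lists are all assumptions made for illustration. The sketch hides at most one modality per training example, so the model always receives some input.

```python
import random


def mask_modalities(emg, video, p_mask=0.3, rng=None):
    """Randomly hide at most one modality for a single training example.

    emg, video: lists of floats standing in for per-frame features
    (hypothetical shapes; real features would be tensors).
    Returns (emg, video, present), where `present` records which
    modalities were left intact.
    """
    if rng is None:
        rng = random.Random()
    present = {"emg": True, "video": True}
    if rng.random() < p_mask:
        # Drop exactly one modality so some signal always remains.
        dropped = rng.choice(["emg", "video"])
        present[dropped] = False
    # Zero-filling is an assumed masking operation, not taken from the paper.
    masked_emg = emg if present["emg"] else [0.0] * len(emg)
    masked_video = video if present["video"] else [0.0] * len(video)
    return masked_emg, masked_video, present
```

Applied per example during training, a scheme like this exposes the model to sEMG-only, video-only, and joint inputs, which is the situation it would face when a sensor fails at inference time.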
Related events (8)
MoE architecture improves self-supervised speech model robustness for anti-spoofing
Researchers propose converting a self-supervised speech representation model into a Mixture-of-Experts (MoE) architecture to improve generalization in synthetic speech detection. Feed-forward blocks in selected encoder layers are replaced by expert networks with a layer-wise gating mechanism, allowing complementary acoustic pattern capture while preserving pretrained representations. Evaluated across 14 spoofing datasets, the approach reduces macro Equal Error Rate from 5.46% to 4.81%, an 11.9% relative improvement over the baseline.
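The relative improvement quoted above follows directly from the two macro EER figures. A short check, where the helper name `relative_reduction` is introduced here for illustration only:

```python
def relative_reduction(baseline, improved):
    """Relative reduction of an error rate, in percent."""
    return (baseline - improved) / baseline * 100.0


# Macro EER values as reported in the summary: 5.46% to 4.81%.
print(round(relative_reduction(5.46, 4.81), 1))  # 11.9
```

The absolute gain is 0.65 percentage points; dividing by the 5.46% baseline gives the 11.9% relative figure.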
RL-based alignment improves interactivity in full-duplex spoken dialogue models
Researchers propose a post-training alignment method using reinforcement learning to improve interactivity in full-duplex spoken dialogue models, which can listen and speak simultaneously. The method addresses four canonical axes of interactivity (pause handling, turn-taking, backchanneling, and user interruption), each with an axis-specific reward function, plus an LLM-based reward to prevent semantic degradation. Applied to two open-source models, Moshi and PersonaPlex, it shows consistent improvements in both offline and real-time multi-turn evaluation.
Acoustic cue alignment tokens improve speech emotion recognition in audio language models
Researchers study whether instruction-following audio language models (ALMs) use explicit acoustic cues in a grounded way when raw audio is already available. They derive six interpretable acoustic concept tokens from the eGeMAPS feature set and append them to text prompts, testing on FAU-Aibo and IEMOCAP benchmarks. Aligned tokens improve unweighted average recall while shuffled or corrupted tokens degrade performance, but models don't fully collapse under perturbation, indicating partial anchoring to the audio signal. The work offers a practical probing method for interpretability and robustness in affective computing with ALMs.
VSR models outperform humans on lipreading benchmarks but rely on language cues, not visual perception
A new arXiv paper compares three visual speech recognition (VSR) systems against human lipreaders on the MaFI dataset using word, character, phoneme, and viseme-level metrics. Despite higher overall accuracy, VSR models succeed and fail on different words than humans, and their errors are better explained by training word frequency than visual informativeness. A text-only n-gram baseline given minimal phoneme input rivals human performance, suggesting VSR systems primarily exploit language priors rather than genuine visual speech perception. The findings raise questions about whether benchmark-beating performance reflects the capability it purports to measure.
Multilingual word-level forced alignment using MMS and learned dynamic programming outperforms MFA
Researchers present a forced alignment system combining Meta's Massively Multilingual Speech (MMS) model with a self-supervised phoneme boundary detector (UnSupSeg) and a learned dynamic programming decoder. Trained on TIMIT and Buckeye, the system outperforms Montreal Forced Aligner and MMS-based alignment on both datasets and generalizes to unseen languages (Dutch, German, Hebrew) without additional training. The authors suggest the approach could scale to the 1,100+ languages supported by MMS, making it relevant for low-resource speech processing pipelines.
Latent World Recovery: multimodal learning framework for missing modalities in bioscience
A new arXiv preprint introduces Latent World Recovery (LWR), a framework for multimodal learning when some modalities are unavailable at training or inference time. LWR aligns modality-specific embeddings in a shared latent space and fuses only available modalities, avoiding explicit reconstruction of missing ones. The approach is evaluated on incomplete multi-omics benchmarks for cancer phenotype classification and survival prediction, demonstrating robustness under partial observation.
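Fusing only the available modalities, without reconstructing the missing ones, can be sketched as follows. The summary does not state which fusion operator LWR uses; mean pooling, the function name `fuse_available`, and the modality names in the usage example are assumptions chosen for illustration.

```python
def fuse_available(embeddings):
    """Average the embeddings of modalities that are present.

    embeddings: dict mapping a modality name to a list of floats in the
    shared latent space, or to None when that modality is missing.
    Mean pooling is an assumed fusion operator, not taken from the paper.
    """
    available = [e for e in embeddings.values() if e is not None]
    if not available:
        raise ValueError("at least one modality must be available")
    dim = len(available[0])
    # Missing modalities are simply skipped; nothing is imputed.
    return [sum(e[i] for e in available) / len(available) for i in range(dim)]


# Hypothetical sample with one modality missing.
sample = {"rna": [1.0, 3.0], "methylation": None, "protein": [3.0, 5.0]}
print(fuse_available(sample))  # [2.0, 4.0]
```

Because the fused vector has the same dimension regardless of how many modalities contribute, a downstream classifier or survival model can consume it unchanged under partial observation.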
LoMo: Local Modality Substitution for Deeper Vision-Language Fusion
This paper identifies a 'carrier sensitivity' problem in Vision-Language Models (VLMs), where replacing textual queries with rendered-image equivalents causes significant performance degradation due to asymmetric roles of text and images in training data. The authors propose Local Modality Substitution (LoMo), a data curation paradigm that reformulates single-modality prompts into interleaved multimodal sequences by dynamically rendering text spans as images, enforcing cross-modal representational invariance. Evaluated across 13 multimodal benchmarks, LoMo improves over standard supervised fine-tuning by 2.67 points on LLaVA-OneVision-1.5-8B and 2.82 points on Qwen3.5-9B. The approach is architecture-agnostic and lightweight, requiring no changes to model architecture.
BayLing-Duplex: Native full-duplex speech dialogue using a single autoregressive LLM
Researchers introduce BayLing-Duplex, a speech language model that achieves native full-duplex interaction (simultaneous listening and speaking) using a single autoregressive LLM with no auxiliary VAD or turn-taking module. Built by fine-tuning GLM-4-Voice on 400K samples plus a lightweight DPO stage, it reaches 92% turn-taking success and 100% interruption success on InstructS2S-Eval, and improves speech-response quality substantially over Moshi. The approach adds only special tokens to the standard vocabulary, making it portable across LLM architectures.
