OLIVE: Joint masked latent prediction and waveform reconstruction for self-supervised speech representation learning
OLIVE (Online Latent prediction with Invariant Views and rEconstruction) is a new self-supervised learning (SSL) framework for speech representations that combines view-augmented masked latent prediction with waveform reconstruction under a unified objective. The reconstruction objective constrains early encoder features to retain signal-level information, while masked latent prediction shapes later contextual representations toward invariance. The authors report improvements on generation and speaker tasks, competitive performance on recognition and semantic tasks, and better waveform reconstruction than prior SSL approaches.
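As a rough illustration of what a unified objective of this kind can look like, the sketch below adds a latent prediction term, computed only at masked frames, to a waveform reconstruction term. The summary does not give the actual loss functions or weighting, so the mean squared error, the `recon_weight` parameter, and all function names here are assumptions made for illustration, not the authors' formulation.

```python
def mse(a, b):
    """Mean squared error between two equal-length sequences of floats."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def joint_loss(pred_latents, target_latents, mask,
               recon_wave, clean_wave, recon_weight=1.0):
    """Hypothetical joint objective: masked latent prediction + reconstruction.

    pred_latents / target_latents: one latent vector per frame.
    mask: one boolean per frame, True where the input frame was masked.
    recon_wave / clean_wave: reconstructed and reference waveform samples.
    recon_weight: assumed scalar balancing the two terms.
    """
    masked = [i for i, is_masked in enumerate(mask) if is_masked]
    if masked:
        # Latent prediction is scored only at masked positions.
        latent_term = sum(
            mse(pred_latents[i], target_latents[i]) for i in masked
        ) / len(masked)
    else:
        latent_term = 0.0
    # Reconstruction is scored over the whole waveform.
    recon_term = mse(recon_wave, clean_wave)
    return latent_term + recon_weight * recon_term
```

In this toy form, raising `recon_weight` pushes the model to keep more signal-level detail, while lowering it favors the latent prediction term.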
Related events (8)
Cross-modal masking framework improves silent speech synthesis from sEMG and lipreading
Researchers propose a masked multimodal speech synthesis framework that jointly trains on surface electromyography (sEMG) and video-based lipreading signals using modality masking to improve robustness to sensor failure or degradation. In multispeaker settings, the approach reduces word error rate by up to 14 absolute percentage points over the strongest unimodal baseline. Masking strategies outperform degradation-specific data augmentation for handling missing modalities, with phone-level analysis revealing complementary contributions across vowels and consonant groups.
LeVo 2: Hybrid LLM-Diffusion framework for stable full-length song generation with hierarchical modeling
LeVo 2 is a new hybrid LLM-Diffusion system for controllable full-length song generation that addresses the coherence-vs-acoustics trade-off through hierarchical token prediction: a language model handles semantic planning via mixed tokens, then predicts vocal and accompaniment tracks in parallel, while a diffusion-based codec reconstructs waveforms. A key contribution is an aesthetics-guided progressive post-training schedule combining SFT, offline DPO, and semi-online DPO to separately optimize quality, controllability, and musicality. Expert listening tests show LeVo 2 outperforms open-source baselines across six subjective dimensions and approaches leading commercial systems on several metrics.
Vision-OPD: On-Policy Self-Distillation for Fine-Grained Visual Understanding in MLLMs
Vision-OPD addresses a 'regional-to-global perception gap' in multimodal LLMs, where models answer fine-grained visual questions more accurately when given cropped evidence regions than full images. The method instantiates a crop-conditioned teacher and full-image-conditioned student from the same MLLM, minimizing token-level divergence along on-policy rollouts to transfer regional perception to the full-image policy. This self-distillation requires no external teacher models, ground-truth labels, reward verifiers, or inference-time tools. Benchmarks show competitive or superior performance against larger open-source, closed-source, and agentic 'Thinking-with-Images' models.
MoE architecture improves self-supervised speech model robustness for anti-spoofing
Researchers propose converting a self-supervised speech representation model into a Mixture-of-Experts (MoE) architecture to improve generalization in synthetic speech detection. Feed-forward blocks in selected encoder layers are replaced by expert networks with a layer-wise gating mechanism, allowing complementary acoustic pattern capture while preserving pretrained representations. Evaluated across 14 spoofing datasets, the approach reduces macro Equal Error Rate from 5.46% to 4.81%, an 11.9% relative improvement over the baseline.
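The relative improvement quoted above follows directly from the two macro Equal Error Rate figures. A short check, where `relative_reduction` is a helper name introduced here for illustration:

```python
def relative_reduction(baseline, improved):
    """Fraction by which `improved` is lower than `baseline`."""
    return (baseline - improved) / baseline


# Macro EER figures from the summary: 5.46% (baseline) and 4.81% (MoE variant).
print(f"{relative_reduction(5.46, 4.81):.1%}")  # prints 11.9%
```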
d-OPSD: First on-policy self-distillation framework tailored for diffusion LLMs
Researchers introduce d-OPSD, the first on-policy self-distillation (OPSD) framework designed specifically for diffusion large language models (dLLMs). The method addresses a fundamental mismatch between existing autoregressive OPSD approaches and dLLMs' arbitrary-order generation by using suffix conditioning on self-generated answers and step-level rather than token-level divergence supervision. Across four reasoning benchmarks, d-OPSD outperforms RLVR and SFT baselines while requiring only ~10% of the optimization steps of RLVR, suggesting strong sample efficiency gains for dLLM post-training.
Latent World Recovery: multimodal learning framework for missing modalities in bioscience
A new arXiv preprint introduces Latent World Recovery (LWR), a framework for multimodal learning when some modalities are unavailable at training or inference time. LWR aligns modality-specific embeddings in a shared latent space and fuses only available modalities, avoiding explicit reconstruction of missing ones. The approach is evaluated on incomplete multi-omics benchmarks for cancer phenotype classification and survival prediction, demonstrating robustness under partial observation.
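One simple way to fuse only the modalities that are present is to average their embeddings once they share a latent space. The summary does not describe LWR's actual fusion operator, so the element-wise mean below, the function name `fuse_available`, and the modality names in the usage example are assumptions chosen for illustration.

```python
def fuse_available(embeddings):
    """Average the embeddings of the modalities that are present.

    embeddings: dict mapping a modality name to a vector in the shared
    latent space, or to None when that modality is missing.
    """
    available = [vec for vec in embeddings.values() if vec is not None]
    if not available:
        raise ValueError("at least one modality must be present")
    dim = len(available[0])
    # Element-wise mean over available modalities only; nothing is
    # imputed or reconstructed for the missing ones.
    return [sum(vec[i] for vec in available) / len(available)
            for i in range(dim)]


# Hypothetical sample with one missing modality.
sample = {"rna": [1.0, 3.0], "methylation": None, "cnv": [3.0, 5.0]}
fused = fuse_available(sample)
```

Because missing entries are skipped rather than filled in, the same function handles any subset of modalities without a separate imputation step.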
TEVI: Sparse autoencoders for text-conditioned editing of CLIP image embeddings to improve vision-language alignment
TEVI is a framework that uses sparse autoencoders to disentangle CLIP image embeddings and a learned masking module to selectively reconstruct embeddings conditioned on a given caption, addressing the information imbalance between images and their captions. The approach improves image-text retrieval on both coarse-grained benchmarks (MS COCO, Flickr) and fine-grained long-caption benchmarks (IIW, DOCCI), with larger gains on richer captions. The work also shows improved robustness on the RoCOCO benchmark.
Audio Interaction Model: Unified Streaming LALM with Always-On Perceive-Decide-Respond Loop
Researchers introduce the Audio Interaction Model framework and a concrete implementation called Audio-Interaction, a unified streaming Large Audio Language Model that handles both offline tasks and real-time audio interaction through a continuous perceive-decide-respond loop. The system is built on SoundFlow, a framework covering data construction, training, and asynchronous low-latency inference. The authors also release StreamAudio-2M, a 2.6M-item streaming corpus spanning 28 sub-tasks, and Proactive-Sound-Bench for evaluating proactive audio intervention. Evaluated across 8 benchmarks, the model preserves competitive offline performance while enabling real-time ASR, streaming instruction following, and proactive response capabilities not available in prior offline LALMs.
