3 · arXiv cs.AI (Artificial Intelligence) · 5d ago

MoE architecture improves self-supervised speech model robustness for anti-spoofing

Researchers propose converting a self-supervised speech representation model into a Mixture-of-Experts (MoE) architecture to improve generalization in synthetic speech detection. The feed-forward blocks in selected encoder layers are replaced by expert networks with a layer-wise gating mechanism, so that the experts can capture complementary acoustic patterns while the pretrained representations are preserved. Evaluated across 14 spoofing datasets, the approach reduces macro Equal Error Rate from 5.46% to 4.81%, an 11.9% relative improvement over the baseline.

Evaluation and Benchmarking · Mixture of Experts · From Self-Supervised Speech Models to Mixture-of-Experts for Robust Anti-Spoofing
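As a rough illustration of the conversion described in the summary above, the following sketch replaces one feed-forward block with a softly gated mixture of experts. It is a minimal sketch under stated assumptions, not the paper's implementation: the class names `FeedForward` and `MoEFeedForward`, the per-frame softmax gate, the toy dimensions, and the choice to initialize every expert as a copy of the pretrained block are all hypothetical.

```python
import numpy as np

def gelu(x):
    # Tanh approximation of GELU.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

class FeedForward:
    """Stand-in for a pretrained encoder feed-forward block."""
    def __init__(self, w1, b1, w2, b2):
        self.w1, self.b1, self.w2, self.b2 = w1, b1, w2, b2

    def __call__(self, x):
        return gelu(x @ self.w1 + self.b1) @ self.w2 + self.b2

    def copy(self):
        return FeedForward(self.w1.copy(), self.b1.copy(), self.w2.copy(), self.b2.copy())

class MoEFeedForward:
    """Hypothetical MoE replacement: experts start as copies of the pretrained block."""
    def __init__(self, pretrained_ffn, num_experts, rng):
        d_model = pretrained_ffn.w1.shape[0]
        self.experts = [pretrained_ffn.copy() for _ in range(num_experts)]
        # Assumption: one gate per converted layer, applied to every frame.
        self.gate_w = rng.normal(scale=0.02, size=(d_model, num_experts))

    def __call__(self, x):
        weights = softmax(x @ self.gate_w)                        # (frames, experts)
        outputs = np.stack([e(x) for e in self.experts], axis=1)  # (frames, experts, d_model)
        return (weights[..., None] * outputs).sum(axis=1)         # (frames, d_model)

rng = np.random.default_rng(0)
d_model, d_hidden, frames = 8, 32, 5
ffn = FeedForward(rng.normal(size=(d_model, d_hidden)), np.zeros(d_hidden),
                  rng.normal(size=(d_hidden, d_model)), np.zeros(d_model))
moe = MoEFeedForward(ffn, num_experts=4, rng=rng)
x = rng.normal(size=(frames, d_model))
```

Because the gate weights sum to one and the experts start out identical, `moe(x)` matches `ffn(x)` before any fine-tuning, which is one plausible way to keep pretrained representations intact at initialization. The experts only diverge once training updates them separately.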

Related guides (2)

Mixture of Experts · Concept

Mixture of Experts: How AI Models Do More by Using Less

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

5 · Hugging Face Blog · 1mo ago

Mixture of Experts Explained

This Hugging Face blog post provides a technical overview of the Mixture of Experts (MoE) architecture, explaining how sparse gating mechanisms route tokens to subsets of expert feed-forward layers to achieve computational efficiency. The post covers training dynamics, inference considerations, and the tradeoffs between dense and sparse models. It serves as a reference document contextualizing MoE's growing relevance following high-profile model releases using the architecture.

Training Infrastructure · Frontier Model Releases · Mixture of Experts · Hugging Face · sparse gating

4 · arXiv · cs.CL · 11d ago

Cross-modal masking framework improves silent speech synthesis from sEMG and lipreading

Researchers propose a masked multimodal speech synthesis framework that jointly trains on surface electromyography (sEMG) and video-based lipreading signals using modality masking to improve robustness to sensor failure or degradation. In multispeaker settings, the approach reduces word error rate by up to 14 absolute percentage points over the strongest unimodal baseline. Masking strategies outperform degradation-specific data augmentation for handling missing modalities, with phone-level analysis revealing complementary contributions across vowels and consonant groups.

Multimodal Progress · Cross-Modal Masking for Robust Silent Speech Synthesis Using sEMG and Lipreading
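The modality masking idea in the summary above can be sketched as a training-time augmentation that occasionally blanks one input stream. This is only an illustrative sketch: the function name `mask_one_modality`, the masking probability, the even split between modalities, and the use of zeros as the mask value are assumptions, not details from the paper.

```python
import random

def mask_one_modality(emg, video, p_mask, rng):
    """Return (emg, video) with at most one modality replaced by zeros."""
    if rng.random() >= p_mask:
        return emg, video                   # keep both modalities
    if rng.random() < 0.5:
        return [0.0] * len(emg), video      # simulate a failed sEMG sensor
    return emg, [0.0] * len(video)          # simulate a missing camera feed

rng = random.Random(0)
emg_frame = [0.3, -1.2, 0.8]    # toy sEMG features for one frame
video_frame = [0.5, 0.1]        # toy lip-video features for one frame

# Apply the augmentation to many training examples.
batch = [mask_one_modality(emg_frame, video_frame, p_mask=0.5, rng=rng)
         for _ in range(1000)]
```

The sketch never masks both streams at once, so every training example still carries some signal. A model trained this way sees both the fused and the single-modality cases.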

6 · arXiv · cs.CL · 1mo ago

ZEDA: Post-Trained MoE Models Can Skip Half Their Experts via Self-Distillation

This paper introduces Zero-Expert Self-Distillation Adaptation (ZEDA), a framework that converts static post-trained Mixture-of-Experts (MoE) language models into dynamic ones without pre-training from scratch. ZEDA injects parameter-free zero-output experts into each MoE layer and uses two-stage self-distillation with the original model as a frozen teacher. Applied to Qwen3-30B-A3B and GLM-4.7-Flash across 11 benchmarks, ZEDA eliminates over 50% of expert FLOPs with marginal accuracy loss and achieves approximately 1.20× end-to-end inference speedup, outperforming the strongest dynamic MoE baseline by 4–6 points.

Training Infrastructure · Frontier Model Releases · Self-Distillation · ZEDA (Zero-Expert Self-Distillation Adaptation) · Qwen3-30B-A3B
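To see why parameter-free zero-output experts can cut expert FLOPs, consider a toy router that ranks real and zero experts together. This is a sketch under assumptions, not ZEDA's routing or its self-distillation procedure: the function `select_experts`, the top-k rule, and the logits are invented for illustration.

```python
def select_experts(router_logits, num_real_experts, top_k):
    """Pick the top-k slots, then keep only those that map to real experts.

    Slots with index >= num_real_experts are zero-output experts: they
    contribute nothing to the layer output, so they cost no expert compute.
    """
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    return [i for i in ranked[:top_k] if i < num_real_experts]

# Two tokens, four real experts (slots 0-3) and two zero experts (slots 4-5).
token_logits = [
    [3.0, 2.0, 1.0, 0.0, 5.0, 4.0],    # router prefers both zero experts
    [0.0, 1.0, 2.0, 3.0, -1.0, -2.0],  # router prefers real experts 3 and 2
]
NUM_REAL, TOP_K = 4, 2

executed = [select_experts(logits, NUM_REAL, TOP_K) for logits in token_logits]
skipped_fraction = 1 - sum(len(e) for e in executed) / (len(token_logits) * TOP_K)
```

In this toy case the first token runs no real expert and the second runs two, so half of the expert calls are skipped. How often a trained router actually picks zero experts depends on the adaptation stage, which the sketch does not model.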

4 · Hugging Face Blog · 1mo ago

Mixture of Experts (MoEs) in Transformers

A Hugging Face blog post covering Mixture of Experts (MoE) architectures as applied to transformer models. The post likely explains the technical foundations, training considerations, and practical deployment aspects of MoE models. Given the timing in early 2026, it likely contextualizes recent MoE-based frontier models and tooling support within the Hugging Face ecosystem.

Training Infrastructure · Frontier Model Releases · Transformers · Mixture of Experts · Hugging Face

5 · Hugging Face Blog · 1mo ago

EMO: Pretraining Mixture of Experts for Emergent Modularity

AllenAI introduces EMO, a pretraining approach for Mixture of Experts (MoE) models that aims to produce emergent modularity during training. The work explores how MoE architectures can develop specialized expert routing without explicit supervision. Published on the Hugging Face blog, this represents research-level work on improving MoE training dynamics and efficiency.

Training Infrastructure · Frontier Model Releases · AllenAI · Mixture of Experts · Hugging Face

5 · arXiv · cs.AI · 11d ago

Explainability pipeline reveals divergent cues used by deepfake speech detectors

Researchers propose an audio-native explainability pipeline that applies Integrated Gradients to time-aligned self-supervised representations to localize decision evidence in deepfake speech detectors. Applied to three WavLM-based detectors (AASIST, CA-MHFA, SLS) on the ASVspoof 5 benchmark, the method reveals that despite similar performance, each detector relies on fundamentally different cues: environmental noise, phoneme artifacts, and word boundaries, respectively. The findings are validated by causal masking experiments, which confirm that performance degrades when a detector's primary cues are removed. The work advances interpretability of audio deepfake detection, relevant to AI safety and media authenticity.

Evaluation and Benchmarking · AI Safety Research · CA-MHFA · Integrated Gradients · SLS
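For readers unfamiliar with Integrated Gradients, the sketch below approximates it with a midpoint Riemann sum on a toy function whose gradient is known in closed form. It illustrates the general attribution technique only. The function `integrated_gradients`, the toy model `f`, and the all-zero baseline are assumptions and have nothing to do with the detectors or representations studied in the paper.

```python
def integrated_gradients(grad_fn, x, baseline, steps=50):
    """Approximate IG_i = (x_i - b_i) * integral over alpha in [0, 1] of
    df/dx_i evaluated at b + alpha * (x - b), using a midpoint Riemann sum."""
    avg_grads = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(grad_fn(point)):
            avg_grads[i] += g / steps
    return [(xi - b) * g for xi, b, g in zip(x, baseline, avg_grads)]

def f(x):
    # Toy "model": sum of squared inputs.
    return sum(v * v for v in x)

def grad_f(x):
    # Analytic gradient of f.
    return [2.0 * v for v in x]

x = [1.0, -2.0, 0.5]
baseline = [0.0, 0.0, 0.0]
attr = integrated_gradients(grad_f, x, baseline)
```

The attributions satisfy the completeness property: they sum to `f(x) - f(baseline)`. This property is what makes the scores interpretable as a decomposition of the model's output across input positions.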

6 · arXiv · cs.CL · 4d ago

Expert Tying reduces MoE LLM memory footprint by ~2x with minimal quality loss

Researchers introduce Expert Tying, an architectural modification for Mixture-of-Experts LLMs that shares expert parameters across consecutive transformer layers while each layer keeps its own routing and attention parameters. Evaluated on OLMoE, Qwen3, and DeepSeek-style MoE architectures, the method achieves nearly 2x memory reduction with negligible degradation in perplexity or downstream quality. The approach exploits parameter redundancy in MoE pathways to improve the compute-to-memory trade-off for training and inference.

Training Infrastructure · Frontier Model Releases · DeepSeek V4 · Tying the Loop -- Tied Expert Layers in Mixture-of-Experts Language Models · Expert Tying
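A back-of-the-envelope parameter count shows where a reduction of nearly 2x could come from when pairs of consecutive layers share one set of experts. All of the numbers below are hypothetical, as are the helper `moe_param_count`, the tie group size of 2, and the three-matrix expert shape. The count covers only experts and routers, so the effect on a whole model, which also has attention and embedding parameters, would be smaller.

```python
def moe_param_count(num_layers, num_experts, params_per_expert,
                    router_params, tie_group=1):
    """Count expert and router parameters across all MoE layers.

    tie_group consecutive layers share one set of experts; every layer
    keeps its own router.
    """
    expert_sets = -(-num_layers // tie_group)   # ceiling division
    return (expert_sets * num_experts * params_per_expert
            + num_layers * router_params)

# Hypothetical configuration, not taken from any evaluated model.
D_MODEL, D_FF, NUM_EXPERTS, NUM_LAYERS = 2048, 1024, 64, 16
params_per_expert = 3 * D_MODEL * D_FF    # assumes a gated three-matrix expert
router_params = D_MODEL * NUM_EXPERTS     # one linear router per layer

untied = moe_param_count(NUM_LAYERS, NUM_EXPERTS, params_per_expert,
                         router_params, tie_group=1)
tied = moe_param_count(NUM_LAYERS, NUM_EXPERTS, params_per_expert,
                       router_params, tie_group=2)
reduction = untied / tied
```

The ratio lands just under 2 because the per-layer routers are not shared. Expert weights dominate the count, so halving the number of distinct expert sets roughly halves the total.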

7 · arXiv · cs.CL · 24d ago

MobileMoE: Scaling Mixture-of-Experts for Sub-Billion Parameter On-Device Deployment

MobileMoE introduces a family of on-device MoE language models with 0.3–0.9B active parameters and 1.3–5.3B total parameters, targeting mobile deployment under memory and compute constraints. The authors derive an on-device MoE scaling law identifying a sweet spot of moderate sparsity with fine-grained and shared experts, then train models through a four-stage recipe including quantization-aware training on open-source data. Across 14 benchmarks, MobileMoE matches or exceeds leading dense on-device LLMs with 2–4× fewer inference FLOPs, and delivers 1.8–3.8× faster prefill and 2.2–3.4× faster decode than dense baselines on commodity smartphones at comparable INT4 memory.

Training Infrastructure · Frontier Model Releases · MobileLLM-Pro · OLMoE-1B-7B · INT4 Quantization