5 · arXiv cs.CL (Computation and Language) · 9d ago

Manifold Power Iteration redesigns MoE routers by aligning rows with expert singular directions

A new arXiv preprint proposes Manifold Power Iteration (MPI), a redesign of Mixture-of-Experts (MoE) router matrices that aligns each router row with the principal singular direction of its associated expert. The method follows a 'Power-then-Retract' paradigm, which enforces norm constraints on the router rows while driving them to converge toward these singular directions. The authors validate the method on MoE pretraining at scales from 1B to 11B parameters and report improved model effectiveness.
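
The 'Power-then-Retract' idea can be illustrated with classical power iteration on the unit sphere. The sketch below is not the preprint's implementation: it assumes each expert is summarized by a single weight matrix `W`, that the target is the top right singular vector of `W`, and that the norm constraint is unit length. The function name `power_then_retract` and all shapes are hypothetical.

```python
import numpy as np

def power_then_retract(W, r, steps=50):
    """Drive router row r toward the top right singular vector of W.

    W: (d_out, d_model) stand-in for an expert weight matrix (assumption).
    r: (d_model,) router row associated with that expert.
    """
    for _ in range(steps):
        r = W.T @ (W @ r)           # power step: apply W^T W
        r = r / np.linalg.norm(r)   # retract: renormalize onto the unit sphere
    return r

# Toy expert with a known, well-separated spectrum so convergence is fast.
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((64, 32)))
V, _ = np.linalg.qr(rng.standard_normal((32, 32)))
s = 2.0 * 0.8 ** np.arange(32)      # singular values, largest first
W = U @ np.diag(s) @ V.T

r = power_then_retract(W, rng.standard_normal(32))

# Reference direction from a full SVD (the sign is arbitrary, hence abs).
v1 = np.linalg.svd(W)[2][0]
alignment = abs(float(r @ v1))
print(f"|cos(r, v1)| = {alignment:.6f}")
```

In a real training loop the expert weights change every step, so a method of this kind would presumably interleave a few such iterations with optimizer updates rather than run them to convergence; that detail is not stated in the summary.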

Training Infrastructure Frontier Model Releases Redesign Mixture-of-Experts Routers with Manifold Power Iteration Manifold Power Iteration

Related guides (2)

Frontier Model Releases · Topic guide

Frontier Model Releases: The Race From Language to Action

Training Infrastructure · Topic guide

Training Infrastructure: The Compute Arms Race Powering Modern AI

Related events (8)

7 · arXiv · cs.LG · 26d ago

Complete-muE: Optimal Hyperparameter Transfer and Scaling for MoE Models

Complete-muE is a framework for transferring hyperparameters across dense FFN and Mixture-of-Experts (MoE) transformer architectures, addressing limitations of existing tools like μP and SDE that cannot handle simultaneous architecture and token-per-expert changes. It uses a two-bridge system: Bridge I maps dense FFN to Dense MoE via active-width μP with normalized router scale, and Bridge II maps Dense MoE to sparse MoE via activated-expert scaling with a first-order SDE correction. The practical outcome is a 'tune dense once, transfer to all' recipe that enables near-optimal hyperparameter reuse across MoE configurations without costly re-tuning. Experiments on language model and diffusion model pretraining confirm stable hyperparameter optima across architectures and parameter counts.

Training Infrastructure Frontier Model Releases Transformers Mixture of Experts SDE (Stochastic Differential Equation LR scaling)+3 more

6 · arXiv · cs.CL · 4d ago

Expert Tying reduces MoE LLM memory footprint by ~2x with minimal quality loss

Researchers introduce Expert Tying, an architectural modification for Mixture-of-Experts LLMs that shares expert parameters across consecutive transformer layers while keeping routing and attention layer-independent. Evaluated on OLMoE, Qwen3, and DeepSeek-style MoE architectures, the method achieves nearly 2x memory reduction with negligible perplexity or downstream quality degradation. The approach exploits parameter redundancy in MoE pathways to improve the compute-to-memory trade-off for training and inference.
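
A back-of-the-envelope parameter count shows why tying experts across consecutive layers can approach a 2x reduction. The sketch below assumes experts are tied in groups of two layers, that each expert is a three-matrix gated FFN, and that attention plus router weights stay per-layer. All sizes and the helper `moe_param_count` are hypothetical and not taken from the paper.

```python
def moe_param_count(n_layers, n_experts, expert_params,
                    other_params_per_layer, tie_group=1):
    """Total parameters when experts are shared across `tie_group` layers."""
    n_banks = -(-n_layers // tie_group)   # ceil division: distinct expert banks
    expert_total = n_banks * n_experts * expert_params
    other_total = n_layers * other_params_per_layer  # attention + router, untied
    return expert_total + other_total

# Hypothetical configuration, chosen only for illustration.
d_model, d_ff, n_layers, n_experts = 2048, 1024, 16, 64
expert_params = 3 * d_model * d_ff                        # gated FFN: 3 matrices
other_params = 4 * d_model * d_model + n_experts * d_model  # attention + router

untied = moe_param_count(n_layers, n_experts, expert_params, other_params)
tied = moe_param_count(n_layers, n_experts, expert_params, other_params,
                       tie_group=2)
ratio = untied / tied
print(f"untied={untied:,} tied={tied:,} ratio={ratio:.2f}x")
```

Because the untied attention and router weights are a small share of the total in this toy setup, the ratio lands just under 2x, which is consistent with the "nearly 2x" wording in the summary.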

Training Infrastructure Frontier Model Releases DeepSeek V4 Tying the Loop -- Tied Expert Layers in Mixture-of-Experts Language Models Expert Tying +3 more

5 · Hugging Face Blog · 1mo ago

EMO: Pretraining Mixture of Experts for Emergent Modularity

AllenAI introduces EMO, a pretraining approach for Mixture of Experts (MoE) models that aims to produce emergent modularity during training. The work explores how MoE architectures can develop specialized expert routing without explicit supervision. The post, published on the Hugging Face blog, presents research-level work on improving MoE training dynamics and efficiency.

Training Infrastructure Frontier Model Releases AllenAI Mixture of Experts Hugging Face +2 more

6 · Qwen Research · 1mo ago

Global-batch Load Balancing for MoE LLM Training from Qwen

Qwen Research introduces a global-batch load balancing technique for Mixture-of-Experts (MoE) LLM training, claiming it is nearly a 'free lunch' improvement. The method addresses expert load imbalance across training batches, a known efficiency and quality bottleneck in MoE architectures. The approach targets the router and expert activation dynamics in transformer-based MoE layers.
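
The difference between per-micro-batch and global-batch balancing can be seen in a toy calculation. The sketch below assumes a Switch-Transformer-style auxiliary loss (number of experts times the dot product of routed-token fractions and mean router probabilities) with top-1 routing; the summary does not state the exact formulation, so treat the loss form and the helper `load_balance_loss` as assumptions.

```python
import numpy as np

def load_balance_loss(probs):
    """Auxiliary balance loss for top-1 routing (assumed formulation).

    probs: (tokens, n_experts) router probabilities.
    """
    n_tokens, n_experts = probs.shape
    assigned = probs.argmax(axis=1)                        # expert chosen per token
    f = np.bincount(assigned, minlength=n_experts) / n_tokens  # routed fraction
    p = probs.mean(axis=0)                                 # mean router probability
    return n_experts * float(f @ p)

# Two micro-batches from different "domains": each prefers a different expert.
mb_a = np.tile([0.9, 0.1], (8, 1))
mb_b = np.tile([0.1, 0.9], (8, 1))

# Balancing inside each micro-batch penalizes this specialization...
micro_loss = float(np.mean([load_balance_loss(mb_a), load_balance_loss(mb_b)]))
# ...while statistics pooled over the global batch see a balanced load.
global_loss = load_balance_loss(np.concatenate([mb_a, mb_b]))
print(f"micro-batch loss={micro_loss:.2f}, global-batch loss={global_loss:.2f}")
```

Under these assumptions the per-micro-batch loss pushes against domain-level specialization even though the experts are evenly used overall, while the pooled statistic does not.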

Training Infrastructure Frontier Model Releases Global-batch Load Balancing Alibaba Qwen +1 more

5 · Hugging Face Blog · 1mo ago

Mixture of Experts Explained

This Hugging Face blog post provides a technical overview of the Mixture of Experts (MoE) architecture, explaining how sparse gating mechanisms route tokens to subsets of expert feed-forward layers to achieve computational efficiency. The post covers training dynamics, inference considerations, and the tradeoffs between dense and sparse models. It serves as a reference document contextualizing MoE's growing relevance following high-profile model releases using the architecture.
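
Sparse gating as described above can be sketched in a few lines of NumPy. This is a minimal illustration, not code from the blog post: the experts are plain linear maps, the softmax is taken over the selected experts only, and the names `top_k_route` and `moe_forward` are hypothetical.

```python
import numpy as np

def top_k_route(x, router_w, k=2):
    """Pick the k best experts per token and their mixing weights.

    x: (tokens, d_model); router_w: (n_experts, d_model).
    """
    logits = x @ router_w.T                               # (tokens, n_experts)
    top_idx = np.argsort(-logits, axis=-1)[:, :k]         # highest-scoring experts
    top_logits = np.take_along_axis(logits, top_idx, axis=-1)
    top_logits = top_logits - top_logits.max(axis=-1, keepdims=True)
    weights = np.exp(top_logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over top-k
    return top_idx, weights

def moe_forward(x, router_w, experts, k=2):
    """Weighted sum of the selected experts' outputs; other experts are skipped."""
    top_idx, weights = top_k_route(x, router_w, k)
    out = np.zeros_like(x)
    for e, W in enumerate(experts):
        token_ids, slot = np.nonzero(top_idx == e)         # tokens routed to expert e
        if token_ids.size:
            gate = weights[token_ids, slot][:, None]
            out[token_ids] += gate * (x[token_ids] @ W)
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 8))                            # 5 tokens, d_model = 8
router_w = rng.standard_normal((4, 8))                     # 4 experts
experts = [rng.standard_normal((8, 8)) for _ in range(4)]  # toy linear experts
top_idx, weights = top_k_route(x, router_w, k=2)
y = moe_forward(x, router_w, experts, k=2)
```

Each token touches only `k` of the experts, which is where the compute saving relative to a dense layer of the same total size comes from.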

Training Infrastructure Frontier Model Releases Mixture of Experts Hugging Face sparse gating +1 more

4 · arXiv · cs.AI · 46h ago

Calibrated Mixture-of-Experts under distribution shift: adversarial reweighting approach

A new arXiv preprint analyzes how mixture-of-experts (MoE) models maintain calibration under distribution shift, examining the interaction between routing mechanisms and expert-level calibration. The authors prove that expert calibration is sufficient for overall model calibration in hard-routed MoE but insufficient for soft-routed variants. To address the soft-routing gap, they propose an adversarial reweighting method that penalizes calibration errors of the routed aggregate under distribution shift, demonstrating improved accuracy-calibration tradeoffs across model classes and tasks.

Frontier Model Releases Evaluation and Benchmarking Toward Calibrated Mixture-of-Experts Under Distribution Shift +1 more

6 · arXiv · cs.CL · 11d ago

Causal audit finds routing statistics do not predict expert importance in MoE pruning

A new arXiv paper conducts a token-level interventional audit of Mixture-of-Experts (MoE) pruning heuristics across three architectures (OLMoE-1B-7B, Qwen1.5-MoE, DeepSeek-V2-Lite). It finds that no standard observational metric (utilization rates, activation norms, routing weight distributions) reliably predicts which experts can be removed without functional cost. Effect sizes fall below Cohen's d = 0.17 across all 60 metric-layer combinations after multiple-comparison correction, with only a single significant signal at OLMoE's final layer. The authors argue that existing pruning methods succeed not because they identify dispensable experts but because early-layer redundancy makes most selection criteria interchangeable. The work frames this as a concrete counterexample to the broader interpretability practice of treating associational (rung-1) evidence as interventional (rung-2) conclusions.
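
For readers unfamiliar with the effect-size measure, Cohen's d is the difference between two group means divided by their pooled standard deviation. The sketch below uses the common pooled-variance form; the paper may use a different variant, and the function name `cohens_d` and the sample values are illustrative only.

```python
import numpy as np

def cohens_d(a, b):
    """Standardized mean difference with pooled sample variance."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    na, nb = a.size, b.size
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return float((a.mean() - b.mean()) / np.sqrt(pooled_var))

# Toy check: means differ by one pooled standard deviation.
d = cohens_d([1, 2, 3], [2, 3, 4])
print(d)  # -1.0
```

By the usual rule of thumb, a magnitude of 0.2 already counts as a small effect, so values below 0.17 indicate that the metrics barely separate important experts from dispensable ones.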

Evaluation and Benchmarking Inference Economics OLMoE-1B-7B-0924 From Observation to Intervention: A Causal Audit of Expert Importance in Mixture-of-Experts Models Qwen1.5-MoE-A2.7B +2 more

5 · arXiv · cs.CL · 18d ago

CRAM: Centroid-Routing and Adaptive MoE for Multimodal Continual Instruction Tuning

CRAM is a new method for Multimodal Continual Instruction Tuning (MCIT) that addresses the tension between catastrophic forgetting and parameter efficiency in multimodal large language models (MLLMs). It combines three components: adaptive-rank instantiation, which dynamically allocates parameters based on capability gaps; centroid-guided routing, which reuses existing expert knowledge; and an orthogonality penalty, which confines new updates to task-specific directions. The approach uses a Mixture-of-Experts architecture where task-specific patterns are isolated into independent modules, avoiding both the interference of shared updates and the parameter bloat of fully isolated expansion. Experiments across diverse benchmarks show consistent improvements over existing MCIT methods.

Enterprise Deployment Patterns Agent and Tool Ecosystem Multimodal Large Language Models CRAM centroid-guided routing +4 more