6 · arXiv cs.CL (Computation and Language) · 11d ago

Causal audit finds routing statistics do not predict expert importance in MoE pruning

A new arXiv paper conducts a token-level interventional audit of Mixture-of-Experts (MoE) pruning heuristics across three architectures (OLMoE-1B-7B, Qwen1.5-MoE, DeepSeek-V2-Lite). It finds that no standard observational metric (utilization rate, activation norm, or routing-weight distribution) reliably predicts which experts can be removed without functional cost. Effect sizes fall below Cohen's d = 0.17 across all 60 metric-layer combinations, and only a single signal, at OLMoE's final layer, remains significant after multiple-comparison correction. The authors argue that existing pruning methods succeed not because they identify dispensable experts but because early-layer redundancy makes most selection criteria interchangeable. The work frames this as a concrete counterexample to the broader interpretability practice of treating associational (rung-1) evidence as if it supported interventional (rung-2) conclusions.
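To make the two statistics in the summary above concrete, here is a minimal Python sketch of a pooled-variance Cohen's d and a multiple-comparison filter. It is not the paper's code. The summary does not say which correction the authors used, so Bonferroni is an assumption here, and every number in the example is invented for illustration.

```python
import math
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)


def bonferroni_significant(p_values, alpha=0.05):
    """Flag tests that survive a Bonferroni correction (assumed method, see lead-in)."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]


# Hypothetical ablation damage for experts that a heuristic ranks
# "important" versus "dispensable"; these values are made up.
important = [2.0, 4.0, 6.0]
dispensable = [1.0, 3.0, 5.0]
d = cohens_d(important, dispensable)

# Hypothetical p-values for 60 metric-layer tests; invented, not from the paper.
p_values = [0.0001] + [0.01] * 59
survivors = bonferroni_significant(p_values)
print(f"d = {d:.2f}, significant after correction: {sum(survivors)} of {len(p_values)}")
```

With 60 tests the per-test threshold drops to 0.05 / 60, which is why a p-value of 0.01 that looks significant in isolation does not survive the correction.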

Evaluation and Benchmarking · Inference Economics · AI Safety Research · OLMoE-1B-7B-0924 · From Observation to Intervention: A Causal Audit of Expert Importance in Mixture-of-Experts Models · Qwen1.5-MoE-A2.7B · DeepSeek Coder V2 lite

Related guides (3)

AI Safety Research · Topic guide

AI Safety Research: From Lab Policies to Real-World Flashpoints


Inference Economics · Topic guide

Inference Economics: The Cost of Running AI in Production


Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder


Related events (8)

5 · arXiv · cs.CL · 17d ago

Expert-aware causal tracing of factual recall in sparse MoE language models

A new arXiv preprint extends causal tracing methodology to sparse mixture-of-experts (MoE) language models, asking which routed experts mediate factual recall rather than just which layers or feed-forward modules. Using CounterFact facts, the authors apply noise-corruption and clean-patch interventions to Qwen3-30B-A3B-Base and Mixtral-8x7B-v0.1, finding that expert-level localization is possible in the former (a single expert at layer 44) but requires multi-expert coalition recovery in the latter. The results indicate that factual localization in MoE models is model- and protocol-dependent rather than universal.

Evaluation and Benchmarking · AI Safety Research · Qwen3-30B-A3B-Base · Mixtral-8x7B-v0.1 · Expert-Aware Causal Tracing of Factual Recall in Sparse MoE Language Models

6 · arXiv · cs.CL · 4d ago

Expert Tying reduces MoE LLM memory footprint by ~2x with minimal quality loss

Researchers introduce Expert Tying, an architectural modification for Mixture-of-Experts LLMs that shares expert parameters across consecutive transformer layers while keeping routing and attention layer-independent. Evaluated on OLMoE, Qwen3, and DeepSeek-style MoE architectures, the method achieves nearly 2x memory reduction with negligible perplexity or downstream quality degradation. The approach exploits parameter redundancy in MoE pathways to improve the compute-to-memory trade-off for training and inference.
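The memory claim above can be sanity-checked with back-of-the-envelope arithmetic. The Python sketch below counts parameters when one set of experts is shared by a group of consecutive layers while routers and attention stay per-layer. The function name, the group size of 2, and all sizes are hypothetical and are not taken from the paper.

```python
def moe_param_count(n_layers, n_experts, expert_params, other_params_per_layer, tie_group=1):
    """Total parameters when every `tie_group` consecutive layers share one expert set.

    Routers and attention (folded into `other_params_per_layer`) are never shared.
    """
    n_expert_sets = -(-n_layers // tie_group)  # ceiling division
    return n_expert_sets * n_experts * expert_params + n_layers * other_params_per_layer


# Hypothetical model dimensions, chosen only to make the arithmetic easy to follow.
config = dict(n_layers=16, n_experts=64, expert_params=1_000_000, other_params_per_layer=4_000_000)

untied = moe_param_count(**config, tie_group=1)
tied = moe_param_count(**config, tie_group=2)  # assumed: pairs of layers share experts
print(f"untied={untied:,} tied={tied:,} reduction={untied / tied:.2f}x")
```

Under these made-up sizes the reduction lands a little under 2x, because the per-layer attention and router parameters are not shared. That is consistent with the "nearly 2x" wording, though the paper's actual tying scheme may differ.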

Training Infrastructure · Frontier Model Releases · DeepSeek V4 · Tying the Loop -- Tied Expert Layers in Mixture-of-Experts Language Models · Expert Tying

4 · arXiv · cs.AI · 47h ago

Calibrated Mixture-of-Experts under distribution shift: adversarial reweighting approach

A new arXiv preprint analyzes how mixture-of-experts (MoE) models maintain calibration under distribution shift, examining the interaction between routing mechanisms and expert-level calibration. The authors prove that expert calibration is sufficient for overall model calibration in hard-routed MoE but insufficient for soft-routed variants. To address the soft-routing gap, they propose an adversarial reweighting method that penalizes calibration errors of the routed aggregate under distribution shift, demonstrating improved accuracy-calibration tradeoffs across model classes and tasks.

Frontier Model Releases · Evaluation and Benchmarking · Toward Calibrated Mixture-of-Experts Under Distribution Shift

5 · Hugging Face Blog · 1mo ago

Mixture of Experts Explained

This Hugging Face blog post provides a technical overview of the Mixture of Experts (MoE) architecture, explaining how sparse gating mechanisms route tokens to subsets of expert feed-forward layers to achieve computational efficiency. The post covers training dynamics, inference considerations, and the tradeoffs between dense and sparse models. It serves as a reference document contextualizing MoE's growing relevance following high-profile model releases using the architecture.
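As a rough illustration of the sparse gating described above, the following Python sketch routes one token to its top-k experts and mixes their outputs. It is not code from the blog post. It shows one common variant (softmax over only the selected logits), and the toy experts and logits are hypothetical.

```python
import math


def top_k_route(router_logits, k=2):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    top = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)[:k]
    max_logit = max(router_logits[i] for i in top)  # subtract max for numerical stability
    exps = [math.exp(router_logits[i] - max_logit) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_forward(x, router_logits, experts, k=2):
    """Weighted sum of the selected experts' outputs; unselected experts are never called."""
    return sum(w * experts[i](x) for i, w in top_k_route(router_logits, k))


# Hypothetical toy experts: each one just scales its input by a fixed factor.
experts = [lambda x, s=s: x * s for s in (1.0, 10.0, 100.0, 1000.0)]
logits = [1.0, 3.0, 3.0, 0.0]  # made-up router scores for a single token

routes = top_k_route(logits, k=2)
output = moe_forward(1.0, logits, experts, k=2)
print(routes, output)
```

Only two of the four experts run for this token, which is where the compute saving relative to a dense layer of the same total size comes from.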

Training Infrastructure · Frontier Model Releases · Mixture of Experts · Hugging Face · sparse gating

5 · arXiv · cs.CL · 9d ago

Manifold Power Iteration redesigns MoE routers by aligning rows with expert singular directions

A new arXiv preprint proposes Manifold Power Iteration (MPI), a principled redesign of Mixture-of-Experts router matrices that aligns each router row with the principal singular direction of its associated expert. The method uses a 'Power-then-Retract' paradigm to enforce norm constraints while driving convergence toward these singular directions. Empirical validation spans MoE pretraining at scales from 1B to 11B parameters, showing improved model effectiveness.

Training Infrastructure · Frontier Model Releases · Redesign Mixture-of-Experts Routers with Manifold Power Iteration · Manifold Power Iteration

6 · arXiv · cs.CL · 1mo ago

ZEDA: Post-Trained MoE Models Can Skip Half Their Experts via Self-Distillation

This paper introduces Zero-Expert Self-Distillation Adaptation (ZEDA), a framework that converts static post-trained Mixture-of-Experts (MoE) language models into dynamic ones without pre-training from scratch. ZEDA injects parameter-free zero-output experts into each MoE layer and uses two-stage self-distillation with the original model as a frozen teacher. Applied to Qwen3-30B-A3B and GLM-4.7-Flash across 11 benchmarks, ZEDA eliminates over 50% of expert FLOPs with marginal accuracy loss and achieves approximately 1.20× end-to-end inference speedup, outperforming the strongest dynamic MoE baseline by 4–6 points.

Training Infrastructure · Frontier Model Releases · Self-Distillation · ZEDA (Zero-Expert Self-Distillation Adaptation) · Qwen3-30B-A3B

6 · Qwen Research · 1mo ago

Global-batch Load Balancing for MoE LLM Training from Qwen

Qwen Research introduces a global-batch load balancing technique for Mixture-of-Experts (MoE) LLM training, claiming it is nearly a 'free lunch' improvement. The method addresses expert load imbalance across training batches, a known efficiency and quality bottleneck in MoE architectures. The approach targets the router and expert activation dynamics in transformer-based MoE layers.

Training Infrastructure · Frontier Model Releases · Global-batch Load Balancing · Alibaba Qwen

5 · arXiv · cs.CL · 4d ago

Post-hoc falsification operators for frozen small code models fail to beat Best-of-N in leakage-free evaluation

A measurement study evaluates 26 post-hoc operators (selection, verification, repair, elimination, portfolios) applied to frozen small code models (≤1.5B parameters) against a Best-of-N baseline under a strict leakage-free, matched-compute protocol. None of the semantic operators improves held-out accuracy over BoN, with the failure traced to three structural mechanisms: a coverage wall, a capability scissors, and a near-empty consensus trap. Two non-semantic operators do provide value: an expression-layer recovery method (M1) lifts DeepSeek-Coder-1.3B by +12 tasks on HumanEval+ (p=2.4e-4), and an adaptive consensus early-stop saves ~19% compute with no accuracy harm. The paper's core lesson is that harness quality and coverage measurement should precede investment in semantic post-hoc reasoning.

Evaluation and Benchmarking · Inference Economics · Selection Without Signal, Recovery Through Expression: A Measurement Study of Post-Hoc Falsification Operators for Frozen Small Code Models · deepseek-coder · Best-of-N