Entity · technique

SDE (Stochastic Differential Equation LR scaling)

technique · active · sde-stochastic-differential-equation-lr-scaling--893d5f05 · 1 event · first seen May 25, 2026

Aliases: SDE (Stochastic Differential Equation LR scaling)

Co-occurring entities

Transformers · Mixture of Experts · Complete-muE · Maximal Update Parameterization (μP)

More like this (12)

Ornstein-Uhlenbeck stochastic differential equation · stochastic-deterministic boundary (SDB) · Neural Ordinary Differential Equations · Conditional Scale Entropy · DeepScaleR · stochastic gradient ascent · ESM (Evolutionary Scale Modeling) · Survival Diffusion Probabilistic Model (SDPM) · gradient noise scale · Will Scaling Improve Social Simulation with LLMs? · birth-death Langevin dynamics · discrete diffusion models

Recent events (1)

7 · arXiv · cs.LG · May 25, 2026

Complete-muE: Optimal Hyperparameter Transfer and Scaling for MoE Models

Complete-muE is a framework for transferring hyperparameters across dense FFN and Mixture-of-Experts (MoE) transformer architectures. It addresses a limitation of existing tools such as μP and SDE, which cannot handle simultaneous changes in architecture and tokens per expert. It uses a two-bridge system: Bridge I maps dense FFN to Dense MoE via active-width μP with a normalized router scale, and Bridge II maps Dense MoE to sparse MoE via activated-expert scaling with a first-order SDE correction. The practical outcome is a "tune dense once, transfer to all" recipe that enables near-optimal hyperparameter reuse across MoE configurations without costly re-tuning. Experiments on language model and diffusion model pretraining confirm that hyperparameter optima remain stable across architectures and parameter counts.
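For orientation, the two generic ingredients named in the summary can be sketched in a few lines. This is a minimal illustration and not the Complete-muE recipe itself: the summary gives no formulas, so it assumes the standard μP rule for Adam (hidden-matrix learning rate proportional to 1/width) and the square-root batch-size rule from the SDE analysis of Adam-like optimizers. All function names, the uniform-routing assumption, and the numbers are hypothetical.

```python
import math


def mup_hidden_lr(base_lr: float, base_width: int, width: int) -> float:
    """Standard muP rule for Adam: hidden-matrix LR scales as 1/width.

    Assumption: the paper's "active-width" variant may define width differently.
    """
    return base_lr * base_width / width


def tokens_per_expert(batch_tokens: int, num_experts: int, top_k: int) -> float:
    """Average tokens seen by one expert per step, assuming uniform routing."""
    return batch_tokens * top_k / num_experts


def sde_sqrt_lr(base_lr: float, base_batch: float, batch: float) -> float:
    """Square-root scaling rule for Adam-like optimizers from SDE analysis.

    Assumption: this is used only as a stand-in for the paper's
    "first-order SDE correction", whose exact form the summary does not state.
    """
    return base_lr * math.sqrt(batch / base_batch)


# Hypothetical dense baseline: LR tuned at width 256 with 65,536 tokens per step.
dense_lr = 1e-2
wide_lr = mup_hidden_lr(dense_lr, base_width=256, width=1024)

# Hypothetical sparse MoE: 8 experts, top-2 routing, same global batch.
per_expert = tokens_per_expert(65_536, num_experts=8, top_k=2)
expert_lr = sde_sqrt_lr(dense_lr, base_batch=65_536, batch=per_expert)

print(wide_lr, per_expert, expert_lr)
```

The point of the sketch is only that width changes and tokens-per-expert changes call for different scaling rules. According to the summary, neither rule alone covers a simultaneous change in both, which is the gap Complete-muE targets.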

Training Infrastructure · Frontier Model Releases · Transformers · Mixture of Experts · SDE (Stochastic Differential Equation LR scaling) · +3 more