SDE (Stochastic Differential Equation LR scaling)
sde-stochastic-differential-equation-lr-scaling--893d5f05 · 1 event · first seen 22d ago
Aliases: SDE (Stochastic Differential Equation LR scaling)
Recent events (1)
Complete-muE: Optimal Hyperparameter Transfer and Scaling for MoE Models
Complete-muE is a framework for transferring hyperparameters across dense FFN and Mixture-of-Experts (MoE) transformer architectures. It addresses a limitation of existing tools such as μP and SDE-based scaling, which cannot handle simultaneous changes in architecture and tokens per expert. It uses two bridges: Bridge I maps a dense FFN to a Dense MoE via active-width μP with a normalized router scale, and Bridge II maps the Dense MoE to a sparse MoE via activated-expert scaling with a first-order SDE correction. The practical outcome is a 'tune dense once, transfer to all' recipe that lets hyperparameters tuned on the dense model be reused near-optimally across MoE configurations without costly re-tuning. Experiments on language-model and diffusion-model pretraining confirm that hyperparameter optima remain stable across architectures and parameter counts.
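To make the "tokens per expert" issue concrete, here is a minimal Python sketch of generic SDE-style learning-rate scaling. It is not the Complete-muE formula, which the summary above does not give. It assumes perfectly balanced routing and uses the commonly cited batch-size rules (linear for SGD, square root for Adam) as a stand-in. The function names `tokens_per_expert` and `sde_scaled_lr` are hypothetical, chosen for illustration only.

```python
import math


def tokens_per_expert(batch_tokens: int, num_experts: int, top_k: int) -> float:
    """Average tokens each expert sees per step.

    Assumption: routing is perfectly balanced, so every expert receives
    an equal share of the batch_tokens * top_k routed token slots.
    """
    return batch_tokens * top_k / num_experts


def sde_scaled_lr(base_lr: float, base_batch: float, new_batch: float,
                  optimizer: str = "adam") -> float:
    """Rescale a learning rate when the effective batch size changes.

    Assumption: the generic SDE-motivated rules, linear in the batch
    ratio for SGD and square root of the batch ratio for Adam. This is
    an illustrative stand-in, not the correction used by Complete-muE.
    """
    ratio = new_batch / base_batch
    if optimizer == "sgd":
        return base_lr * ratio
    if optimizer == "adam":
        return base_lr * math.sqrt(ratio)
    raise ValueError(f"unsupported optimizer: {optimizer!r}")


# A dense FFN sees every token in the batch; a sparse expert sees only its share.
dense_tokens = 65536
expert_tokens = tokens_per_expert(dense_tokens, num_experts=64, top_k=4)
expert_lr = sde_scaled_lr(1e-3, dense_tokens, expert_tokens, optimizer="adam")
print(expert_tokens, expert_lr)
```

With 64 experts and top-4 routing, each expert sees 1/16 of the tokens a dense FFN would, so under the square-root rule the Adam learning rate is scaled by 1/4. The sketch changes only the effective batch; it says nothing about the architecture change that Bridge I is described as handling.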