SDE (Stochastic Differential Equation LR scaling)

technique · active · provisional · sde-stochastic-differential-equation-lr-scaling--893d5f05 · 1 event · first seen 22d ago

Aliases: SDE (Stochastic Differential Equation LR scaling)

Recent events (1)

7 · arXiv · cs.LG · 22d ago

Complete-muE: Optimal Hyperparameter Transfer and Scaling for MoE Models

Complete-muE is a framework for transferring hyperparameters across dense FFN and Mixture-of-Experts (MoE) transformer architectures. It addresses a limitation of existing tools such as μP and SDE-based LR scaling, which cannot handle simultaneous changes in architecture and tokens per expert. The framework uses two bridges: Bridge I maps a dense FFN to a Dense MoE via active-width μP with a normalized router scale, and Bridge II maps the Dense MoE to a sparse MoE via activated-expert scaling with a first-order SDE correction. The practical outcome is a 'tune dense once, transfer to all' recipe that allows near-optimal hyperparameter reuse across MoE configurations without costly re-tuning. Experiments on language-model and diffusion-model pretraining show that hyperparameter optima remain stable across architectures and parameter counts.
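The summary above does not spell out the SDE correction, so the following is only a rough sketch of the general idea behind SDE-based LR scaling, not the paper's Bridge II formula. It assumes the commonly cited batch-size scaling rules (linear for SGD, square-root for Adam-style optimizers) and assumes perfectly balanced routing, so that each expert sees `batch_tokens * top_k / num_experts` tokens per step. The function names `tokens_per_expert` and `sde_scaled_lr` and all numeric values are hypothetical.

```python
import math


def tokens_per_expert(batch_tokens: int, num_experts: int, top_k: int) -> float:
    """Average tokens each expert sees per step.

    Assumption: routing is perfectly balanced across experts.
    """
    return batch_tokens * top_k / num_experts


def sde_scaled_lr(base_lr: float, base_batch: float, new_batch: float,
                  optimizer: str = "adam") -> float:
    """Rescale a tuned learning rate when the effective batch size changes.

    Assumption: generic SDE-motivated scaling rules, where
    kappa = new_batch / base_batch and
      - SGD uses the linear rule:       lr * kappa
      - Adam uses the square-root rule: lr * sqrt(kappa)
    This is illustrative and is not claimed to match Complete-muE.
    """
    kappa = new_batch / base_batch
    if optimizer == "sgd":
        return base_lr * kappa
    if optimizer == "adam":
        return base_lr * math.sqrt(kappa)
    raise ValueError(f"unsupported optimizer: {optimizer!r}")


# Hypothetical setup: LR tuned on a dense model, reused for a sparse MoE.
dense_tokens = tokens_per_expert(batch_tokens=65536, num_experts=1, top_k=1)
sparse_tokens = tokens_per_expert(batch_tokens=65536, num_experts=64, top_k=4)
expert_lr = sde_scaled_lr(3e-4, dense_tokens, sparse_tokens, optimizer="adam")
```

In this hypothetical setup each expert sees 1/16 of the tokens a dense FFN would, so the square-root rule shrinks the Adam learning rate by a factor of 4. A full treatment of Adam would likely also adjust the momentum and epsilon hyperparameters, which this sketch leaves out.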