
Maximal Update Parameterization (μP)

technique · active · maximal-update-parameterization-p--0f99f296 · 2 events · first seen May 21, 2026

Aliases: Maximal Update Parameterization (μP), μP (Maximal Update Parametrization)

Co-occurring entities

Transformers · Mixture of Experts · SDE (Stochastic Differential Equation LR scaling) · Complete-muE · hyperparameter transfer · embedding layer learning rate · AdamW · Standard Parameterization (SP)

More like this (12)

Parameter-Efficient Fine-Tuning · Standard Parameterization (SP) · MMLU-Pro · Muon Optimizer · Max-Pooling · u-muP · Parameter Golf · posterior predictive variance minimization · Proximal Policy Optimization · MMMU · TPU-MLIR · MMLU

Recent events (2)

7 · arXiv · cs.LG · May 25, 2026

Complete-muE: Optimal Hyperparameter Transfer and Scaling for MoE Models

Complete-muE is a framework for transferring hyperparameters across dense FFN and Mixture-of-Experts (MoE) transformer architectures. It addresses a limitation of existing tools such as μP and SDE scaling, which cannot handle simultaneous changes in architecture and in tokens per expert. It uses a two-bridge system: Bridge I maps dense FFN to Dense MoE via active-width μP with a normalized router scale, and Bridge II maps Dense MoE to sparse MoE via activated-expert scaling with a first-order SDE correction. The practical outcome is a 'tune dense once, transfer to all' recipe that enables near-optimal hyperparameter reuse across MoE configurations without costly re-tuning. Experiments on language model and diffusion model pretraining confirm that hyperparameter optima remain stable across architectures and parameter counts.

Training Infrastructure · Frontier Model Releases · Transformers · Mixture of Experts · SDE (Stochastic Differential Equation LR scaling)

7 · arXiv · cs.AI · May 21, 2026

Quantifying Hyperparameter Transfer: Embedding Layer Learning Rate as Key Driver of μP Benefits

This paper develops a three-metric framework to quantify hyperparameter transfer quality across model scales, targeting the problem of extrapolating optimal hyperparameters from small to large LLMs. The central empirical finding is that the well-known advantage of Maximal Update Parameterization (μP) over standard parameterization (SP) with AdamW largely reduces to a single factor: the embedding layer learning rate. In SP, the embedding layer acts as a training bottleneck that causes instabilities; scaling its learning rate by model width to match μP substantially stabilizes training and improves transfer. The paper also shows that weight decay pushes scaling-law fit quality and extrapolation robustness in opposite directions.
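The embedding learning rate point can be illustrated with a small sketch. It is not the paper's code. It assumes the commonly cited Adam-style μP rule in which the embedding learning rate stays fixed while hidden-matrix learning rates shrink as 1/width, and it assumes that the tuned global SP learning rate also shrinks as 1/width. The names `GroupLRs`, `mup_adam_lrs`, `sp_lrs`, and `sp_with_embedding_fix` are hypothetical, and only two parameter groups are modeled.

```python
from dataclasses import dataclass


@dataclass
class GroupLRs:
    """Learning rates for two illustrative parameter groups."""
    embedding: float
    hidden: float


def mup_adam_lrs(base_lr: float, base_width: int, width: int) -> GroupLRs:
    # Assumed Adam-style muP rule: embedding LR is width-independent,
    # hidden-matrix LR shrinks with the width multiplier m = width / base_width.
    m = width / base_width
    return GroupLRs(embedding=base_lr, hidden=base_lr / m)


def sp_lrs(global_lr: float) -> GroupLRs:
    # Standard parameterization: one global LR shared by every group.
    return GroupLRs(embedding=global_lr, hidden=global_lr)


def sp_with_embedding_fix(global_lr: float, base_width: int, width: int) -> GroupLRs:
    # The fix as described in the summary: scale only the embedding LR by width.
    m = width / base_width
    return GroupLRs(embedding=global_lr * m, hidden=global_lr)


# Example: proxy model of width 256, target model of width 1024 (m = 4).
base_lr, base_width, width = 1e-2, 256, 1024
m = width / base_width

mup = mup_adam_lrs(base_lr, base_width, width)
# Assumption: the global SP learning rate was tuned down by 1/m at the larger width.
plain_sp = sp_lrs(base_lr / m)
fixed_sp = sp_with_embedding_fix(base_lr / m, base_width, width)

print("muP:     ", mup)
print("SP:      ", plain_sp)
print("SP + fix:", fixed_sp)
```

Under these assumptions, plain SP leaves the embedding learning rate a factor of m below the μP value, while the width-scaled variant matches μP in both groups. Real μP implementations also adjust initialization and output-layer scaling, which this sketch omits.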

Training Infrastructure · Frontier Model Releases · hyperparameter transfer · embedding layer learning rate · AdamW