hyperparameter transfer
Recent events (1)
Quantifying Hyperparameter Transfer: Embedding Layer Learning Rate as Key Driver of μP Benefits
This paper develops a three-metric framework for quantifying how well hyperparameters transfer across model scales, with the goal of extrapolating optimal hyperparameters from small LLMs to large ones. Its central empirical finding is that the well-known advantage of Maximal Update Parameterization (μP) over standard parameterization (SP) under AdamW largely reduces to a single factor: the embedding layer learning rate. In SP, the embedding layer acts as a training bottleneck and causes instabilities; scaling its learning rate with model width to match μP substantially stabilizes training and improves transfer. The paper also characterizes the effect of weight decay, which moves scaling-law fit quality and extrapolation robustness in opposite directions.
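The embedding learning rate adjustment can be illustrated with a minimal sketch. The summary only says the embedding learning rate is scaled by model width; the sketch assumes this means multiplying the global SP learning rate by the ratio of the current width to a small reference width. The function name `per_param_lrs`, the parameter names, the `embed` prefix, and the numeric values are all hypothetical and not taken from the paper.

```python
def per_param_lrs(param_names, global_lr, width, base_width, embedding_prefix="embed"):
    """Assign a learning rate to each parameter name.

    Assumption (not confirmed by the source): the embedding learning rate is
    the global learning rate multiplied by width / base_width, and every
    other parameter keeps the global learning rate unchanged.
    """
    if width <= 0 or base_width <= 0:
        raise ValueError("width and base_width must be positive")

    ratio = width / base_width
    lrs = {}
    for name in param_names:
        if name.startswith(embedding_prefix):
            # Embedding layer: learning rate grows with model width.
            lrs[name] = global_lr * ratio
        else:
            # All other layers: plain SP, single global learning rate.
            lrs[name] = global_lr
    return lrs


# Hypothetical parameter names for a small transformer.
names = ["embed.weight", "blocks.0.attn.weight", "blocks.0.mlp.weight", "lm_head.weight"]
lrs = per_param_lrs(names, global_lr=1e-3, width=1024, base_width=256)
```

In a framework that supports per-parameter options, such as the parameter groups accepted by `torch.optim.AdamW`, a mapping like this could be turned into one group for the embedding and one for everything else.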