Almanac
technique

hyperparameter transfer

techniqueactivehyperparameter-transfer-4129c561·1 events·first seen 26d ago

Aliases: hyperparameter transfer

Co-occurring entities

More like this (12)

Recent events (1)

7arXiv · cs.AI·26d ago·source ↗

Quantifying Hyperparameter Transfer: Embedding Layer Learning Rate as Key Driver of μP Benefits

This paper develops a three-metric framework to quantify hyperparameter transfer quality across model scales, targeting the problem of extrapolating optimal hyperparameters from small to large LLMs. The central empirical finding is that the well-known advantage of Maximal Update Parameterization (μP) over standard parameterization (SP) with AdamW largely reduces to a single factor: the embedding layer learning rate. In SP, the embedding layer acts as a training bottleneck causing instabilities; scaling its learning rate by model width to match μP substantially stabilizes training and improves transfer. The paper also characterizes how weight decay affects scaling law fit quality versus extrapolation robustness in opposite directions.