Entity · technique

hyperparameter transfer

technique · active · hyperparameter-transfer-4129c561 · 1 event · first seen May 21, 2026

Aliases: hyperparameter transfer

Co-occurring entities

embedding layer learning rate · AdamW · Standard Parameterization (SP) · Maximal Update Parameterization (μP)

More like this (12)

Parameter-Efficient Fine-Tuning · Cross-Domain Transfer · meta-learning · supervised fine-tuning · parameter noise · sim-to-real transfer · model calibration · large neural network training · instruction tuning · weak-to-strong generalization · Max-Pooling · Parameter Golf

Recent events (1)

7 · arXiv · cs.AI · May 21, 2026

Quantifying Hyperparameter Transfer: Embedding Layer Learning Rate as Key Driver of μP Benefits

This paper develops a three-metric framework to quantify hyperparameter transfer quality across model scales, targeting the problem of extrapolating optimal hyperparameters from small to large LLMs. The central empirical finding is that the well-known advantage of Maximal Update Parameterization (μP) over Standard Parameterization (SP) with AdamW largely reduces to a single factor: the embedding layer learning rate. In SP, the embedding layer acts as a training bottleneck that causes instabilities; scaling its learning rate by model width to match μP substantially stabilizes training and improves transfer. The paper also finds that weight decay pushes scaling-law fit quality and extrapolation robustness in opposite directions.
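As a rough illustration of the embedding learning rate adjustment described above, the sketch below computes per-group learning rates for an SP model. It assumes that "scaling by model width" means multiplying the embedding learning rate by the ratio of the target width to a small reference width; that reading, the function name `sp_learning_rates`, and the `base_width` parameter are all assumptions made for illustration and are not taken from the paper.

```python
def sp_learning_rates(global_lr: float, width: int, base_width: int) -> dict:
    """Per-group learning rates for a hypothetical SP model.

    Assumption (not from the paper): the embedding learning rate is
    multiplied by width / base_width, while every other parameter group
    keeps the single global learning rate tuned at base_width.
    """
    if width <= 0 or base_width <= 0:
        raise ValueError("width and base_width must be positive")
    width_mult = width / base_width
    return {
        "embedding": global_lr * width_mult,  # scaled with model width
        "other": global_lr,                   # unchanged global rate
    }


# Hypothetical example: rate tuned at width 256, model widened to 1024.
rates = sp_learning_rates(global_lr=1e-3, width=1024, base_width=256)
```

In practice, values like these could be supplied as separate parameter groups to an AdamW implementation that supports per-group learning rates; the exact scaling rule should be checked against the paper itself.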

Training Infrastructure · Frontier Model Releases · hyperparameter transfer · embedding layer learning rate · AdamW · +2 more