Almanac
technique

AdaGrad

technique · active · provisional · adagrad-82f086c3 · 1 event · first seen 2d ago

Aliases: AdaGrad

Recent events (1)

6 · arXiv · cs.LG · 2d ago

Open problem paper questions whether AdamW converges under heavy-tailed gradient noise

An arXiv preprint poses as an open problem whether AdamW, the dominant optimizer for LLM pretraining, admits rigorous convergence guarantees under heavy-tailed stochastic gradient noise. The authors note that sign-based optimizers such as Lion and Muon already have sharp convergence rates in the heavy-tailed setting, whereas AdamW's second-moment accumulator may create a fundamental obstruction by hiding large gradients. The paper proves a positive weighted-metric benchmark and introduces a corridor lower-bound mechanism to characterize the potential failure mode.
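One way to read "hiding large gradients" is that an outlier gradient inflates the second-moment accumulator in the same step in which it enters the first moment, so the normalized step barely reflects the outlier's size. The sketch below illustrates that reading with a scalar Adam-style update (standard bias correction, no weight decay). It is not the paper's construction: the function names, the warm-up of unit gradients, and the spike sizes are all assumptions chosen for illustration.

```python
import math


def adam_normalized_updates(grads, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return the bias-corrected Adam step direction for each gradient.

    Each entry is m_hat / (sqrt(v_hat) + eps), i.e. the update before it is
    multiplied by the learning rate. Weight decay is deliberately omitted.
    """
    m = 0.0  # first-moment (momentum) accumulator
    v = 0.0  # second-moment accumulator
    updates = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        m_hat = m / (1.0 - beta1 ** t)
        v_hat = v / (1.0 - beta2 ** t)
        updates.append(m_hat / (math.sqrt(v_hat) + eps))
    return updates


def last_update_for_spike(spike, warmup=50):
    """Hypothetical helper: unit gradients for `warmup` steps, then one spike."""
    return adam_normalized_updates([1.0] * warmup + [spike])[-1]


# Illustrative spike sizes (assumed, not taken from the paper).
small = last_update_for_spike(1e3)
large = last_update_for_spike(1e6)
print(f"step after 1e3 spike: {small:.4f}")
print(f"step after 1e6 spike: {large:.4f}")
```

In this toy setting the two steps come out nearly identical even though the spikes differ by three orders of magnitude, which is the sense in which the accumulator can mask an outlier. Whether this effect amounts to a genuine obstruction to convergence is exactly the question the preprint leaves open.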