One-Step Gradient Delay is Not a Barrier for Large-Scale Asynchronous Pipeline Parallel LLM Pretraining

paper · active · provisional · one-step-gradient-delay-is-not-a-barrier-for-large-scale-asynchronous-pipeline-parallel-llm-pretraining-6aaa2ed3 · 1 event · first seen 12h ago



Recent events (1)

arXiv · cs.LG · 12h ago

Asynchronous pipeline parallelism for LLM pretraining made viable with Muon optimizer and error feedback correction

A new arXiv paper challenges the assumption that gradient staleness makes asynchronous pipeline parallelism (specifically PipeDream-2BW) fundamentally unstable, showing that the degradation depends on the optimizer rather than being intrinsic to the delay. The authors demonstrate that the Muon optimizer remains robust under one-step gradient delay where AdamW fails, and they introduce an optimizer-agnostic Error Feedback correction that further closes the gap with synchronous training. Experiments on models of up to 10B parameters show that the approach matches synchronous training performance, which could unlock higher GPU utilization by eliminating pipeline bubbles.
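
To make "one-step gradient delay" concrete, here is a minimal toy sketch in plain Python. It is not the paper's method: it uses ordinary gradient descent on a one-dimensional quadratic, and the function name `delayed_sgd`, the objective, and the learning rate are all illustrative assumptions. It shows only the update pattern in which the weights at step t are updated with a gradient evaluated at the weights from step t-1.

```python
def delayed_sgd(grad, w0, lr, steps, delay=1):
    """Gradient descent where each update uses a gradient evaluated at the
    weights from `delay` steps earlier (delay=0 is ordinary gradient descent).

    Hypothetical helper for illustration; not taken from the paper.
    """
    history = [w0]  # history[t] holds the weights at step t
    for t in range(steps):
        # Early steps have no older weights yet, so fall back to the initial ones.
        stale_w = history[max(t - delay, 0)]
        history.append(history[-1] - lr * grad(stale_w))
    return history


def grad(w):
    """Gradient of the toy objective f(w) = 0.5 * w**2 (an assumption)."""
    return w


sync = delayed_sgd(grad, w0=1.0, lr=0.5, steps=20, delay=0)   # fresh gradients
stale = delayed_sgd(grad, w0=1.0, lr=0.5, steps=20, delay=1)  # one-step delay

print(abs(sync[-1]), abs(stale[-1]))
```

On this toy problem the delayed run still converges, but it oscillates around the minimum and ends farther from it than the synchronous run after the same number of steps. How the delay interacts with AdamW, Muon, or the Error Feedback correction is not modeled here.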