paper
One-Step Gradient Delay is Not a Barrier for Large-Scale Asynchronous Pipeline Parallel LLM Pretraining
paper · active · provisional
one-step-gradient-delay-is-not-a-barrier-for-large-scale-asynchronous-pipeline-parallel-llm-pretraining-6aaa2ed3 · 1 event · first seen 12h ago
Aliases: One-Step Gradient Delay is Not a Barrier for Large-Scale Asynchronous Pipeline Parallel LLM Pretraining
More like this (12)
Continual LLM Upcycling: A Predictor-Gated Bank-Wise Sparsity Training Recipe for Dense-to-Sparse LLMs
PC Layer: Polynomial Weight Preconditioning for Improving LLM Pre-Training
Pretraining Recurrent Networks without Recurrence
Accelerated Decentralized Stochastic Gradient Descent for Strongly Convex Optimization
stochastic gradient ascent
temporally ordered pre-training
q0: Primitives for Hyper-Epoch Pretraining
Multi-Gossip Accelerated DSGD
large neural network training
FlashbackCL: Mitigating Temporal Forgetting in Federated Learning
Self-Supervised Pretraining
TailLoR: Protecting Principal Components in Parameter-Efficient Continual Learning
Recent events (1)
Asynchronous pipeline parallelism for LLM pretraining made viable with Muon optimizer and error feedback correction
A new arXiv paper challenges the assumption that gradient staleness in asynchronous pipeline parallelism (specifically PipeDream-2BW) inherently destabilizes training, showing that the degradation is optimizer-dependent rather than intrinsic. The authors demonstrate that the Muon optimizer is robust under one-step gradient delay where AdamW fails, and they introduce an optimizer-agnostic Error Feedback correction to further close the gap with synchronous training. Experiments on models up to 10B parameters confirm that the approach matches synchronous training performance, potentially unlocking higher GPU utilization by eliminating pipeline bubbles.
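
To make "one-step gradient delay" concrete, here is a minimal toy sketch, not the paper's method. It assumes plain gradient descent on the quadratic f(w) = 0.5 * w**2, where each update uses a gradient computed at weights that are `delay` steps old (the PipeDream-2BW-style rule w[t+1] = w[t] - lr * grad(w[t-1]) when `delay=1`). The function name `sgd_quadratic` and all parameter values are illustrative assumptions. Neither Muon, AdamW, nor the Error Feedback correction is modeled.

```python
def sgd_quadratic(lr, steps, delay):
    """Gradient descent on f(w) = 0.5 * w**2 with a stale gradient.

    delay=0 is the synchronous update; delay=1 evaluates the gradient
    at the previous weight version (one-step gradient delay).
    Returns the full weight trajectory, starting from w_0 = 1.0.
    """
    history = [1.0]  # w_0 (hypothetical starting point)
    for _ in range(steps):
        # Weight version the gradient is computed on; clamp at w_0 early on.
        stale_w = history[max(0, len(history) - 1 - delay)]
        grad = stale_w  # d/dw of 0.5 * w**2 is w
        history.append(history[-1] - lr * grad)
    return history


# Small step size: both variants converge toward the minimum at 0.
sync_small = sgd_quadratic(lr=0.1, steps=50, delay=0)
delayed_small = sgd_quadratic(lr=0.1, steps=50, delay=1)

# Larger step size: the synchronous run still converges,
# while the one-step-delayed run oscillates with growing amplitude.
sync_large = sgd_quadratic(lr=1.5, steps=50, delay=0)
delayed_large = sgd_quadratic(lr=1.5, steps=50, delay=1)
```

On this toy problem the delay only narrows the range of stable step sizes for plain gradient descent. It says nothing about how a particular optimizer behaves under staleness, which is the question the paper addresses.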