paper

One-Step Gradient Delay is Not a Barrier for Large-Scale Asynchronous Pipeline Parallel LLM Pretraining

Co-occurring entities

AdamW Muon PipeDream-2BW

Recent events (1)

arXiv · cs.LG · 12h ago

Asynchronous pipeline parallelism for LLM pretraining made viable with the Muon optimizer and Error Feedback correction

A new arXiv paper challenges the assumption that the gradient staleness introduced by asynchronous pipeline parallelism (specifically PipeDream-2BW) makes training fundamentally unstable, showing that the degradation depends on the optimizer rather than being intrinsic to the delay. The authors demonstrate that the Muon optimizer remains robust under one-step gradient delay where AdamW fails, and they introduce an optimizer-agnostic Error Feedback correction that further closes the gap with synchronous training. Experiments on models of up to 10B parameters confirm that the approach matches synchronous training performance, which could unlock higher GPU utilization by eliminating pipeline bubbles.
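To make "one-step gradient delay" concrete, here is a minimal sketch on a toy problem. It is not the paper's setup: it uses plain gradient descent (neither AdamW nor Muon), a one-dimensional quadratic objective, and hypothetical names (`grad`, `run_gd`). It only illustrates the general point that applying a gradient computed at the previous step's weights narrows the range of step sizes for which training stays stable.

```python
def grad(w, curvature=1.0):
    """Gradient of the toy objective f(w) = 0.5 * curvature * w**2."""
    return curvature * w


def run_gd(steps, lr, delay, w0=1.0):
    """Plain gradient descent where each update uses weights `delay` steps old.

    delay=0 is the synchronous update; delay=1 applies a one-step-stale
    gradient, i.e. w[t+1] = w[t] - lr * grad(w[t-1]).
    """
    history = [w0]  # history[t] holds the weights at step t
    for t in range(steps):
        stale_index = max(0, t - delay)  # no older weights exist at t = 0
        g = grad(history[stale_index])
        history.append(history[-1] - lr * g)
    return history


# Illustrative step sizes (assumptions, not values from the paper).
for lr in (0.5, 1.5):
    sync = run_gd(steps=50, lr=lr, delay=0)
    stale = run_gd(steps=50, lr=lr, delay=1)
    print(f"lr={lr}: |w| sync={abs(sync[-1]):.3e}, one-step delay={abs(stale[-1]):.3e}")
```

On this toy objective the synchronous update converges for both step sizes, while the delayed update converges only for the smaller one and diverges for the larger. The paper's claim is about how this kind of sensitivity changes with the choice of optimizer in real LLM pretraining, which this sketch does not attempt to reproduce.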
