Asynchronous pipeline parallelism for LLM pretraining made viable with Muon optimizer and error feedback correction
A new arXiv paper challenges the assumption that the gradient staleness introduced by asynchronous pipeline parallelism (specifically PipeDream-2BW) makes training fundamentally unstable, showing that the degradation depends on the optimizer rather than being intrinsic to staleness. The authors demonstrate that the Muon optimizer is robust under one-step gradient delay where AdamW fails, and introduce an optimizer-agnostic Error Feedback correction to further close the gap with synchronous training. Experiments on models up to 10B parameters show the approach matching synchronous training performance, potentially unlocking higher GPU utilization by eliminating pipeline bubbles.
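To make "one-step gradient delay" concrete, here is a toy sketch of what staleness does to a plain gradient-descent update. It is not the paper's setup: it uses vanilla gradient descent on the quadratic f(w) = 0.5 * w**2 rather than AdamW or Muon, and the function name `train` and its parameters are hypothetical, chosen only for illustration. With delay 1, each update applies a gradient computed on the previous step's weights.

```python
def train(lr, steps, delay, w0=1.0):
    """Gradient descent on f(w) = 0.5 * w**2 using gradients `delay` steps stale.

    Illustrative toy only: plain gradient descent, not AdamW or Muon.
    """
    history = [w0]  # history[t] holds the weight at step t
    for t in range(steps):
        # Weights the gradient was computed on; clamp to the initial weights
        # for the first `delay` steps, when no older copy exists yet.
        stale_w = history[max(t - delay, 0)]
        grad = stale_w  # f'(w) = w for this quadratic
        history.append(history[-1] - lr * grad)
    return history


# Same step size, with and without one step of gradient delay.
sync = train(lr=1.5, steps=50, delay=0)
stale = train(lr=1.5, steps=50, delay=1)

print("synchronous final |w|:", abs(sync[-1]))
print("one-step-delay max |w| over last 5 steps:", max(abs(w) for w in stale[-5:]))
```

In this toy problem the step size 1.5 contracts the synchronous iterate toward zero, while the delayed iterate oscillates with growing amplitude. That only illustrates why staleness is usually treated as a stability risk. It says nothing about how AdamW, Muon, or the Error Feedback correction behave at scale.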
Related events (8)
Open problem paper questions whether AdamW converges under heavy-tailed gradient noise
An arXiv preprint poses as an open problem whether AdamW, the dominant optimizer for LLM pretraining, can achieve rigorous convergence guarantees under heavy-tailed stochastic gradient noise. The authors note that sign-based optimizers like Lion and Muon already have sharp heavy-tailed convergence rates, while AdamW's second-moment accumulator may create a fundamental obstruction by hiding large gradients. The paper proves a positive weighted-metric benchmark and introduces a corridor lower-bound mechanism to characterize the potential failure mode.
Hamiltonian Probability Gradient Flow Analysis of the Muon Optimizer
This paper develops a rigorous theoretical framework for the Muon optimizer by interpreting its regularized orthogonalization map as the gradient of a Fenchel-dual smoothing of the nuclear norm, identifying Muon updates as mirror/prox steps with momentum as dual coordinates. The authors lift this structure to probability measures over matrix-valued parameters, deriving a mean-field phase-space equation that constitutes a damped Hamiltonian probability dynamics with monotonically decreasing Hamiltonian energy. Exponential convergence rates are established under gradient-dominance and curvature assumptions, and propagation-of-chaos guarantees are provided for the interacting particle system. The framework extends to transformer mixture-of-experts architectures via blockwise Muon probability flows.
RRFP: A Readiness-Driven Runtime for Pipeline-Parallel Training Under Runtime Variability
The paper introduces Runtime-Readiness-First Pipeline (RRFP), a new runtime for pipeline-parallel large-model training that treats schedules as non-binding hint orders rather than strict execution sequences. By combining message-driven asynchronous communication, lightweight tensor-parallel coordination, and ready-set arbitration, RRFP dynamically dispatches work based on actual task readiness, reducing idle bubbles and stage misalignment. Implemented on a Megatron-based framework and evaluated at up to 128 GPUs, RRFP achieves up to 1.77× speedup on language-only workloads and 2.77× on multimodal workloads versus fixed-order baselines, and outperforms the fastest comparable external system by up to 1.84×.
Bebop: MTP with rejection sampling and TV loss achieves 1.8x RL training speedup
Researchers introduce Bebop, a framework for integrating Multi-Token Prediction (MTP) into large-scale RL training pipelines for LLMs. The work identifies that MTP acceptance rates degrade during RL due to entropy fluctuations, and proposes probabilistic rejection sampling plus a novel end-to-end Total Variation (TV) loss that directly optimizes multi-step acceptance rates, achieving acceptance rates of up to 95% and an additional 25% gain in inference throughput. Applied to Qwen3.5, Qwen3.6, and Qwen3.7 models, the method yields up to 1.8x end-to-end acceleration in async RL training. The approach eliminates the need for costly online MTP updating by using pre-RL MTP training with the proposed objectives.
Mechanism-driven internal monitors detect LLM training instability thousands of steps before loss divergence
A new arXiv preprint proposes mechanism-driven monitoring signals derived from the functional roles of critical modules (low-precision flash attention, MoE routers) to detect training instability before it manifests in loss or gradient norms. The authors derive monitors such as spectral entropy of a QK bilinear decomposition and MoE router indicators, showing via fault-injection experiments that these signals trigger thousands of steps ahead of loss divergence. The work targets a high-cost failure mode in frontier LLM training where instability can persist undetected for thousands of steps on expensive accelerator fleets.
PC Layer: Polynomial weight preconditioning for stable LLM pre-training
Researchers propose a PC (preconditioning) layer that applies polynomial preconditioning to reshape the singular-value spectrum of weight matrices during LLM training, improving conditioning and stability. The preconditioned weights merge back into the original architecture at inference time with no overhead. Experiments on Llama-1B pre-training show advantages over standard transformers for both AdamW and Muon optimizers, with theoretical convergence guarantees for deep linear networks.
Interpretability-based pipeline for auditing and shaping post-training learning signals
Researchers introduce a data-centric post-training pipeline that applies interpretability methods to preference datasets before optimization, surfacing latent concepts that separate preferred from dispreferred generations. The approach unifies several interpretability-based training protocols as feature or data interventions that shape reward signals. Empirically, the pipeline diagnoses undesirable signals such as sycophancy and over-stylization, mitigates off-target learning, and can amplify desired properties like safety behaviors and model personality. The work reframes post-training from opaque scalar reward optimization into an auditable, concept-level sculpting process.
FlowPipe: LLM-conditioned Generative Flow Networks for automated data preparation pipeline construction
FlowPipe is a new framework that frames ML data preparation pipeline synthesis as conditional probabilistic flow generation over a directed acyclic graph, using Conditional Generative Flow Networks (C-GFlowNets) with a Trajectory Balance objective. LLM-derived semantic priors are injected into the policy via Feature-wise Linear Modulation (FiLM), and a failure-aware flow objective steers search away from invalid states. Evaluated on 74 real-world datasets across two benchmark suites, FlowPipe improves accuracy by 11.96% on average over SOTA baselines and achieves 12.5x faster training convergence. The work addresses long-standing limitations in automated data pipeline construction, including weak credit assignment and inefficient exploration.
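A note on the Bebop item in the list above: the link between a Total Variation loss and acceptance rates can be seen in the standard single-token speculative-sampling rule, where a token x drawn from the draft distribution is accepted with probability min(1, target[x] / draft[x]). Under that rule the expected acceptance rate equals 1 minus the total variation distance between the two distributions. The sketch below checks this identity numerically. The summary does not spell out Bebop's exact sampling rule or its multi-step loss, so treating this as the underlying mechanism is an assumption, and the function names and example distributions are hypothetical.

```python
def acceptance_rate(target, draft):
    """Expected acceptance probability for one drafted token.

    Assumes the standard speculative-sampling rule: draw x from `draft`,
    accept with probability min(1, target[x] / draft[x]). Summing
    draft[x] * min(1, target[x] / draft[x]) over x gives sum(min(p, q)).
    """
    return sum(min(p, q) for p, q in zip(target, draft))


def total_variation(target, draft):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(p - q) for p, q in zip(target, draft))


# Hypothetical three-token vocabulary, for illustration only.
target = [0.6, 0.3, 0.1]
draft = [0.4, 0.4, 0.2]

rate = acceptance_rate(target, draft)
tv = total_variation(target, draft)
print("acceptance rate:", rate)
print("1 - TV distance:", 1 - tv)
```

Under this rule, lowering the TV distance between draft and target distributions raises the per-token acceptance rate one for one, which is why a TV-style objective is a natural fit for optimizing acceptance.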

