4 · arXiv cs.LG (Machine Learning) · 12d ago

MG-ADSGD achieves optimal communication complexity for decentralized stochastic strongly convex optimization

Researchers propose Multi-Gossip Accelerated DSGD (MG-ADSGD), a decentralized stochastic optimization algorithm that simultaneously achieves accelerated dependence on both the condition number (√κ) and the network spectral gap (1/√(1-β)), a combination no prior stochastic method had attained. The algorithm couples gossip depth with mini-batch size, so that additional communication rounds improve both consensus accuracy and gradient variance reduction. The authors claim that the resulting communication complexity is, up to logarithmic factors, the best currently known for decentralized stochastic strongly convex optimization.
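
To make the "gossip depth" idea concrete, here is a minimal sketch of plain multi-round gossip averaging on a ring network. It is not the paper's algorithm: the ring topology, the 1/3 mixing weights, and every function name are assumptions chosen for illustration, and it uses ordinary repeated averaging rather than an accelerated gossip scheme. It only shows that k gossip rounds shrink the consensus error by a factor of at most β^k, where β is the second-largest eigenvalue magnitude of the mixing matrix.

```python
import numpy as np

def ring_mixing_matrix(n: int) -> np.ndarray:
    """Symmetric, doubly stochastic mixing matrix for a ring of n >= 3 nodes.

    Each node averages itself and its two neighbors with weight 1/3
    (an illustrative choice, not taken from the paper).
    """
    W = np.zeros((n, n))
    for i in range(n):
        for j in (i - 1, i, i + 1):
            W[i, j % n] += 1.0 / 3.0
    return W

def mixing_rate(W: np.ndarray) -> float:
    """Return beta, the second-largest eigenvalue magnitude of symmetric W."""
    magnitudes = np.sort(np.abs(np.linalg.eigvalsh(W)))
    return float(magnitudes[-2])

def multi_gossip(X: np.ndarray, W: np.ndarray, rounds: int) -> np.ndarray:
    """Apply `rounds` neighbor-averaging steps to the row-stacked node states X."""
    for _ in range(rounds):
        X = W @ X
    return X

def consensus_error(X: np.ndarray) -> float:
    """Frobenius distance between the node states and their network average."""
    return float(np.linalg.norm(X - X.mean(axis=0, keepdims=True)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = ring_mixing_matrix(8)
    X = rng.normal(size=(8, 5))  # one 5-dimensional iterate per node
    beta = mixing_rate(W)
    for k in (1, 4, 16):
        err = consensus_error(multi_gossip(X, W, k))
        bound = beta**k * consensus_error(X)
        print(f"rounds={k:2d}  error={err:.3e}  bound={bound:.3e}")
```

Separately, averaging b independent stochastic gradients divides their variance by b, which is the variance-reduction effect a larger mini-batch provides. How MG-ADSGD ties the batch size to the number of gossip rounds is not specified in this summary and is not modeled here.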

Training Infrastructure Multi-Gossip Accelerated DSGD Accelerated Decentralized Stochastic Gradient Descent for Strongly Convex Optimization

Related guides (1)

Training Infrastructure · Topic guide

Training Infrastructure: The Compute Arms Race Powering Modern AI

Related events (8)

6 · arXiv · cs.LG · 24d ago

GADD: Gibbs-Accelerated Discrete Diffusion Achieves Polylog Sampling Complexity

This paper introduces Gibbs-Accelerated Discrete Diffusion (GADD), a corrector method for uniform-rate discrete diffusion models that constructs Gibbs posterior likelihoods directly from the concrete score function without additional training. GADD achieves O(polylog(ε⁻¹)) sampling complexity, the first such rate for diffusion-based samplers in this setting. Experiments on synthetic data, zero-shot text sampling, and zero-shot conditional music generation show consistent improvements in sample quality and wall-clock efficiency over Euler and CTMC baselines. The work also introduces a novel induction-based theoretical framework for analyzing predictor-corrector methods in discrete diffusion.

Evaluation and Benchmarking Inference Economics Gibbs-Accelerated Discrete Diffusion (GADD) predictor-corrector methods discrete diffusion models +2 more

5 · arXiv · cs.CL · 11d ago

ADAS: Attention-Discounted Adaptive Sampler improves parallel decoding for masked diffusion language models

Researchers propose ADAS, a training-free reranking rule for masked diffusion language model decoding that addresses token interaction failures when several tokens are committed in parallel. The method greedily penalizes candidates that attend strongly to already-selected uncertain positions, using attention weights as soft marginal penalties rather than hard constraints. Evaluated on LLaDA-8B-Base and Dream-7B-Base across GSM8K, MATH500, HumanEval, and MBPP, ADAS improves performance in the low-NFE regime (few function evaluations) by 9–10 percentage points on average when plugged into existing samplers, with only 3.1% runtime overhead.

Frontier Model Releases Inference Economics LLaDA-8B-Base MATH500 EB-Sampler +6 more

5 · arXiv · cs.CL · 11d ago

GGRO: Gradient-Guided Reward Optimization for inference-time LLM alignment

Researchers introduce Gradient-Guided Reward Optimization (GGRO), an inference-time alignment method that uses gradient signals from a reward model to inject 'nudging tokens' at high-uncertainty decoding steps, rather than relying on sampling-intensive re-ranking approaches like Best-of-N. The method monitors token-level entropy to detect distribution drift and steers generation trajectories directly, claiming improved robustness to reward hacking with minimal computational overhead. Experiments show gains across safety, helpfulness, and reasoning benchmarks compared to standard inference-time alignment baselines.
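
The entropy monitoring mentioned above can be illustrated with a short sketch. It computes the Shannon entropy of a next-token distribution from raw logits and flags a decoding step as high-uncertainty when the entropy exceeds a threshold. The function names and the threshold value are hypothetical, and nothing here reproduces GGRO's reward-gradient step or its token injection; it only shows one plausible form of the uncertainty check.

```python
import numpy as np

def token_entropy(logits: np.ndarray) -> float:
    """Shannon entropy (in nats) of softmax(logits) for one decoding step."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    probs = np.exp(shifted) / np.sum(np.exp(shifted))
    return float(-np.sum(probs * np.log(probs + 1e-12)))

def is_high_uncertainty(logits: np.ndarray, threshold: float = 1.0) -> bool:
    """Flag a step whose entropy exceeds `threshold` (a hypothetical cutoff)."""
    return token_entropy(logits) > threshold

if __name__ == "__main__":
    confident = np.array([10.0, 0.0, 0.0, 0.0])  # one token dominates
    uncertain = np.zeros(4)                       # uniform over four tokens
    print(is_high_uncertainty(confident), is_high_uncertainty(uncertain))
```

In a real decoder the threshold would need tuning per model and vocabulary size, since the maximum possible entropy grows with the logarithm of the vocabulary size.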

Inference Economics Alignment and RLHF Best-of-N Sampling Gradient-Guided Reward Optimization

5 · arXiv · cs.CL · 9d ago

AGDO: Attention-guided denoising and optimization framework improves diffusion language model reasoning

Researchers propose AGDO, a framework that replaces random masking in diffusion large language models (dLLMs) with an attention-guided denoising order and token weighting during fine-tuning and reinforcement learning. The work is motivated by an empirical finding that tokens with stronger attention to unmasked context are more stable and more critical for reasoning. Experiments on math and coding benchmarks show that AGDO outperforms existing post-training methods for dLLMs, advancing the case for attention-aware training in parallel-decoding language models.

Alignment and RLHF AGDO Beyond Fully Random Masking: Attention-Guided Denoising and Optimization for Diffusion Language Models

5 · arXiv · cs.LG · 3d ago

Kolmogorov Regression lifts diffusion policies to Cameron-Martin space for robust long-horizon control

Researchers introduce a backward Kolmogorov equation framework that reformulates diffusion policy training as a deterministic boundary-value PDE problem in Cameron-Martin space, replacing stochastic score matching. The approach uses a precision-weighted Cameron-Martin loss and a Kolmogorov residual as an inference-time failure detector, yielding convergence guarantees tied to kernel effective rank rather than action dimension. Validation on the PushT manipulation benchmark shows a 17% improvement in episode reward and a 67.6% reduction in inter-step drift; a 6-station manufacturing scheduling task shows 28.4% lower RMSE than LSTM baselines and a 96% reduction in deadlock events via Hamilton-Jacobi reachability certification.

Agent and Tool Ecosystem Hamilton-Jacobi reachability Kolmogorov Regression for Robust Diffusion Policies PushT +1 more

5 · arXiv · cs.LG · 25d ago

GoBOED: Goal-Driven Bayesian Optimal Experimental Design for Decision-Focused Robustness

GoBOED is a new framework for Bayesian optimal experimental design (BOED) that replaces information-gain maximization with direct optimization for a specified downstream decision objective. It combines an amortized variational posterior surrogate with a differentiable convex decision layer to enable gradient-based, decision-focused design optimization. The authors prove that GoBOED gradients are insensitive to parameter directions irrelevant to the decision goal, formally justifying why goal-driven design achieves equivalent decision quality over a wider range of experimental designs. Empirical results across source localization, epidemic management, and pharmacokinetic control show improved alignment with decision objectives compared to goal-agnostic BOED.

Evaluation and Benchmarking Agent and Tool Ecosystem GoBOED differentiable convex optimization amortized variational inference +1 more

6 · arXiv · cs.AI · 23d ago

Skill-Conditioned Gated Self-Distillation (SGSD) for LLM Reasoning

SGSD is a new on-policy self-distillation method for LLM reasoning that replaces trusted privileged information (e.g., reference answers) with an experience-derived skill bank of skill-mistake pairs. It constructs a multi-teacher pool, validates each teacher's contribution via a verifier, and applies a gated objective to distill informative disagreements while suppressing noisy signals. On Qwen3-1.7B, SGSD outperforms GRPO by 6.2% and answer-conditioned OPSD by 1.7% on average across AIME24, AIME25, and HMMT25. The method relaxes the assumption of trusted privileged information, making self-distillation more practical under weaker supervision.

Frontier Model Releases Evaluation and Benchmarking OPSD AIME24 SGSD +7 more

5 · arXiv · cs.LG · 19d ago

Tight Convergence Theory for Error Feedback Algorithms in Distributed Optimization

This paper provides tight convergence analyses for two major error-feedback algorithms—classic Error Feedback (EF) and Error Feedback 21 (EF21)—used to mitigate communication bottlenecks in distributed learning. The authors identify optimal step-size choices and construct tailored Lyapunov functions for each method, yielding guarantees that hold independently of the number of agents and recover the best known single-agent bounds. The work clarifies the relative performance of these gradient compression variants, which has remained poorly understood despite widespread use.
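
For readers unfamiliar with EF21, the sketch below shows its basic update as commonly stated: each agent keeps a gradient estimate and sends only a compressed correction toward its fresh local gradient, while the server steps along the average of the estimates. The top-k compressor, the quadratic test problem, the step size, and all function names are illustrative assumptions; the paper's optimal step sizes and Lyapunov analysis are not reproduced here.

```python
import numpy as np

def top_k(v: np.ndarray, k: int) -> np.ndarray:
    """Keep the k largest-magnitude entries of v and zero out the rest."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

def ef21(grad_fns, x0: np.ndarray, step: float, k: int, iters: int) -> np.ndarray:
    """Run EF21 with a top-k compressor.

    grad_fns: one gradient function per agent (illustrative interface).
    Each agent i updates g_i <- g_i + top_k(grad_i(x) - g_i, k), so only
    the compressed difference would need to be communicated.
    """
    x = x0.astype(float).copy()
    g_local = [top_k(grad(x), k) for grad in grad_fns]  # initial estimates
    for _ in range(iters):
        g = np.mean(g_local, axis=0)  # server averages the agents' estimates
        x = x - step * g              # gradient-type step on the shared model
        g_local = [gi + top_k(grad(x) - gi, k)
                   for gi, grad in zip(g_local, grad_fns)]
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    targets = rng.normal(size=(5, 4))  # agent i minimizes 0.5 * ||x - b_i||^2
    grad_fns = [lambda x, b=b: x - b for b in targets]
    x = ef21(grad_fns, np.zeros(4), step=0.05, k=1, iters=1000)
    print(np.linalg.norm(x - targets.mean(axis=0)))
```

On this toy problem the minimizer of the average objective is the mean of the targets, so the printed distance indicates how closely the compressed method tracks the uncompressed solution.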

Training Infrastructure Inference Economics Error Feedback 21 (EF21) Error Feedback (EF) Lyapunov function +2 more