4 · arXiv cs.LG (Machine Learning) · 12h ago

SOAP and Muon optimizers outperform Adam for training machine learning interatomic potentials

A new arXiv preprint systematically compares three matrix-structured optimizers (Muon, SOAP, and SOAP-Muon) against Adam for training machine learning interatomic potentials (MLIPs), specifically NequIP and Allegro models. SOAP and SOAP-Muon consistently outperform Adam in both convergence speed and final accuracy, with gains especially pronounced under partial force supervision. The paper argues that optimizer choice is an underexplored but impactful design axis for scientific ML models.

Related events (8)

6 · arXiv · cs.LG · 3d ago

Asynchronous pipeline parallelism for LLM pretraining made viable with Muon optimizer and error feedback correction

A new arXiv paper challenges the assumption that gradient staleness in asynchronous pipeline parallelism (specifically PipeDream-2BW) makes training fundamentally unstable, showing that the degradation is optimizer-dependent rather than intrinsic. The authors demonstrate that the Muon optimizer is robust under one-step gradient delay where AdamW fails, and introduce an optimizer-agnostic Error Feedback correction to further close the gap with synchronous training. Experiments on models up to 10B parameters confirm the approach matches synchronous training performance, potentially unlocking higher GPU utilization by eliminating pipeline bubbles.

6 · arXiv · cs.LG · 1mo ago

Hamiltonian Probability Gradient Flow Analysis of the Muon Optimizer

This paper develops a rigorous theoretical framework for the Muon optimizer by interpreting its regularized orthogonalization map as the gradient of a Fenchel-dual smoothing of the nuclear norm, identifying Muon updates as mirror/prox steps with momentum as dual coordinates. The authors lift this structure to probability measures over matrix-valued parameters, deriving a mean-field phase-space equation that constitutes a damped Hamiltonian probability dynamics with monotonically decreasing Hamiltonian energy. Exponential convergence rates are established under gradient-dominance and curvature assumptions, and propagation-of-chaos guarantees are provided for the interacting particle system. The framework extends to transformer mixture-of-experts architectures via blockwise Muon probability flows.

6 · arXiv · cs.LG · 10d ago

Open problem paper questions whether AdamW converges under heavy-tailed gradient noise

An arXiv preprint poses as an open problem whether AdamW, the dominant optimizer for LLM pretraining, admits rigorous convergence guarantees under heavy-tailed stochastic gradient noise. The authors note that sign-based optimizers like Lion and Muon already have sharp heavy-tailed convergence rates, while AdamW's second-moment accumulator may create a fundamental obstruction by hiding large gradients. The paper proves a positive weighted-metric benchmark and introduces a corridor lower-bound mechanism to characterize the potential failure mode.

5 · arXiv · cs.LG · 10d ago

MAS-PromptBench: Systematic study of prompt optimization in multi-agent LLM systems

A new arXiv preprint introduces MAS-PromptBench, a benchmark and study examining when and how much system-prompt optimization improves multi-agent LLM systems (MAS). The authors evaluate two prompt optimizers across diverse MAS configurations varying in task, workflow, communication protocol, and team size. Results show prompt optimization can unlock significant gains but also expose open challenges, particularly around the exponentially growing search space as agent count increases.

4 · GitHub Trending · 1mo ago

Unsloth: Web UI and Library for Efficient Fine-tuning of Open Models

Unsloth is an open-source Python library and web UI (Unsloth Studio) for efficient fine-tuning and local inference of open-weight models including Gemma 4, Qwen3, DeepSeek, and GPT-OSS variants. The project has accumulated over 64,000 GitHub stars with continued daily growth (+139 today), indicating strong community adoption. It targets practitioners who want to train and run large models locally with reduced memory and compute requirements.

7 · arXiv · cs.AI · 1mo ago

SkillOpt: Systematic Text-Space Optimizer for Self-Evolving Agent Skills

SkillOpt introduces a principled optimization framework for agent skills, treating the skill document as an external trainable state analogous to model weights. A separate optimizer model converts scored rollouts into bounded edits (add/delete/replace) on a skill document, accepting only edits that improve held-out validation scores. Evaluated across six benchmarks, seven target models, and three execution harnesses (direct chat, Codex, Claude Code), SkillOpt achieves best or tied performance on all 52 evaluated cells, lifting GPT-5.5 no-skill accuracy by up to +24.8 points inside the Codex agentic loop. Optimized skill artifacts also transfer across model scales and execution environments without further optimization.
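The accept-only-if-better loop described above can be illustrated with a toy sketch. This is not SkillOpt's implementation: the edit encoding, the `apply_edit` and `optimize_skill` helpers, and the `validate` scoring function are all hypothetical stand-ins, and the proposals are hard-coded here, whereas the summary says a separate optimizer model derives them from scored rollouts.

```python
from typing import Callable, List, Tuple

Skill = List[str]   # hypothetical: a skill document as a list of lines
Edit = Tuple        # ("add", line) | ("delete", index) | ("replace", index, line)


def apply_edit(skill: Skill, edit: Edit) -> Skill:
    """Return a new skill document with one bounded edit applied."""
    new = list(skill)
    if edit[0] == "add":
        new.append(edit[1])
    elif edit[0] == "delete":
        del new[edit[1]]
    elif edit[0] == "replace":
        new[edit[1]] = edit[2]
    else:
        raise ValueError(f"unknown edit kind: {edit[0]}")
    return new


def optimize_skill(skill: Skill, proposals: List[Edit],
                   validate: Callable[[Skill], float]) -> Skill:
    """Greedily keep only the edits that raise the held-out validation score."""
    best = validate(skill)
    for edit in proposals:
        candidate = apply_edit(skill, edit)
        score = validate(candidate)
        if score > best:          # strict improvement required
            skill, best = candidate, score
    return skill


# Toy validation score (assumption): reward useful lines, penalize the rest.
USEFUL = {"use tools carefully", "check units"}


def validate(skill: Skill) -> float:
    hits = sum(1 for line in skill if line in USEFUL)
    return hits - 0.5 * (len(skill) - hits)


proposals = [
    ("add", "check units"),
    ("add", "noise"),
    ("replace", 0, "use tools carefully"),
    ("delete", 0),
]
result = optimize_skill(["use tools"], proposals, validate)
```

In this toy run the first and third proposals are accepted and the other two are rejected, because they lower the validation score.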

4 · arXiv · cs.LG · 36h ago

Empirical comparison finds quantum ML models do not yet surpass classical baselines

A new arXiv preprint presents a systematic empirical comparison of seven quantum machine learning (QML) model pairs against classical counterparts across supervised learning and reinforcement learning tasks. Results show QML models do not yet surpass classical baselines in prediction performance, policy stability, or training time, though some promise is noted for noise filtering and false positive control. The study identifies open challenges in hardware environments, training efficiency, and convergence stability, and releases code publicly.

5 · arXiv · cs.CL · 22d ago

Manifold Power Iteration redesigns MoE routers by aligning rows with expert singular directions

A new arXiv preprint proposes Manifold Power Iteration (MPI), a principled redesign of Mixture-of-Experts router matrices that aligns each router row with the principal singular direction of its associated expert. The method uses a 'Power-then-Retract' paradigm to enforce norm constraints while driving convergence toward these singular directions. Empirical validation spans MoE pretraining at scales from 1B to 11B parameters, showing improved model effectiveness.
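As a rough illustration of the general idea, the sketch below runs textbook power iteration followed by renormalization. It is not the paper's algorithm. It assumes the expert is a single weight matrix `W` of shape (d_out, d_in), that the router row `r` lives in the input space, and that "retract" means projecting back onto the unit sphere. The function names are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(W: Matrix, v: Vector) -> Vector:
    """Compute W v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def rmatvec(W: Matrix, u: Vector) -> Vector:
    """Compute W^T u."""
    return [sum(W[i][j] * u[i] for i in range(len(W)))
            for j in range(len(W[0]))]


def normalize(v: Vector) -> Vector:
    """Scale v to unit Euclidean norm (assumes v is nonzero)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def power_then_retract(W: Matrix, r: Vector, steps: int = 1) -> Vector:
    """Move router row r toward the top right singular vector of W."""
    for _ in range(steps):
        r = rmatvec(W, matvec(W, r))  # power step: r <- W^T W r
        r = normalize(r)              # retract: restore the unit-norm constraint
    return r


# Toy expert with singular values 3 and 1; the principal direction is e_1.
W = [[3.0, 0.0], [0.0, 1.0]]
r = power_then_retract(W, [1.0, 1.0], steps=50)
```

Each power step amplifies the component of `r` along the largest singular direction, and the retraction keeps the row's norm fixed, so in this toy case `r` approaches the first coordinate axis.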