Entity · technique

AdamW

technique · active · adamw-69fba5db · 11 events · first seen May 21, 2026

Aliases: AdamW

More like this (12)

Adam AdamO Andyyyy64 BadWAM BenCzechMark AIME26 AIME24 AI Andrew AIME25 Ada Andrew Wagenmaker Scott Wu

Guides (1)

AdamW · Concept

AdamW: The Default Optimizer Powering Modern AI Training

Recent events (11)

5 · arXiv · cs.LG · 2d ago · source ↗

Spectral-norm SAM combined with Muon optimizer achieves strong ImageNet results

A new arXiv preprint introduces a matrix-aware variant of Sharpness-Aware Minimization (SAM) that uses a layerwise spectral inner perturbation for hidden-layer weights, combined with either AdamW/SGDW or the Muon optimizer for the outer update. Experiments on ImageNet-1K with ViT-Small/16 and ResNet-50 show the spectral SAM + Muon combination achieves the best validation accuracy among evaluated methods. The work connects the recently popular Muon optimizer's matrix-structure philosophy to the SAM generalization framework.

Training Infrastructure · Evaluation and Benchmarking · Sharpness-Aware Minimization · AdamW · ImageNet · +2 more

6 · arXiv · cs.LG · Jul 22, 2026 · source ↗

ISO: Isospectral Optimization framework for RLVR training efficiency and model merging

Researchers introduce Isospectral Optimization (ISO), a framework that exploits 'spectral inheritance' in RLVR-trained language models — the observation that reward-driven adaptation changes singular frames while preserving base model weight spectra. ISO has two instantiations: ISO-Merger, a data-free method for combining specialist models without gradient updates or on-policy distillation, and ISO-Optimizer, which applies standard optimizers (AdamW, Muon) only to frame variables, achieving equivalent accuracy in roughly 2.7x fewer training steps on Qwen3-8B-Base. The work proposes a principled answer to the underexplored optimization layer between reward signals and weight updates in RLVR pipelines.

Frontier Model Releases · Alignment and RLHF · Qwen3-8B-Base · AdamW · Isospectral Optimization · +2 more

5 · arXiv · cs.AI · Jul 20, 2026 · source ↗

Muon optimizer shows large gains over AdamW in sparse-reward agentic RL on ALFWorld

A new arXiv preprint investigates the Muon optimizer for reinforcement learning post-training of language model agents, comparing it to AdamW on the ALFWorld benchmark using Qwen2.5-0.5B-Instruct. Under Group-in-Group Policy Optimization (GiGPO), applying Muon to hidden weight matrices raises validation success from 0.290 to 0.546 (+88%), with further gains at lower learning rates reaching 0.901 success. The results are exploratory (single-seed, single-task) but suggest that optimizer choice, advantage estimator, and learning rate interact significantly in agentic RL settings.

Agent and Tool Ecosystem · Alignment and RLHF · ALFWorld · GRPO · Qwen2.5-7B-Instruct-1M · +4 more

6 · arXiv · cs.CL · Jul 1, 2026 · source ↗

Signed-Permutation Gauge Theory for RMSNorm Transformers Improves Coordinate Transport

A new arXiv preprint formalizes the residual-stream gauge symmetry of transformer architectures, showing that RMSNorm models have a signed-permutation gauge group B_d = S_d ⋉ {±1}^d rather than the permutation-only S_d of LayerNorm models. The authors introduce sign-marginalized Hungarian matching and demonstrate that coordinate-preserving transport along fine-tuning trajectories recovers 91.1% of cross-run coordinates versus 60.3% for endpoint matching. Practical consequences include dramatically improved sparse autoencoder reconstruction (NMSE 0.004 vs 1.08), preserved steering vector effects, and correct AdamW optimizer state transfer — with implications for mechanistic interpretability, model merging, and activation engineering.

Evaluation and Benchmarking · AI Safety Research · AdamW · TinyLlama · Qwen · +2 more
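
The gauge-symmetry claim in the summary above can be checked on a toy vector. The following is a minimal sketch, not the preprint's code: it assumes gain-free normalization layers, and the helper names (`rms_norm`, `layer_norm`, `signed_perm`) are hypothetical. It shows that RMS normalization commutes with a signed permutation of coordinates, while layer normalization commutes only with a plain permutation, because sign flips change the mean it subtracts.

```python
import math

def rms_norm(x):
    # Gain-free RMSNorm: divide by the root mean square of the coordinates.
    rms = math.sqrt(sum(v * v for v in x) / len(x))
    return [v / rms for v in x]

def layer_norm(x):
    # Gain-free LayerNorm: subtract the mean, then divide by the standard deviation.
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var) for v in x]

def signed_perm(x, perm, signs):
    # Output coordinate i is signs[i] * x[perm[i]].
    return [s * x[j] for j, s in zip(perm, signs)]

def close(a, b, tol=1e-12):
    return all(abs(u - v) <= tol for u, v in zip(a, b))

x = [0.5, -1.0, 2.0, 3.0]
perm = [2, 0, 3, 1]
signs = [1, -1, -1, 1]
no_flip = [1, 1, 1, 1]

# RMSNorm: normalizing then transforming equals transforming then normalizing.
rms_commutes = close(rms_norm(signed_perm(x, perm, signs)),
                     signed_perm(rms_norm(x), perm, signs))

# LayerNorm: holds for a pure permutation, fails once signs are flipped.
ln_commutes_perm = close(layer_norm(signed_perm(x, perm, no_flip)),
                         signed_perm(layer_norm(x), perm, no_flip))
ln_commutes_signed = close(layer_norm(signed_perm(x, perm, signs)),
                           signed_perm(layer_norm(x), perm, signs))

print(rms_commutes, ln_commutes_perm, ln_commutes_signed)
```

With learnable gains the same check would need the gain vector permuted along with the coordinates; that case is left out here.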

6 · arXiv · cs.LG · Jun 30, 2026 · source ↗

Asynchronous pipeline parallelism for LLM pretraining made viable with Muon optimizer and error feedback correction

A new arXiv paper challenges the assumption that gradient staleness in asynchronous pipeline parallelism (specifically PipeDream-2BW) is fundamentally unstable, showing the degradation is optimizer-dependent rather than intrinsic. The authors demonstrate that the Muon optimizer is robust under one-step gradient delay where AdamW fails, and introduce an optimizer-agnostic Error Feedback correction to further close the gap with synchronous training. Experiments on models up to 10B parameters confirm the approach matches synchronous training performance, potentially unlocking higher GPU utilization by eliminating pipeline bubbles.

Training Infrastructure · Inference Economics · AdamW · One-Step Gradient Delay is Not a Barrier for Large-Scale Asynchronous Pipeline Parallel LLM Pretraining · Muon · +1 more

6 · arXiv · cs.LG · Jun 23, 2026 · source ↗

Open problem paper questions whether AdamW converges under heavy-tailed gradient noise

An arXiv preprint poses as an open problem whether AdamW, the dominant optimizer for LLM pretraining, admits rigorous convergence guarantees under heavy-tailed stochastic gradient noise. The authors note that sign-based optimizers like Lion and Muon already have sharp heavy-tailed convergence rates, while AdamW's second-moment accumulator may create a fundamental obstruction by hiding large gradients. The paper proves a positive weighted-metric benchmark and introduces a corridor lower-bound mechanism to characterize the potential failure mode.

Training Infrastructure · Frontier Model Releases · AdamW · Lion · AdaGrad · +1 more

5 · arXiv · cs.LG · Jun 12, 2026 · source ↗

Analysis of on-policy distillation reveals sparse, geometrically structured parameter updates

A new arXiv paper analyzes on-policy distillation (OPD) — a post-training method combining on-policy student trajectories with dense teacher supervision — across language and vision-language model pairs. The authors find that OPD updates are coordinate-sparse and distributed across layers (FFN-heavy), and that training only the discovered sparse subnetwork recovers near-full performance. Geometrically, updates are numerically full-rank but spectrally concentrated, falling disproportionately on near-zero weight coordinates, suggesting OPD retains distinct geometric signatures rather than behaving like ordinary dense parameter rewriting.

Evaluation and Benchmarking · Alignment and RLHF · on-policy distillation · AdamW · Dense Supervision, Sparse Updates: On the Sparsity and Geometry of On-Policy Distillation

5 · arXiv · cs.AI · Jun 9, 2026 · source ↗

AdamO optimizer and dynamical isometry regularization preserve plasticity in continual learning

A new arXiv preprint connects plasticity loss in continual learning to the empirical Neural Tangent Kernel and identifies dynamical isometry—keeping layer-wise Jacobian singular values near one—as a key mechanism for maintaining learning capacity under non-stationarity. The authors propose an isometry-promoting regularization scheme that can reactivate dormant ReLU units and introduce AdamO, an Adam-style optimizer that decouples isometry regularization from gradient updates analogously to AdamW. The methods are evaluated on supervised and reinforcement-learning continual-learning benchmarks, consistently matching or outperforming prior approaches. The work also reinterprets existing plasticity-preserving methods as targeting only partial isometry measures.

Alignment and RLHF · AdamW · Neural Tangent Kernel · AdamO · +1 more
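
For readers unfamiliar with the decoupling that AdamO is said to mirror, here is a minimal sketch of one standard AdamW update on plain Python lists. It illustrates AdamW only, not AdamO, whose details are not given in the summary; the function name `adamw_step` and the default hyperparameters are illustrative assumptions.

```python
import math

def adamw_step(params, grads, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update on flat lists of floats; t is the 1-based step count."""
    new_params, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(params, grads, m, v):
        # Moment estimates see the raw gradient only; no decay term is mixed in.
        m_i = beta1 * m_i + (1 - beta1) * g
        v_i = beta2 * v_i + (1 - beta2) * g * g
        m_hat = m_i / (1 - beta1 ** t)  # bias-corrected first moment
        v_hat = v_i / (1 - beta2 ** t)  # bias-corrected second moment
        # Decoupled weight decay: shrink the parameter directly,
        # outside the adaptive rescaling by sqrt(v_hat).
        p = p - lr * weight_decay * p
        p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
        new_params.append(p)
        new_m.append(m_i)
        new_v.append(v_i)
    return new_params, new_m, new_v

# Usage: a single step on two parameters, starting from zero moments.
params, m, v = adamw_step([1.0, -2.0], [0.5, 0.0], [0.0, 0.0], [0.0, 0.0], t=1)
```

The point of the decoupling is visible in the second parameter: its gradient is zero, so the adaptive term contributes nothing, yet the parameter still shrinks by the factor `1 - lr * weight_decay`.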

4 · arXiv · cs.LG · Jun 5, 2026 · source ↗

PC Layer: Polynomial weight preconditioning for stable LLM pre-training

Researchers propose a PC (preconditioning) layer that applies polynomial preconditioning to reshape the singular-value spectrum of weight matrices during LLM training, improving conditioning stability. The preconditioned weights merge back into the original architecture at inference time with no overhead. Experiments on Llama-1B pre-training show advantages over standard transformers for both AdamW and Muon optimizers, with theoretical convergence guarantees for deep linear networks.

Training Infrastructure · AdamW · PC Layer: Polynomial Weight Preconditioning for Improving LLM Pre-Training · Llama 1B · +1 more

5 · arXiv · cs.CL · May 26, 2026 · source ↗

Mapping the Schedule × Bit-Width Boundary in Sub-100M Quantisation-Aware Training

A large factorial grid study (1,345 runs in total across two phases) tests whether the optimal learning-rate schedule differs by bit-width during from-scratch quantisation-aware training (QAT) of sub-100M decoder language models. The primary hypothesis, that INT6 QAT requires a different schedule than FP16/INT8, is falsified: a 33% warmdown fraction (wd33) is optimal across all precisions and model sizes from 5M to 350M. For INT4, a regime boundary is identified near 50M parameters: above it, wd33 is decisively optimal; below it, schedule choice falls within seed-level noise. The study also establishes a log-linear scaling law for the INT6 quantisation penalty that successfully predicts held-out model sizes.

Training Infrastructure · Open Weights Progress · warmdown learning-rate schedule · Quantisation-Aware Training (QAT) · AdamW · +2 more
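
To make the "33% warmdown fraction" concrete, the sketch below computes a learning rate that stays at its peak and then decays over the final third of training. The summary states only the fraction; the linear decay to zero, the absence of warmup, and the function name `lr_at_step` are assumptions made for illustration, not details from the study.

```python
def lr_at_step(step, total_steps, peak_lr, warmdown_frac=0.33):
    """Constant peak LR, then linear decay to zero over the last warmdown_frac of steps."""
    warmdown_steps = int(round(total_steps * warmdown_frac))
    if warmdown_steps == 0:
        return peak_lr  # no warmdown phase configured
    warmdown_start = total_steps - warmdown_steps
    if step < warmdown_start:
        return peak_lr
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / warmdown_steps

# Usage: with 300 total steps, the last 99 steps form the warmdown phase.
schedule = [lr_at_step(s, total_steps=300, peak_lr=1.0) for s in range(301)]
```

Other warmdown shapes (cosine, decay to a nonzero floor) fit the same interface by changing only the final return expression.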

7 · arXiv · cs.AI · May 21, 2026 · source ↗

Quantifying Hyperparameter Transfer: Embedding Layer Learning Rate as Key Driver of μP Benefits

This paper develops a three-metric framework to quantify hyperparameter transfer quality across model scales, targeting the problem of extrapolating optimal hyperparameters from small to large LLMs. The central empirical finding is that the well-known advantage of Maximal Update Parameterization (μP) over standard parameterization (SP) with AdamW largely reduces to a single factor: the embedding layer learning rate. In SP, the embedding layer acts as a training bottleneck causing instabilities; scaling its learning rate by model width to match μP substantially stabilizes training and improves transfer. The paper also characterizes how weight decay affects scaling law fit quality versus extrapolation robustness in opposite directions.

Training Infrastructure · Frontier Model Releases · hyperparameter transfer · embedding layer learning rate · AdamW · +2 more