5 · arXiv cs.LG (Machine Learning) · 18d ago

Optimal Mixture Transport (OMT): Biconvex Formulation for Scalable, Stable Optimal Transport

This paper introduces Optimal Mixture Transport (OMT), a framework that reformulates optimal transport between probability distributions as a strictly biconvex optimization problem with a provably unique global minimizer. By operating at the level of mixture components (modeled as exponential-family distributions) rather than individual samples, OMT decouples computational complexity from sample size. The authors provide theoretical stability guarantees showing bounded perturbations yield bounded changes in transport plans, and validate the approach on image data and large-scale single-cell RNA sequencing datasets.
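
To illustrate why working at the component level removes the dependence on sample size, here is a minimal sketch. It is not the paper's biconvex formulation. It assumes 1-D Gaussian components (one possible exponential-family choice), uses the closed-form squared 2-Wasserstein distance between Gaussians as the component-to-component cost, and swaps in plain entropic Sinkhorn iterations as a stand-in solver. The function names and the `reg` and `n_iters` parameters are illustrative, not taken from the paper.

```python
import numpy as np


def gaussian_w2_squared(mean_a, std_a, mean_b, std_b):
    """Closed-form squared 2-Wasserstein distance between 1-D Gaussians."""
    return (mean_a - mean_b) ** 2 + (std_a - std_b) ** 2


def component_cost_matrix(means_a, stds_a, means_b, stds_b):
    """Pairwise cost between every source and every target component."""
    return gaussian_w2_squared(
        means_a[:, None], stds_a[:, None], means_b[None, :], stds_b[None, :]
    )


def sinkhorn_plan(weights_a, weights_b, cost, reg=1.0, n_iters=500):
    """Entropic OT between mixture weights (a stand-in solver, not OMT)."""
    kernel = np.exp(-cost / reg)
    u = np.ones_like(weights_a)
    for _ in range(n_iters):
        v = weights_b / (kernel.T @ u)  # match target (column) marginals
        u = weights_a / (kernel @ v)    # match source (row) marginals
    return u[:, None] * kernel * v[None, :]


# Two mixtures summarised by (weight, mean, std) per component. The number of
# samples behind each mixture never enters: the plan is only 2 x 3.
src_w, src_mu, src_sd = np.array([0.5, 0.5]), np.array([0.0, 2.0]), np.array([1.0, 1.0])
tgt_w, tgt_mu, tgt_sd = (
    np.array([0.25, 0.5, 0.25]),
    np.array([0.0, 1.0, 2.0]),
    np.array([1.0, 1.0, 1.0]),
)

cost = component_cost_matrix(src_mu, src_sd, tgt_mu, tgt_sd)
plan = sinkhorn_plan(src_w, tgt_w, cost)
print(plan.round(3))
```

The plan has one entry per pair of components, so its size is set by the number of mixture components rather than by how many samples were used to fit them.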

Training Infrastructure · Evaluation and Benchmarking · Optimal Transport · Single-Cell RNA Sequencing · Biconvex Optimization · Optimal Mixture Transport (OMT) · Exponential Family Distributions

Related guides (2)

Training Infrastructure · Topic guide

Training Infrastructure: The Compute Arms Race Powering Modern AI

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

4 · arXiv · cs.LG · 24d ago

Probabilistic Smoothing with Ratio-Monotone Transforms for Global Optimization

This paper proposes a generalized probabilistic smoothing framework for global optimization that replaces Gaussian kernels with flexible symmetric unimodal kernels combined with monotonic ratio-based transformations. The authors prove that the smoothed objective preserves the global maximizer and that stationary points concentrate near the true optimum under large amplification, without requiring a decreasing smoothing schedule. Explicit complexity bounds for stochastic gradient ascent are derived, and a leave-one-out baseline is shown to provably reduce variance. Experiments on high-dimensional benchmarks and black-box adversarial attacks demonstrate improved robustness over existing methods.
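
The variance-reduction idea behind a leave-one-out baseline can be sketched on a toy problem. This is an illustration only: it assumes a plain Gaussian smoothing kernel (the very choice the paper generalizes) and omits the ratio-monotone transform. The objective, the function names, and the sample sizes are hypothetical.

```python
import numpy as np


def objective(x):
    """Toy black-box objective with a large constant offset (hypothetical)."""
    return 10.0 - (x - 1.0) ** 2


def smoothed_grad(mu, sigma, n_samples, rng, leave_one_out):
    """Score-function estimate of d/dmu E[f(x)], with x ~ N(mu, sigma^2)."""
    x = mu + sigma * rng.standard_normal(n_samples)
    f = objective(x)
    if leave_one_out:
        # Baseline for sample i is the mean of f over the other samples, so it
        # is independent of x_i and leaves the estimator unbiased.
        baseline = (f.sum() - f) / (n_samples - 1)
    else:
        baseline = 0.0
    score = (x - mu) / sigma**2
    return np.mean(score * (f - baseline))


rng = np.random.default_rng(0)
plain = np.array([smoothed_grad(0.0, 1.0, 16, rng, False) for _ in range(2000)])
loo = np.array([smoothed_grad(0.0, 1.0, 16, rng, True) for _ in range(2000)])

# Both should average near the exact gradient of the smoothed objective at
# mu = 0, which is 2; the leave-one-out variant should vary less across runs.
print("plain: mean %.2f, var %.2f" % (plain.mean(), plain.var()))
print("loo:   mean %.2f, var %.2f" % (loo.mean(), loo.var()))
```

Because each sample's baseline is computed without that sample, subtracting it removes the shared offset in the objective values without biasing the gradient estimate.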

AI Safety Research · stochastic gradient ascent · Probabilistic Smoothing with Ratio-Monotone Transforms · Gaussian kernel smoothing

6 · arXiv · cs.LG · 26d ago

Hamiltonian Probability Gradient Flow Analysis of the Muon Optimizer

This paper develops a rigorous theoretical framework for the Muon optimizer by interpreting its regularized orthogonalization map as the gradient of a Fenchel-dual smoothing of the nuclear norm, identifying Muon updates as mirror/prox steps with momentum as dual coordinates. The authors lift this structure to probability measures over matrix-valued parameters, deriving a mean-field phase-space equation that constitutes a damped Hamiltonian probability dynamics with monotonically decreasing Hamiltonian energy. Exponential convergence rates are established under gradient-dominance and curvature assumptions, and propagation-of-chaos guarantees are provided for the interacting particle system. The framework extends to transformer mixture-of-experts architectures via blockwise Muon probability flows.

Training Infrastructure · Frontier Model Releases · Fenchel duality · mirror descent · Mixture of Experts

5 · Hugging Face Blog · 1mo ago

Introducing Optimum: The Optimization Toolkit for Transformers at Scale

Hugging Face announced Optimum, an optimization toolkit designed to accelerate Transformers models on various hardware backends. The toolkit aims to bridge the gap between Transformers model development and hardware-specific optimizations from partners. It provides a unified interface for quantization, pruning, and hardware-accelerated inference across different accelerators.

Inference Economics · Enterprise Deployment Patterns · Transformers · Optimum · Hugging Face

7 · arXiv · cs.LG · 26d ago

Complete-muE: Optimal Hyperparameter Transfer and Scaling for MoE Models

Complete-muE is a framework for transferring hyperparameters across dense FFN and Mixture-of-Experts (MoE) transformer architectures. It addresses a limitation of existing tools such as μP and SDE-based learning-rate scaling, which cannot handle simultaneous changes in architecture and tokens per expert. It uses a two-bridge system: Bridge I maps dense FFN to Dense MoE via active-width μP with a normalized router scale, and Bridge II maps Dense MoE to sparse MoE via activated-expert scaling with a first-order SDE correction. The practical outcome is a 'tune dense once, transfer to all' recipe that enables near-optimal hyperparameter reuse across MoE configurations without costly re-tuning. Experiments on language-model and diffusion-model pretraining confirm stable hyperparameter optima across architectures and parameter counts.

Training Infrastructure · Frontier Model Releases · Transformers · Mixture of Experts · SDE (Stochastic Differential Equation LR scaling)

5 · arXiv · cs.LG · 3d ago

Kolmogorov Regression lifts diffusion policies to Cameron-Martin space for robust long-horizon control

Researchers introduce a backward Kolmogorov equation framework that reformulates diffusion policy training as a deterministic boundary-value PDE problem in Cameron-Martin space, replacing stochastic score matching. The approach uses a precision-weighted Cameron-Martin loss and a Kolmogorov residual as an inference-time failure detector, yielding convergence guarantees tied to kernel effective rank rather than action dimension. Validation on the PushT manipulation benchmark shows a 17% improvement in episode reward and a 67.6% reduction in inter-step drift. A 6-station manufacturing scheduling task shows 28.4% lower RMSE than LSTM baselines and a 96% reduction in deadlock events via Hamilton-Jacobi reachability certification.

Agent and Tool Ecosystem · Hamilton-Jacobi reachability · Kolmogorov Regression for Robust Diffusion Policies · PushT

6 · Qwen Research · 1mo ago

Global-batch Load Balancing for MoE LLM Training from Qwen

Qwen Research introduces a global-batch load balancing technique for Mixture-of-Experts (MoE) LLM training, claiming it is nearly a 'free lunch' improvement. The method addresses expert load imbalance across training batches, a known efficiency and quality bottleneck in MoE architectures. The approach targets the router and expert activation dynamics in transformer-based MoE layers.
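
The summary does not spell out the loss, so the following is only a sketch of how micro-batch and global-batch balancing can differ. It assumes a Switch-Transformer-style auxiliary loss and top-1 routing, and it pools both the routing fractions and the router probabilities over the global batch, which is a simplification. The function name and the toy routing data are hypothetical.

```python
import numpy as np


def load_balance_loss(router_probs, expert_ids, n_experts):
    """Switch-style auxiliary loss: n_experts * sum_i f_i * p_i.

    f_i is the fraction of tokens routed to expert i (top-1 routing assumed)
    and p_i is the mean router probability assigned to expert i.
    """
    frac_tokens = np.bincount(expert_ids, minlength=n_experts) / len(expert_ids)
    mean_prob = router_probs.mean(axis=0)
    return n_experts * float(frac_tokens @ mean_prob)


# Two micro-batches from different (hypothetical) domains, two experts.
probs_a = np.tile([0.9, 0.1], (4, 1))  # domain A prefers expert 0
probs_b = np.tile([0.1, 0.9], (4, 1))  # domain B prefers expert 1
ids_a, ids_b = probs_a.argmax(axis=1), probs_b.argmax(axis=1)

# Micro-batch balancing: each micro-batch is penalised on its own.
micro_loss = np.mean(
    [load_balance_loss(probs_a, ids_a, 2), load_balance_loss(probs_b, ids_b, 2)]
)

# Global-batch balancing: statistics are pooled before computing the loss.
global_loss = load_balance_loss(
    np.vstack([probs_a, probs_b]), np.concatenate([ids_a, ids_b]), 2
)

print(micro_loss, global_loss)
```

In this toy case the experts are evenly loaded across the pooled batch, so the pooled loss sits at its balanced value, while the per-micro-batch loss penalizes each micro-batch for using a single expert.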

Training Infrastructure · Frontier Model Releases · Global-batch Load Balancing · Alibaba Qwen

4 · Hugging Face Blog · 1mo ago

Optimizing Stable Diffusion for Intel CPUs with NNCF and Hugging Face Optimum

This Hugging Face blog post details techniques for optimizing Stable Diffusion inference on Intel CPUs using Neural Network Compression Framework (NNCF) and the Optimum library. The workflow covers quantization and other compression methods to reduce latency and memory footprint on CPU hardware. This is relevant to the inference-economics and enterprise-deployment threads as it addresses running diffusion models without dedicated GPU hardware.

Inference Economics · Enterprise Deployment Patterns · Stable Diffusion 3 · Hugging Face · Hugging Face Optimum

4 · arXiv · cs.AI · 11d ago

PTL-Diffusion: Diffusion framework with periodic terminal laws for manifold-aware generation

PTL-Diffusion is a new diffusion modeling framework that replaces the standard single Gaussian terminal distribution with a periodic family of Gaussian terminal laws, embedding phase structure directly into the forward noising dynamics rather than only in the denoising network. The authors derive closed-form forward marginals and reverse posteriors for a periodically forced Ornstein-Uhlenbeck process, enabling standard noise-prediction training. Experiments on torus, cylinder, and face datasets show improvements in manifold-level distributional matching over DDPM baselines. The work is a proof-of-concept motivating structured terminal reference laws as a direction for geometry-aware generative modeling.

Evaluation and Benchmarking · Denoising Diffusion Probabilistic Models · Olivetti Faces Dataset · PTL-Diffusion