4 · arXiv cs.LG (Machine Learning) · Jun 5, 2026

Large deviation analysis shows most interpolating classifiers share the same generalization performance

A new arXiv preprint establishes a large deviation principle characterizing the generalization performance of interpolating linear classifiers in the overparameterized regime (n/d → α, small α). The key result is a concentration phenomenon: all but an exponentially small fraction of interpolators achieve approximately the same generalization error, determined by a unique rate-function maximizer. Empirically, gradient descent and a natural linear program both outperform this typical interpolator. Together, these results offer theoretical grounding for benign overfitting in overparameterized models.

How abundant are good interpolators?

Related events (8)

5 · arXiv · cs.AI · Jun 29, 2026

Theoretical analysis of generalization scaling laws in quadratic two-layer neural networks

A new arXiv preprint derives explicit characterizations of generalization error as a joint function of model width, sample count, and regularization in a quadratic two-layer network with structured data. The analysis reveals a phase diagram with distinct scaling regimes governed by data-dependent power laws tied to the spectral structure of the target function. The work extends scaling law theory beyond fixed-feature or infinite-width regimes by operating in a finite-sample, feature-learning setting, and characterizes interpolation threshold transitions.

Evaluation and Benchmarking · How Width and Data Shape Generalization Scaling Laws in Quadratic Neural Networks

5 · arXiv · cs.LG · Jul 9, 2026

Unified framework for generalization and sketching across variable-size inputs via random sampling maps

A new arXiv preprint introduces a theoretical framework for understanding how ML models trained on small inputs generalize to larger, unseen input sizes — covering sequences, graphs, point clouds, and tensors. The approach uses random sampling maps (generalizing sampling with replacement, random binning, and species sampling) to compare inputs of different sizes and derive explicit generalization and sketching rates. The framework applies to transformers, graph neural networks, and moment polynomials, among other architectures. This is a foundational theoretical contribution addressing out-of-distribution generalization across input dimensionality.

Evaluation and Benchmarking · permutation-invariant transformers · Any-Dimensional Learning by Sampling · Graph Neural Networks

3 · arXiv · cs.LG · 6d ago

Complexity bounds for learning projected gradient descent solver iterates via k-neighborhood data augmentation

A new arXiv preprint derives Rademacher complexity-based generalization bounds for learning to predict intermediate iterates of projected gradient descent solvers applied to box-constrained quadratic programs. The authors propose a k-neighborhood data collection strategy that augments converged-solution datasets with intermediate solver states, increasing training data without additional solver runs. The work connects to GLENS, a data-efficient global search method, and frames the approach within the Dynamic Data-Driven Application Systems (DDDAS) paradigm for tightening data-model-optimization loops.

GLENS · Complexity Bounds and Approaches to Learning Projected Gradient Descent Solver Iterates · Rademacher Complexity
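
To make the data-collection idea in the summary above concrete, here is a minimal sketch of projected gradient descent on a small box-constrained quadratic program that records every iterate. The problem data (`Q`, `c`), the step size, and the reading of "k-neighborhood" as "the converged solution plus the k iterates just before it" are illustrative assumptions, not details taken from the preprint.

```python
def project_box(x, lo, hi):
    """Clip each coordinate of x into the interval [lo, hi]."""
    return [min(max(xi, lo), hi) for xi in x]


def pgd_box_qp(Q, c, lo, hi, x0, step, max_iters=1000, tol=1e-12):
    """Minimize 0.5 * x^T Q x + c^T x subject to lo <= x_i <= hi.

    Returns the full list of iterates, ending with the final one.
    """
    x = project_box(x0, lo, hi)
    iterates = [x]
    n = len(x)
    for _ in range(max_iters):
        # Gradient of the quadratic objective: Q x + c.
        grad = [sum(Q[i][j] * x[j] for j in range(n)) + c[i] for i in range(n)]
        # Gradient step followed by projection onto the box.
        x_new = project_box([xi - step * gi for xi, gi in zip(x, grad)], lo, hi)
        moved = max(abs(a - b) for a, b in zip(x_new, x))
        x = x_new
        iterates.append(x)
        if moved < tol:
            break
    return iterates


# Hypothetical 2-D problem instance, chosen only for illustration.
Q = [[2.0, 0.0], [0.0, 1.0]]
c = [-4.0, 1.0]
iterates = pgd_box_qp(Q, c, lo=0.0, hi=1.0, x0=[0.5, 0.5], step=0.1)
solution = iterates[-1]

# Assumed "k-neighborhood": the solution plus the k iterates preceding it.
# All of these come from the same solver run, so no extra solves are needed.
k = 3
neighborhood = iterates[-(k + 1):]
print(solution, len(neighborhood))
```

One solver run here yields `k + 1` training targets instead of one, which is the sense in which intermediate states enlarge the dataset without additional solver runs.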

4 · arXiv · cs.AI · 5d ago

Analysis challenges 'free lunch' narrative for Hyperball-style optimizers in deep network training

A new arXiv preprint investigates why Hyperball-style optimizers (which fix matrix parameter norms and normalize updates) outperform alternatives in large-scale training, finding that their advantage stems primarily from effective learning-rate schedule dynamics rather than an intrinsically superior update direction. The authors introduce an angular effective learning rate framework, decompose updates into radial and tangential components, and show that radial updates have limited direct effect on angular displacement. Experiments with MuonH and MuonWD reveal that careful learning-rate scheduling remains essential even under Hyperball constraints, contradicting the implicit assumption that norm-fixing eliminates scheduling sensitivity.

Training Infrastructure · Hyperball May Not Be a Free Lunch · MuonH · MuonWD
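
The radial/tangential decomposition mentioned in the summary above can be illustrated with plain vector geometry. The sketch below is not the preprint's framework: the helper names and the toy vectors `w` and `u` are hypothetical, and it only shows that the component of an update parallel to the weights leaves their direction unchanged, while the orthogonal component rotates them.

```python
import math


def decompose_update(w, u):
    """Split update u into a part parallel to w (radial) and a part orthogonal to w (tangential)."""
    w_norm_sq = sum(x * x for x in w)
    coeff = sum(a * b for a, b in zip(w, u)) / w_norm_sq
    radial = [coeff * x for x in w]
    tangential = [b - r for b, r in zip(u, radial)]
    return radial, tangential


def angle_between(a, b):
    """Angle in radians between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    # Clamp to guard against rounding pushing the cosine outside [-1, 1].
    cos = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.acos(cos)


# Toy weight vector and update, chosen only for illustration.
w = [3.0, 4.0]
u = [0.5, -0.2]
radial, tangential = decompose_update(w, u)

# Apply each component on its own and measure how far the direction of w moves.
angle_radial = angle_between(w, [a + b for a, b in zip(w, radial)])
angle_tangential = angle_between(w, [a + b for a, b in zip(w, tangential)])
print(angle_radial, angle_tangential)
```

In this toy case the radial-only step only rescales `w`, so its angular displacement is zero up to rounding, whereas the tangential-only step produces essentially all of the rotation.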

4 · arXiv · cs.LG · 6d ago

Mixed-sign spectral regularization via negative-shifted gradient descent for overparameterized linear regression

A new arXiv preprint introduces negative-shifted gradient descent as a method for mixed-sign spectral regularization in overparameterized linear regression, escaping structural limitations of the negative-ridge endpoint. The authors identify a Marchenko-Pastur barrier in a Gaussian spike-plus-flat model and prove that early-stopped paths improve on all admissible endpoints by a polynomial factor in risk under explicit conditions. The main theorem handles general high-effective-rank tails and recovers all head scales simultaneously, with technical control via localized Duhamel integrals and a finite-grid hold-out inequality for validation-selected algorithms.

Evaluation and Benchmarking · Beyond Negative-Ridge Endpoints: Mixed-Sign Spectral Regularization via Negative-Shifted Gradient Descent · Marchenko-Pastur distribution

4 · arXiv · cs.LG · Jun 8, 2026

Second-order path kernel interpolation formulas extend Domingos' gradient-descent characterization

This paper extends Pedro Domingos' 2020 first-order path-kernel interpolation formula for gradient-descent-trained models to second-order forms. The authors derive curvature-weighted correction terms for standard SGD, an additional sampling-induced component coupling prediction curvature with mini-batch gradient noise covariance, and an extension to SGD with momentum. A concentration estimate for the terminal prediction is also established, quantifying fluctuation around the expected second-order representation.

Pedro Domingos · Second-Order Path Kernel Interpolation Formulas in Machine Learning

5 · arXiv · cs.LG · Jun 19, 2026

Optimal deterministic multicalibration achieved, resolving open problem on randomization necessity

A new arXiv preprint resolves an open problem in multicalibration theory by constructing a minimax-optimal multicalibration algorithm that outputs a deterministic predictor, achieving the same O(ε⁻³) sample complexity previously only attainable by randomized predictors. The result extends to outcome indistinguishability, deterministic omnipredictors, and panpredictors with optimal sample complexity, resolving multiple open problems from recent works. Multicalibration is a fairness and reliability property requiring calibration to hold across reweighted subgroups, making this relevant to trustworthy ML research.

Evaluation and Benchmarking · AI Safety Research · outcome indistinguishability · Optimal Deterministic Multicalibration and Omniprediction · multicalibration

7 · arXiv · cs.AI · May 21, 2026

Quantifying Hyperparameter Transfer: Embedding Layer Learning Rate as Key Driver of μP Benefits

This paper develops a three-metric framework to quantify hyperparameter transfer quality across model scales, targeting the problem of extrapolating optimal hyperparameters from small to large LLMs. The central empirical finding is that the well-known advantage of Maximal Update Parameterization (μP) over standard parameterization (SP) with AdamW largely reduces to a single factor: the embedding layer learning rate. In SP, the embedding layer acts as a training bottleneck causing instabilities; scaling its learning rate by model width to match μP substantially stabilizes training and improves transfer. The paper also characterizes how weight decay affects scaling law fit quality versus extrapolation robustness in opposite directions.

Training Infrastructure · Frontier Model Releases · hyperparameter transfer · embedding layer learning rate · AdamW
