Large deviation analysis shows most interpolating classifiers share the same generalization performance
A new arXiv preprint establishes a large deviation principle characterizing the generalization performance of interpolating linear classifiers in the overparameterized regime (n/d → α, small α). The key result is a concentration phenomenon: all but an exponentially small fraction of interpolators achieve approximately the same generalization error, determined by a unique rate-function maximizer. Empirically, gradient descent and a natural linear program both outperform this typical interpolator, providing theoretical grounding for benign overfitting in overparameterized models.
Related events (8)
Second-order path kernel interpolation formulas extend Domingos' gradient-descent characterization
This paper extends Pedro Domingos' 2020 first-order path-kernel interpolation formula for gradient-descent-trained models to second-order forms. The authors derive curvature-weighted correction terms for standard SGD, an additional sampling-induced component coupling prediction curvature with mini-batch gradient noise covariance, and an extension to SGD with momentum. A concentration estimate for the terminal prediction is also established, quantifying fluctuation around the expected second-order representation.
Quantifying Hyperparameter Transfer: Embedding Layer Learning Rate as Key Driver of μP Benefits
This paper develops a three-metric framework to quantify hyperparameter transfer quality across model scales, targeting the problem of extrapolating optimal hyperparameters from small to large LLMs. The central empirical finding is that the well-known advantage of Maximal Update Parameterization (μP) over standard parameterization (SP) with AdamW largely reduces to a single factor: the embedding layer learning rate. In SP, the embedding layer acts as a training bottleneck causing instabilities; scaling its learning rate by model width to match μP substantially stabilizes training and improves transfer. The paper also characterizes how weight decay affects scaling law fit quality versus extrapolation robustness in opposite directions.
Extrapolative Weight Averaging Reveals Correctness-Efficiency Frontiers in Code RL
This paper investigates whether extrapolative weight averaging of RL-trained checkpoints can extend Pareto frontiers between competing objectives (correctness vs. computational efficiency) without additional training. Starting from a shared initialization, the authors train checkpoints under nested unit-test coverage regimes for competitive programming tasks, revealing a correctness-efficiency frontier where higher-coverage rewards reduce optimization failures but increase correctness failures. Extrapolation beyond trained endpoints produces complementary policies that, when ensembled, improve pass@250 on LCB/hard by 3.3% over the best single checkpoint at matched sample budget. Results hold across 7B and 32B model scales and three inference settings: pure reasoning, tool use, and agentic coding.
Analysis of on-policy distillation reveals sparse, geometrically structured parameter updates
A new arXiv paper analyzes on-policy distillation (OPD) — a post-training method combining on-policy student trajectories with dense teacher supervision — across language and vision-language model pairs. The authors find that OPD updates are coordinate-sparse and distributed across layers (FFN-heavy), and that training only the discovered sparse subnetwork recovers near-full performance. Geometrically, updates are numerically full-rank but spectrally concentrated, falling disproportionately on near-zero weight coordinates, suggesting OPD retains distinct geometric signatures rather than behaving like ordinary dense parameter rewriting.
The Matching Principle: A Geometric Theory Unifying Robustness, Domain Adaptation, and Alignment via Nuisance Covariance
This paper proposes the 'matching principle': a unified geometric framework arguing that robustness methods (CORAL, IRM, adversarial training, augmentation, metric learning, Jacobian penalties, alignment constraints) are all estimators of the same object—the covariance of label-preserving deployment nuisance—and that regularizing the encoder Jacobian along this covariance's range is the core statistical problem. The authors prove closed-form optimality results in a linear-Gaussian model, introduce the Trajectory Deviation Index (TDI) as a label-free embedding sensitivity probe, and validate predictions across 13 pre-registered experimental blocks including Qwen2.5-7B. At 7B scale, matched style-PMH improves selective honesty while standard DPO degrades Style TDI, connecting the theory to alignment safety.
Probabilistic Smoothing with Ratio-Monotone Transforms for Global Optimization
This paper proposes a generalized probabilistic smoothing framework for global optimization that replaces Gaussian kernels with flexible symmetric unimodal kernels combined with monotonic ratio-based transformations. The authors prove that the smoothed objective preserves the global maximizer and that stationary points concentrate near the true optimum under large amplification, without requiring a decreasing smoothing schedule. Explicit complexity bounds for stochastic gradient ascent are derived, and a leave-one-out baseline is shown to provably reduce variance. Experiments on high-dimensional benchmarks and black-box adversarial attacks demonstrate improved robustness over existing methods.
How AI Training Scales: Gradient Noise Scale Predicts Batch Parallelizability
OpenAI researchers report that the gradient noise scale — a statistical metric measuring gradient variance relative to mean — reliably predicts the optimal batch size and degree of parallelizability across a wide range of neural network training tasks. The finding suggests that more complex tasks with noisier gradients can benefit from increasingly large batch sizes, removing a potential ceiling on scaling. The work frames training dynamics as a systematic, measurable process rather than empirical art.
Hyperfitting Explained: Terminal Geometric Expansion in Final Transformer Layers Drives Diversity Gains
This paper investigates the 'hyperfitting' phenomenon—where fine-tuning LLMs to near-zero loss on small datasets improves open-ended generation and reduces repetition—and demonstrates it is mechanistically distinct from temperature scaling. Entropy-matched control experiments falsify both the temperature-equivalence and static vocabulary reweighting hypotheses, instead localizing the effect to a 'Terminal Expansion' in the final transformer block where feature-space dimensionality expands by ~80.8 dimensions, enabling promotion of deep-tail tokens via context-dependent rank reordering. The authors introduce Late-Stage LoRA, a targeted fine-tuning strategy updating only the final 5 layers, achieving robust generation with minimal parameter updates.