5 · arXiv cs.AI (Artificial Intelligence) · 18h ago

Theoretical analysis of generalization scaling laws in quadratic two-layer neural networks

A new arXiv preprint derives explicit characterizations of generalization error as a joint function of model width, sample count, and regularization in a quadratic two-layer network with structured data. The analysis reveals a phase diagram with distinct scaling regimes governed by data-dependent power laws tied to the spectral structure of the target function. The work extends scaling law theory beyond fixed-feature or infinite-width regimes by operating in a finite-sample, feature-learning setting, and characterizes interpolation threshold transitions.
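As a rough illustration of the three quantities the analysis varies (width, sample count, regularization), here is a minimal sketch of a two-layer network with a quadratic activation and a regularized squared loss. The summary does not give the preprint's exact parameterization, so the form f(x) = sum_j a_j (w_j . x)^2, the squared-norm penalty, and the names `quadratic_net`, `regularized_mse`, and `lam` are all assumptions made for this example.

```python
def quadratic_net(x, W, a):
    """Assumed model: sum_j a[j] * (W[j] . x)**2, where len(W) is the width."""
    return sum(
        a_j * sum(w * xi for w, xi in zip(w_j, x)) ** 2
        for a_j, w_j in zip(a, W)
    )


def regularized_mse(X, y, W, a, lam):
    """Mean squared error over the samples plus lam * ||W||_F^2 (assumed penalty)."""
    n = len(X)  # sample count
    mse = sum((quadratic_net(x, W, a) - t) ** 2 for x, t in zip(X, y)) / n
    penalty = lam * sum(w * w for w_j in W for w in w_j)
    return mse + penalty


# Toy values chosen by hand, not taken from the preprint.
W = [[1.0, 0.0], [0.0, 2.0]]   # width 2, input dimension 2
a = [1.0, 0.5]
x = [3.0, 1.0]
prediction = quadratic_net(x, W, a)                  # 1*(3)^2 + 0.5*(2)^2 = 11.0
loss = regularized_mse([x], [11.0], W, a, lam=0.1)   # zero fit error, penalty only
```

In this toy setup the width is the number of rows in `W`, the sample count is `len(X)`, and `lam` plays the role of the regularization strength.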

Evaluation and Benchmarking · How Width and Data Shape Generalization Scaling Laws in Quadratic Neural Networks

Related guides (1)

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

4 · arXiv · cs.LG · 24d ago

Large deviation analysis shows most interpolating classifiers share the same generalization performance

A new arXiv preprint establishes a large deviation principle characterizing the generalization performance of interpolating linear classifiers in the overparameterized regime (n/d → α, small α). The key result is a concentration phenomenon: all but an exponentially small fraction of interpolators achieve approximately the same generalization error, determined by a unique rate-function maximizer. Empirically, gradient descent and a natural linear program both outperform this typical interpolator, providing theoretical grounding for benign overfitting in overparameterized models.

How abundant are good interpolators?

9 · OpenAI Blog · 1mo ago

Scaling Laws for Neural Language Models

OpenAI published foundational research establishing empirical scaling laws for neural language models, showing that model performance scales predictably with compute, data, and parameters. The work demonstrated power-law relationships between these factors and loss, providing a principled framework for allocating training resources. This paper became a cornerstone of modern large language model development strategy.
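A power-law relationship of this kind is a straight line on log-log axes, so its exponent can be estimated with ordinary least squares. The sketch below assumes the single-variable form loss = a * N^(-b). The function name `fit_power_law` and the synthetic numbers are made up for illustration and are not values from the paper.

```python
import math


def fit_power_law(xs, ys):
    """Fit y = a * x**(-b) by least squares on (log x, log y)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((u - mean_x) ** 2 for u in lx)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # log y = log a - b * log x, so a = exp(intercept) and b = -slope.
    return math.exp(intercept), -slope


# Synthetic, noise-free data generated from a known power law.
param_counts = [1e6, 1e7, 1e8, 1e9]
losses = [5.0 * n ** -0.08 for n in param_counts]
a_hat, b_hat = fit_power_law(param_counts, losses)
```

On noise-free data the fit recovers the generating constants. With real training runs the points scatter around the line, and the fitted exponent is only as reliable as the range of scales covered.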

Training Infrastructure · Frontier Model Releases · Jared Kaplan · Sam McCandlish · OpenAI · +3 more

7 · arXiv · cs.LG · 1mo ago

Shannon Scaling Law: A Noisy-Channel Framework for LLM Capacity and Non-Monotonic Training Phenomena

Researchers propose the Shannon Scaling Law, a theoretical framework that models LLM training as information transmission over a noisy channel using the Shannon-Hartley theorem. By mapping model parameters to channel bandwidth and training tokens to signal power, the framework introduces a fundamental SNR-based capacity limit that explains non-monotonic phenomena like catastrophic overtraining and quantization-induced degradation that classical power-law scaling laws cannot capture. Validated on Pythia and OLMo2 under Gaussian noise, quantization, and fine-tuning perturbations, the law achieves strong R² scores and successfully extrapolates from 6.9B to 12B parameter models trained on up to 307B tokens. The framework outperforms both classical and perturbation-aware scaling laws, predicting U-shaped performance degradation when SNR is insufficient.
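For reference, the classical Shannon-Hartley theorem that the framework builds on gives channel capacity as C = B * log2(1 + SNR). The sketch below implements only that textbook formula. It is not the paper's fitted scaling law, whose exact parameterization is not given in this summary, and the function name `channel_capacity` is an assumption for this example.

```python
import math


def channel_capacity(bandwidth, snr):
    """Shannon-Hartley capacity: bandwidth * log2(1 + signal-to-noise ratio)."""
    return bandwidth * math.log2(1.0 + snr)


# Capacity grows linearly in bandwidth but only logarithmically in SNR.
base = channel_capacity(1.0, 1.0)            # 1.0 bit per second per hertz
double_bandwidth = channel_capacity(2.0, 1.0)
higher_snr = channel_capacity(1.0, 3.0)      # tripling the SNR gives the same gain here
```

Under the mapping described above (parameters as bandwidth, training tokens as signal power), the logarithmic dependence on SNR is what makes capacity sensitive to added noise such as quantization.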

Training Infrastructure · Evaluation and Benchmarking · Shannon-Hartley Theorem · Shannon Scaling Law · Pythia · +5 more

6 · arXiv · cs.LG · 26d ago

Rosetta Neurons follow sublinear power-law scaling with model size, becoming more monosemantic at scale

A new arXiv paper investigates how neuron populations evolve with scale in both language models (up to 30B parameters) and vision models (up to 5B parameters), focusing on 'Rosetta Neurons' — neurons with similar activation patterns across independently trained models. The authors find Rosetta Neurons grow in absolute count but shrink as a fraction of total neurons, and exhibit a 'Neuron Polarization Effect' where they become increasingly monosemantic while non-Rosetta neurons remain less selective. An analytical model explains the sublinear power-law scaling, and the paper demonstrates practical utility via a targeted data-filtering case study for continued pretraining. The results extend scaling laws to neuron-level interpretability structure, linking model size to systematic changes in universality and specialization.

Evaluation and Benchmarking · AI Safety Research · Rosetta Neurons · Neuron Populations Exhibit Divergent Selectivity with Scale · Dravid et al., 2023

4 · arXiv · cs.LG · 19d ago

Conservation laws from data symmetry in neural network gradient-flow training

A new arXiv preprint investigates whether intrinsic symmetries in training data produce conserved quantities during gradient-flow training of neural networks. The authors prove that for analytic, non-polynomial loss functions, data symmetries generically do not induce additional integrals of motion, but for MSE loss, data augmentation can yield extra conserved quantities. They introduce a framework of 'tensorizable networks'—architectures including linear, polynomial, and Lightning Attention networks—where parameter and input dependence can be separated via an intermediate representation.

Training Infrastructure · Lightning Attention · Conservation Laws from Data Symmetry in Neural Networks

6 · OpenAI Blog · 1mo ago

How AI Training Scales: Gradient Noise Scale Predicts Batch Parallelizability

OpenAI researchers report that the gradient noise scale — a statistical metric measuring gradient variance relative to mean — reliably predicts the optimal batch size and degree of parallelizability across a wide range of neural network training tasks. The finding suggests that more complex tasks with noisier gradients can benefit from increasingly large batch sizes, removing a potential ceiling on scaling. The work frames training dynamics as a systematic, measurable process rather than empirical art.
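One way to make "gradient variance relative to mean" concrete is the ratio of the trace of the per-example gradient covariance to the squared norm of the mean gradient. The sketch below assumes that definition and estimates it naively from a small batch of per-example gradients. The function name `simple_noise_scale` and the toy gradients are hypothetical, and a practical estimator would also correct for the bias in the squared-norm term.

```python
def simple_noise_scale(per_example_grads):
    """Assumed definition: trace(covariance) / ||mean gradient||^2."""
    n = len(per_example_grads)
    dim = len(per_example_grads[0])
    mean = [sum(g[j] for g in per_example_grads) / n for j in range(dim)]
    # The trace of the covariance is the sum of the per-coordinate sample variances.
    trace_cov = sum(
        sum((g[j] - mean[j]) ** 2 for g in per_example_grads) / (n - 1)
        for j in range(dim)
    )
    squared_norm = sum(m * m for m in mean)
    return trace_cov / squared_norm


# Two toy per-example gradients: mean is [2, 0], variance along the first axis is 2.
toy_grads = [[1.0, 0.0], [3.0, 0.0]]
noise_scale = simple_noise_scale(toy_grads)  # 2 / 4 = 0.5
```

A larger value means individual gradients disagree more relative to their average, which is the situation where averaging over a bigger batch helps most.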

Training Infrastructure · Frontier Model Releases · large-batch training · OpenAI · gradient noise scale

4 · arXiv · cs.LG · 21d ago

Second-order path kernel interpolation formulas extend Domingos' gradient-descent characterization

This paper extends Pedro Domingos' 2020 first-order path-kernel interpolation formula for gradient-descent-trained models to second-order forms. The authors derive curvature-weighted correction terms for standard SGD, an additional sampling-induced component coupling prediction curvature with mini-batch gradient noise covariance, and an extension to SGD with momentum. A concentration estimate for the terminal prediction is also established, quantifying fluctuation around the expected second-order representation.

Pedro Domingos · Second-Order Path Kernel Interpolation Formulas in Machine Learning

5 · arXiv · cs.AI · 1mo ago

Survey: Approximation Theory for Neural Networks — Classical Results and New Directions Including KANs

This arXiv survey reviews four decades of universal approximation theory for feedforward neural networks, covering classical density results for single-hidden-layer networks and quantitative bounds relating approximation error to network size and target function smoothness. It gives particular emphasis to depth-width trade-offs and the parameter efficiency advantages of deeper architectures for structured function classes. The survey also covers recent theoretical developments on Kolmogorov-Arnold Networks (KANs) as an alternative architectural paradigm with emerging approximation-theoretic analysis.

Evaluation and Benchmarking · Feedforward Neural Networks · Universal Approximation Theorem · depth-width trade-offs · +2 more