5 · arXiv cs.AI (Artificial Intelligence) · 1mo ago

Survey: Approximation Theory for Neural Networks — Classical Results and New Directions Including KANs

This arXiv survey reviews four decades of universal approximation theory for feedforward neural networks, covering classical density results for single-hidden-layer networks and quantitative bounds that relate approximation error to network size and the smoothness of the target function. It gives particular emphasis to depth-width trade-offs and to the parameter-efficiency advantages of deeper architectures for structured function classes. The survey also covers recent theoretical developments on Kolmogorov-Arnold Networks (KANs), an alternative architectural paradigm whose approximation-theoretic analysis is still emerging.

Evaluation and Benchmarking · Feedforward Neural Networks · Universal Approximation Theorem · depth-width trade-offs · Sobolev Spaces · Kolmogorov-Arnold Networks (KANs)

Related guides (1)

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

4 · arXiv · cs.AI · 25d ago

Neuronal Stochastic Attention Circuit (NSAC) for Probabilistic Representation Learning

Researchers introduce NSAC, a biologically-inspired continuous-time attention architecture that models attention logits as solutions to an Ornstein-Uhlenbeck stochastic differential equation, drawing on C. elegans Neuronal Circuit Policy wiring to induce Gaussian distributions over attention weights. The architecture enables joint quantification of aleatoric and epistemic uncertainty via a two-term objective combining Gaussian negative log-likelihood with an epistemic-separation regularizer. Empirical evaluation spans irregular time-series function approximation, multivariate regression, long-range forecasting, Industry 4.0 tasks, and autonomous vehicle lane-keeping, showing competitive accuracy with well-calibrated uncertainty estimates.

AI Safety Research · Neuronal Stochastic Attention Circuit (NSAC) · Neuronal Circuit Policies (NCPs) · logistic-normal distribution · +3 more
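
As a rough illustration of the Ornstein-Uhlenbeck idea mentioned in the summary above, the sketch below samples logits from an OU process and passes them through a softmax. It is not the NSAC architecture: the parameter names (`theta`, `mu`, `sigma`), the function names, and the choice of the exact Gaussian transition as the update rule are all assumptions made for illustration.

```python
import numpy as np


def ou_step(x, theta, mu, sigma, dt, rng):
    """Advance an OU process dx = theta*(mu - x) dt + sigma dW by one step.

    Uses the exact Gaussian transition, so the step adds no discretization bias.
    """
    decay = np.exp(-theta * dt)
    std = sigma * np.sqrt((1.0 - decay**2) / (2.0 * theta))
    return mu + (x - mu) * decay + std * rng.standard_normal(x.shape)


def sample_attention_weights(n_keys, n_steps=200, theta=1.0, mu=0.0,
                             sigma=0.5, dt=0.05, seed=0):
    """Simulate OU logits for n_keys keys and return softmax weights.

    Hypothetical helper: all defaults are illustrative, not from the paper.
    """
    rng = np.random.default_rng(seed)
    logits = np.full(n_keys, mu, dtype=float)
    for _ in range(n_steps):
        logits = ou_step(logits, theta, mu, sigma, dt, rng)
    shifted = logits - logits.max()      # stabilize the softmax
    weights = np.exp(shifted)
    return weights / weights.sum()
```

Once the process has run long enough, each logit is approximately Gaussian with mean `mu` and variance `sigma**2 / (2 * theta)`, so the softmax output is a random point on the simplex rather than a fixed weight vector.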

4 · Hugging Face Blog · 1mo ago

Nyströmformer: Approximating Self-Attention in Linear Time and Memory via the Nyström Method

This Hugging Face blog post covers Nyströmformer, a transformer variant that approximates standard self-attention using the Nyström method to achieve linear time and memory complexity. The approach addresses the quadratic scaling bottleneck of standard attention, enabling processing of longer sequences at reduced computational cost. The post likely covers the model's integration into the Hugging Face ecosystem and its practical use cases.

Long Context Evolution · Inference Economics · Nyströmformer · Nyström method · Hugging Face · +1 more
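
To make the linear-complexity claim in the summary above concrete, here is a minimal NumPy sketch of a Nyström-style attention approximation. It assumes landmarks are computed as segment means of the queries and keys, that the sequence length is divisible by the number of landmarks `m`, and it uses `np.linalg.pinv` as a simple stand-in for whatever pseudoinverse routine a production implementation uses. The function names are hypothetical and this is not the Hugging Face implementation.

```python
import numpy as np


def softmax(x):
    """Row-wise softmax with max-subtraction for numerical stability."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def exact_attention(Q, K, V):
    """Standard softmax attention: quadratic in the sequence length n."""
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d)) @ V


def nystrom_attention(Q, K, V, m):
    """Nystrom-style approximation with m landmarks (segment means).

    Assumes the sequence length n is divisible by m.
    """
    n, d = Q.shape
    Q_l = Q.reshape(m, n // m, d).mean(axis=1)   # (m, d) landmark queries
    K_l = K.reshape(m, n // m, d).mean(axis=1)   # (m, d) landmark keys
    scale = 1.0 / np.sqrt(d)
    F = softmax(Q @ K_l.T * scale)               # (n, m)
    A = softmax(Q_l @ K_l.T * scale)             # (m, m)
    B = softmax(Q_l @ K.T * scale)               # (m, n)
    # Multiply right-to-left so no n x n matrix is ever formed.
    return F @ (np.linalg.pinv(A) @ (B @ V))
```

For a fixed `m`, every intermediate matrix has at most `n * m` entries, which is where the linear scaling in `n` comes from. With `m` equal to `n` each landmark is a single token and the approximation reduces to exact attention, which gives a convenient sanity check.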

6 · arXiv · cs.CL · 17d ago

Dynamic short convolutions yield 1.33–1.60× compute advantage over standard Transformers

A new arXiv preprint introduces dynamic short convolutions as an architectural primitive for Transformers, using input-dependent filters to combine locality bias with increased expressivity. Experiments across 150M–2B parameter language models show consistent perplexity improvements over standard Transformers and static convolution variants, with scaling-law fits indicating a 1.33× compute advantage when applied to key/query/value vectors and 1.60× when added after every linear layer. The technique also improves linear RNNs (Mamba-2, Gated DeltaNet) and mixture-of-experts architectures, with custom Triton kernels making training practical.

Training Infrastructure · Frontier Model Releases · Triton · Mamba · Gated DeltaNet-2 · +1 more

5 · arXiv · cs.LG · 19d ago

Tight Convergence Theory for Error Feedback Algorithms in Distributed Optimization

This paper provides tight convergence analyses for two major error-feedback algorithms—classic Error Feedback (EF) and Error Feedback 21 (EF21)—used to mitigate communication bottlenecks in distributed learning. The authors identify optimal step-size choices and construct tailored Lyapunov functions for each method, yielding guarantees that hold independently of the number of agents and recover the best known single-agent bounds. The work clarifies the relative performance of these gradient compression variants, which has remained poorly understood despite widespread use.

Training Infrastructure · Inference Economics · Error Feedback 21 (EF21) · Error Feedback (EF) · Lyapunov function · +2 more
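
For readers unfamiliar with error feedback, the sketch below shows the classic EF update in its simplest single-worker form, paired with a top-k compressor. The compressor choice, the step size, and the function names are assumptions for illustration. The paper's step-size rules, its Lyapunov analysis, and the EF21 variant are not reproduced here.

```python
import numpy as np


def top_k(v, k):
    """Keep the k largest-magnitude entries of v and zero out the rest."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out


def ef_step(x, error, grad, lr, k):
    """One single-worker step of classic error feedback with top-k compression."""
    p = lr * grad + error        # proposed update plus carried-over residual
    c = top_k(p, k)              # compressed update that would be communicated
    return x - c, p - c          # new iterate, new residual


def run_ef(grad_fn, x0, lr=0.1, k=1, steps=100):
    """Run error feedback for a fixed number of steps (illustrative only)."""
    x = np.asarray(x0, dtype=float)
    error = np.zeros_like(x)
    for _ in range(steps):
        x, error = ef_step(x, error, grad_fn(x), lr, k)
    return x, error
```

The residual `p - c` is what compression dropped in this step. Adding it back into the next proposal means no part of the gradient signal is discarded permanently, only delayed. In a multi-worker setting each worker would keep its own residual and a server would aggregate the compressed updates, which this sketch omits.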

4 · arXiv · cs.LG · 18d ago

Expressivity Limits of Congruence-Based Architectures for Neural Networks on Positive-Definite Matrices

This paper analyzes neural network architectures designed to classify symmetric positive-definite (SPD) matrices, focusing on congruence-like layers as used in SPDNet. The authors prove that imposing semi-orthogonality constraints on weight matrices limits expressivity, causing deep architectures to collapse to single-hidden-layer equivalents due to spectral diversity loss—a consequence of Poincaré's separation theorem. The work also compares Riemannian classifiers for compatibility with congruence-based feature maps.

Evaluation and Benchmarking · congruence-based layers · SPDNet · Poincaré separation theorem · +2 more

6 · OpenAI Blog · 1mo ago

Understanding Neural Networks Through Sparse Circuits

OpenAI has published work on mechanistic interpretability using a sparse model approach aimed at understanding how neural networks reason internally. The research seeks to make AI systems more transparent by identifying sparse circuits within neural networks. This work is positioned as supporting safer and more reliable AI behavior through improved interpretability.

Evaluation and Benchmarking · AI Safety Research · Sparse Circuits · mechanistic interpretability · OpenAI

6 · arXiv · cs.LG · 11d ago

Express: Efficient causal attention approximation with formal guarantees and FlashAttention 2 speedups

A new tool called Express converts non-causal attention approximations into causal ones with matching theoretical guarantees, achieving log^(3/2)(n)/s approximation error with O(s) memory. Combined with the Thinformer approximation and an I/O-aware Triton implementation, it demonstrates substantial speedups over FlashAttention 2. The work targets four practical bottlenecks: long-context prefill, KV cache compression, and both memory- and compute-constrained long-form decoding.

Training Infrastructure · Long Context Evolution · Triton · Thinformer · FlashAttention 2 · +2 more

3 · Import AI · 1mo ago

Import AI 439: AI kernels, decentralized training, and universal representations

Import AI issue 439 covers topics including AI kernels, decentralized training approaches, and universal representations in neural networks. The newsletter also touches on philosophical questions about how a hypothetical superintelligence might internally represent abstract concepts like a soul. As a tier-2 commentary source, this issue aggregates and contextualizes recent AI/ML developments across research and infrastructure themes.

Training Infrastructure · Agent and Tool Ecosystem · Jack Clark · Import AI