Survey: Approximation Theory for Neural Networks — Classical Results and New Directions Including KANs
This arXiv survey reviews four decades of universal approximation theory for feedforward neural networks, covering classical density results for single-hidden-layer networks and quantitative bounds relating approximation error to network size and target function smoothness. It emphasizes depth-width trade-offs and the parameter-efficiency advantages of deeper architectures for structured function classes. The survey also covers recent theoretical developments on Kolmogorov-Arnold Networks (KANs), an alternative architectural paradigm whose approximation-theoretic analysis is still emerging.
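As a rough illustration of what a quantitative bound relating error to network size looks like, the sketch below builds a single-hidden-layer ReLU network by hand (no training) that linearly interpolates a smooth 1D target. It is not taken from the survey: the function names, the target `sin` on [0, pi], and the equally spaced knots are all assumptions chosen for the example. For twice-differentiable targets, the classical interpolation estimate predicts a maximum error that shrinks roughly like 1/n^2 in the number of hidden units n.

```python
import math


def relu(z):
    return max(0.0, z)


def build_relu_interpolant(f, a, b, n):
    """Return a one-hidden-layer ReLU net with n hidden units that
    linearly interpolates f at n + 1 equally spaced knots on [a, b]."""
    h = (b - a) / n
    knots = [a + k * h for k in range(n + 1)]
    values = [f(x) for x in knots]
    slopes = [(values[k + 1] - values[k]) / h for k in range(n)]
    # Output weight of hidden unit k is the change in slope at knot k.
    weights = [slopes[0]] + [slopes[k] - slopes[k - 1] for k in range(1, n)]
    bias = values[0]

    def net(x):
        # zip stops at n pairs, so the last knot (x = b) gets no unit.
        return bias + sum(w * relu(x - t) for w, t in zip(weights, knots))

    return net


def max_error(f, net, a, b, samples=2001):
    """Largest absolute error on a uniform evaluation grid."""
    xs = [a + (b - a) * i / (samples - 1) for i in range(samples)]
    return max(abs(f(x) - net(x)) for x in xs)


# Error versus width for the assumed target sin on [0, pi].
errors = {
    n: max_error(math.sin, build_relu_interpolant(math.sin, 0.0, math.pi, n),
                 0.0, math.pi)
    for n in (4, 8, 16, 32)
}
```

Doubling the width should cut the error in `errors` by roughly a factor of four. That matches the second-order smoothness of the target, and is the kind of size-versus-smoothness relationship the quantitative bounds formalize.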
Related events (8)
Neuronal Stochastic Attention Circuit (NSAC) for Probabilistic Representation Learning
Researchers introduce NSAC, a biologically-inspired continuous-time attention architecture that models attention logits as solutions to an Ornstein-Uhlenbeck stochastic differential equation, drawing on C. elegans Neuronal Circuit Policy wiring to induce Gaussian distributions over attention weights. The architecture enables joint quantification of aleatoric and epistemic uncertainty via a two-term objective combining Gaussian negative log-likelihood with an epistemic-separation regularizer. Empirical evaluation spans irregular time-series function approximation, multivariate regression, long-range forecasting, Industry 4.0 tasks, and autonomous vehicle lane-keeping, showing competitive accuracy with well-calibrated uncertainty estimates.
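To make the Ornstein-Uhlenbeck ingredient concrete, here is a minimal Euler-Maruyama simulation of a scalar OU process dX = theta (mu - X) dt + sigma dW. It is only a sketch of the underlying SDE, not of NSAC itself: the parameter names `theta`, `mu`, `sigma` and the function `simulate_ou` are assumptions for illustration, and the paper's wiring, attention mechanism, and training objective are not reproduced.

```python
import math
import random


def simulate_ou(x0, theta, mu, sigma, dt, steps, rng):
    """Euler-Maruyama path of dX = theta * (mu - X) dt + sigma dW.

    rng is a random.Random instance, so runs can be seeded.
    Returns the list of states, including the initial one.
    """
    x = x0
    path = [x]
    for _ in range(steps):
        noise = rng.gauss(0.0, 1.0)  # standard normal increment
        x = x + theta * (mu - x) * dt + sigma * math.sqrt(dt) * noise
        path.append(x)
    return path


def stationary_variance(theta, sigma):
    """Variance of the exact OU stationary law, N(mu, sigma^2 / (2 theta))."""
    return sigma ** 2 / (2.0 * theta)


# With sigma = 0 the path is deterministic and relaxes toward mu.
noiseless = simulate_ou(x0=5.0, theta=2.0, mu=1.0, sigma=0.0,
                        dt=0.01, steps=2000, rng=random.Random(0))
```

The exact process has a Gaussian stationary distribution with mean `mu` and variance `sigma**2 / (2 * theta)`. This is one way to read the summary's claim that OU dynamics over logits induce Gaussian distributions.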
Nyströmformer: Approximating Self-Attention in Linear Time and Memory via the Nyström Method
This Hugging Face blog post covers Nyströmformer, a transformer variant that approximates standard self-attention using the Nyström method to achieve linear time and memory complexity. The approach addresses the quadratic scaling bottleneck of standard attention, enabling processing of longer sequences at reduced computational cost. The post likely covers the model's integration into the Hugging Face ecosystem and its practical use cases.
Dynamic short convolutions yield 1.33–1.60× compute advantage over standard Transformers
A new arXiv preprint introduces dynamic short convolutions as an architectural primitive for Transformers, using input-dependent filters to combine locality bias with increased expressivity. Experiments across 150M–2B parameter language models show consistent perplexity improvements over standard Transformers and static convolution variants, with scaling-law fits indicating a 1.33× compute advantage when applied to key/query/value vectors and 1.60× when added after every linear layer. The technique also improves linear RNNs (Mamba-2, Gated DeltaNet) and mixture-of-experts architectures, with custom Triton kernels making training practical.
Tight Convergence Theory for Error Feedback Algorithms in Distributed Optimization
This paper provides tight convergence analyses for two major error-feedback algorithms—classic Error Feedback (EF) and Error Feedback 21 (EF21)—used to mitigate communication bottlenecks in distributed learning. The authors identify optimal step-size choices and construct tailored Lyapunov functions for each method, yielding guarantees that hold independently of the number of agents and recover the best known single-agent bounds. The work clarifies the relative performance of these gradient compression variants, which has remained poorly understood despite widespread use.
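For readers unfamiliar with the mechanism, the following single-worker toy sketches classic Error Feedback: the part of each update that the compressor drops is stored and added back before the next compression. Everything specific here is an assumption for illustration, including the top-k compressor, the diagonal quadratic objective, the step size, and the names `top_k` and `error_feedback_descent`. The paper's optimal step sizes and Lyapunov functions are not reproduced, and EF21 is not shown.

```python
def top_k(vec, k):
    """Keep the k largest-magnitude entries of vec and zero the rest."""
    keep = sorted(range(len(vec)), key=lambda i: abs(vec[i]), reverse=True)[:k]
    out = [0.0] * len(vec)
    for i in keep:
        out[i] = vec[i]
    return out


def grad(x, diag):
    """Gradient of the toy objective f(x) = 0.5 * sum(diag[i] * x[i]**2)."""
    return [d * xi for d, xi in zip(diag, x)]


def error_feedback_descent(x0, diag, step, k, iters):
    """Classic EF with a top-k compressor on a single worker."""
    x = list(x0)
    err = [0.0] * len(x)  # memory of what compression has dropped so far
    for _ in range(iters):
        g = grad(x, diag)
        # Add the stored error back in before compressing.
        proposal = [e + step * gi for e, gi in zip(err, g)]
        sent = top_k(proposal, k)  # what would actually be communicated
        err = [p - s for p, s in zip(proposal, sent)]
        x = [xi - si for xi, si in zip(x, sent)]
    return x


# Toy run: only 2 of 4 coordinates are transmitted per step.
x_final = error_feedback_descent(
    x0=[1.0, -2.0, 3.0, -4.0], diag=[1.0, 2.0, 3.0, 4.0],
    step=0.01, k=2, iters=3000,
)
```

In this toy setting the iterate still approaches the minimizer at the origin, even though half of each update is withheld per step, because the withheld part is carried forward in `err` rather than discarded.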
Expressivity Limits of Congruence-Based Architectures for Neural Networks on Positive-Definite Matrices
This paper analyzes neural network architectures designed to classify symmetric positive-definite (SPD) matrices, focusing on congruence-like layers as used in SPDNet. The authors prove that imposing semi-orthogonality constraints on weight matrices limits expressivity, causing deep architectures to collapse to single-hidden-layer equivalents due to spectral diversity loss—a consequence of Poincaré's separation theorem. The work also compares Riemannian classifiers for compatibility with congruence-based feature maps.
Understanding Neural Networks Through Sparse Circuits
OpenAI has published work on mechanistic interpretability using a sparse model approach aimed at understanding how neural networks reason internally. The research seeks to make AI systems more transparent by identifying sparse circuits within neural networks. This work is positioned as supporting safer and more reliable AI behavior through improved interpretability.
Express: Efficient causal attention approximation with formal guarantees and FlashAttention 2 speedups
A new tool called Express converts non-causal attention approximations into causal ones with matching theoretical guarantees, achieving log^(3/2)(n)/s approximation error with O(s) memory. Combined with the Thinformer approximation and an I/O-aware Triton implementation, it demonstrates substantial speedups over FlashAttention 2. The work targets four practical bottlenecks: long-context prefill, KV cache compression, and both memory- and compute-constrained long-form decoding.
Import AI 439: AI kernels, decentralized training, and universal representations
Import AI issue 439 covers topics including AI kernels, decentralized training approaches, and universal representations in neural networks. The newsletter also touches on philosophical questions about how a hypothetical superintelligence might internally represent abstract concepts like a soul. As a tier-2 commentary source, this issue aggregates and contextualizes recent AI/ML developments across research and infrastructure themes.
