4 · arXiv cs.LG (Machine Learning) · 18d ago

Expressivity Limits of Congruence-Based Architectures for Neural Networks on Positive-Definite Matrices

This paper analyzes neural network architectures designed to classify symmetric positive-definite (SPD) matrices, focusing on congruence-like layers as used in SPDNet. The authors prove that imposing semi-orthogonality constraints on weight matrices limits expressivity, causing deep architectures to collapse to single-hidden-layer equivalents due to spectral diversity loss—a consequence of Poincaré's separation theorem. The work also compares Riemannian classifiers for compatibility with congruence-based feature maps.
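The separation property mentioned above can be checked numerically. The following is a minimal sketch, not the paper's code: it assumes a congruence-like layer of the form X -> W^T X W with a semi-orthogonal W (orthonormal columns), and the helper names `random_spd`, `semi_orthogonal`, and `congruence_layer` are hypothetical. It verifies that the output eigenvalues are sandwiched between input eigenvalues, as Poincaré's separation theorem states.

```python
import numpy as np

def random_spd(n, rng):
    """Return a random n x n symmetric positive-definite matrix."""
    a = rng.standard_normal((n, n))
    return a @ a.T + n * np.eye(n)

def semi_orthogonal(n, k, rng):
    """Return an n x k matrix with orthonormal columns (W.T @ W = I_k)."""
    q, _ = np.linalg.qr(rng.standard_normal((n, k)))
    return q

def congruence_layer(x, w):
    """Congruence-like map X -> W^T X W (assumed layer form)."""
    return w.T @ x @ w

rng = np.random.default_rng(0)
n, k = 8, 5
x = random_spd(n, rng)
w = semi_orthogonal(n, k, rng)
y = congruence_layer(x, w)

lam = np.linalg.eigvalsh(x)  # ascending eigenvalues of the input
mu = np.linalg.eigvalsh(y)   # ascending eigenvalues of the output

# Poincare separation: lam[i] <= mu[i] <= lam[i + n - k] for i = 0..k-1
tol = 1e-9
interlaced = bool(np.all(lam[:k] <= mu + tol) and np.all(mu <= lam[n - k:] + tol))
print("eigenvalues interlace:", interlaced)
```

Because each such layer can only squeeze the spectrum inside the previous one, stacking them cannot widen the range of eigenvalues, which is the intuition behind the spectral diversity loss described in the summary.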

Evaluation and Benchmarking · congruence-based layers · SPDNet · Poincaré separation theorem · symmetric positive-definite matrices · Riemannian classifiers

Related guides (1)

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder


Related events (8)

5 · arXiv · cs.AI · 4d ago

Internal Oppenheim-Lim test reveals phase/sign identity codes shared across image classifier architectures

A new arXiv preprint applies a causal intervention inspired by Oppenheim and Lim (1981) to probe whether trained image classifiers encode identity in Fourier phase rather than magnitude within their hidden layers. By transplanting phase or sign components between images at chosen layers in PRISM2D, GFNet, ViT-B/16, and ResNet-50, the authors find that predictions follow the phase/sign donor across all tested architectures, with image-specific magnitude largely dispensable. ResNet-50 requires a pre-ReLU intervention to reveal a latent sign code, exposing how rectification and readout geometry shape the basis in which the code is expressed. The findings offer a mechanistic account of the texture–shape gap between CNNs and attention-based models.
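For readers unfamiliar with the underlying operation, here is a minimal sketch of a phase transplant in the style of Oppenheim and Lim. It works on raw 2D arrays rather than on hidden-layer activations, so it illustrates only the basic swap, not the preprint's layer-wise intervention. The function name `transplant_phase` and the random test arrays are assumptions made for illustration.

```python
import numpy as np

def transplant_phase(magnitude_source, phase_donor):
    """Combine the Fourier magnitude of one array with the Fourier phase of another."""
    f_mag = np.fft.fft2(magnitude_source)
    f_phase = np.fft.fft2(phase_donor)
    hybrid = np.abs(f_mag) * np.exp(1j * np.angle(f_phase))
    # The hybrid spectrum is conjugate-symmetric, so the inverse is real up to rounding.
    return np.real(np.fft.ifft2(hybrid))

def corr(u, v):
    """Pearson correlation between two arrays, flattened."""
    return float(np.corrcoef(u.ravel(), v.ravel())[0, 1])

rng = np.random.default_rng(0)
a = rng.standard_normal((32, 32))  # supplies the magnitude
b = rng.standard_normal((32, 32))  # supplies the phase
hybrid = transplant_phase(a, b)

print("correlation with magnitude source:", corr(hybrid, a))
print("correlation with phase donor:", corr(hybrid, b))
```

In this toy setting the hybrid resembles the phase donor far more than the magnitude source, which is the image-level effect the preprint tests inside trained networks.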

Evaluation and Benchmarking · ViT-B/16 · GFNet · PRISM2D · +2 more

4 · arXiv · cs.LG · 11d ago

Conservation laws from data symmetry in neural network gradient-flow training

A new arXiv preprint investigates whether intrinsic symmetries in training data produce conserved quantities during gradient-flow training of neural networks. The authors prove that for analytic, non-polynomial loss functions, data symmetries generically do not induce additional integrals of motion, but for MSE loss, data augmentation can yield extra conserved quantities. They introduce a framework of 'tensorizable networks'—architectures including linear, polynomial, and Lightning Attention networks—where parameter and input dependence can be separated via an intermediate representation.

Training Infrastructure · Lightning Attention · Conservation Laws from Data Symmetry in Neural Networks

7 · arXiv · cs.AI · 29d ago

The Matching Principle: A Geometric Theory Unifying Robustness, Domain Adaptation, and Alignment via Nuisance Covariance

This paper proposes the 'matching principle': a unified geometric framework arguing that robustness methods (CORAL, IRM, adversarial training, augmentation, metric learning, Jacobian penalties, alignment constraints) are all estimators of the same object—the covariance of label-preserving deployment nuisance—and that regularizing the encoder Jacobian along this covariance's range is the core statistical problem. The authors prove closed-form optimality results in a linear-Gaussian model, introduce the Trajectory Deviation Index (TDI) as a label-free embedding sensitivity probe, and validate predictions across 13 pre-registered experimental blocks including Qwen2.5-7B. At 7B scale, matched style-PMH improves selective honesty while standard DPO degrades Style TDI, connecting the theory to alignment safety.

Evaluation and Benchmarking · AI Safety Research · Invariant Risk Minimization · Matching Principle · Qwen2.5-7B · +5 more

4 · arXiv · cs.LG · 15d ago

PC Layer: Polynomial weight preconditioning for stable LLM pre-training

Researchers propose a PC (preconditioning) layer that applies polynomial preconditioning to reshape the singular-value spectrum of weight matrices during LLM training, improving conditioning stability. The preconditioned weights merge back into the original architecture at inference time with no overhead. Experiments on Llama-1B pre-training show advantages over standard transformers for both AdamW and Muon optimizers, with theoretical convergence guarantees for deep linear networks.
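To make "polynomial preconditioning of the singular-value spectrum" concrete, the sketch below applies an odd matrix polynomial p(W) = aW + b(W W^T W), which maps each singular value s to a*s + b*s^3 while leaving the singular vectors unchanged. The coefficients a = 1.5 and b = -0.5 and the function name `odd_poly_precondition` are illustrative assumptions; the summary does not state which polynomial the PC layer uses.

```python
import numpy as np

def odd_poly_precondition(w, a=1.5, b=-0.5):
    """Apply p(W) = a*W + b*(W W^T W); singular values map to a*s + b*s**3.

    The coefficients are hypothetical, chosen so values in (0, 1] move toward 1.
    """
    return a * w + b * (w @ w.T @ w)

rng = np.random.default_rng(0)
w = rng.standard_normal((6, 4))
w = w / np.linalg.norm(w, 2)  # scale so the largest singular value is 1

s_before = np.linalg.svd(w, compute_uv=False)
w_pc = odd_poly_precondition(w)
s_after = np.linalg.svd(w_pc, compute_uv=False)
predicted = 1.5 * s_before - 0.5 * s_before**3

cond_before = s_before[0] / s_before[-1]
cond_after = s_after[0] / s_after[-1]
print("condition number before:", cond_before)
print("condition number after:", cond_after)
```

The preconditioned matrix has the same shape as the original weight, which is consistent with the summary's point that it can be merged back into the architecture with no inference overhead.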

Training Infrastructure · AdamW · PC Layer: Polynomial Weight Preconditioning for Improving LLM Pre-Training · Llama 1B · +1 more

5 · arXiv · cs.AI · 1mo ago

Beyond Isotropy in JEPAs: Hamiltonian Geometry and Symplectic Prediction

This paper critiques the standard practice of regularizing Joint-Embedding Predictive Architecture (JEPA) encoders toward isotropic Gaussian marginals, showing that this Euclidean symmetry assumption incurs a quantifiable 'price of isotropy' and that no geometry-independent fixed marginal target is universally canonical. The authors prove that oracle one-view marginals do not identify the view-to-view predictive coupling, arguing structural bias should enter the cross-view coupling instead. They introduce HamJEPA, which encodes views as phase-space states and uses a learned Hamiltonian leapfrog map for view-to-view prediction, with symplectic coupling identified as the key driver of gains. HamJEPA outperforms SIGReg on CIFAR-100 by up to +6.45 kNN@20 and +10.64 linear-probe points at 80 epochs, with similar improvements on ImageNet-100.
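A leapfrog map is the standard symplectic integrator for Hamiltonian dynamics. The sketch below uses a fixed harmonic potential V(q) = q^2 / 2 in place of a learned Hamiltonian, so it illustrates only the integrator's structure, not HamJEPA itself. The names `leapfrog_step`, `harmonic_grad`, and `step_matrix` are hypothetical.

```python
import numpy as np

def leapfrog_step(q, p, grad_potential, step):
    """One kick-drift-kick leapfrog step for H(q, p) = p^2 / 2 + V(q)."""
    p_half = p - 0.5 * step * grad_potential(q)
    q_new = q + step * p_half
    p_new = p_half - 0.5 * step * grad_potential(q_new)
    return q_new, p_new

def harmonic_grad(q):
    """Gradient of the assumed potential V(q) = q^2 / 2."""
    return q

def step_matrix(step):
    """Matrix of one leapfrog step, which is linear for the harmonic potential."""
    images = [leapfrog_step(q0, p0, harmonic_grad, step)
              for q0, p0 in [(1.0, 0.0), (0.0, 1.0)]]
    return np.array(images).T  # columns are images of the basis vectors

step = 0.1
det = float(np.linalg.det(step_matrix(step)))  # symplectic maps preserve phase-space area

q, p = 1.0, 0.0
energies = []
for _ in range(1000):
    q, p = leapfrog_step(q, p, harmonic_grad, step)
    energies.append(0.5 * p**2 + 0.5 * q**2)
max_drift = max(abs(e - 0.5) for e in energies)
print("determinant:", det, "max energy drift:", max_drift)
```

The unit determinant and the bounded energy error over many steps are the properties that make a symplectic coupling attractive as a structural bias for view-to-view prediction.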

Evaluation and Benchmarking · Alignment and RLHF · ImageNet-100 · HamJEPA · CIFAR-100 · +4 more

4 · arXiv · cs.LG · 8d ago

Theoretical analysis of truncated positional encodings for graph neural networks

A new arXiv paper initiates a formal study of truncated positional encodings (PEs) for graph neural networks, showing that truncation breaks the theoretical equivalence between spectral and walk-based PE families. Key findings include that truncated spectral PEs lose their advantage over the 1-WL expressivity test, and that k-harmonic distances differ meaningfully from other closely related truncated spectral PEs. Experiments on real-world datasets suggest mixing truncated PE families outperforms any single family.
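As a point of reference, a truncated spectral PE is commonly built by keeping only the k lowest nontrivial eigenvectors of the graph Laplacian. The sketch below assumes that construction with the unnormalized Laplacian L = D - A; the paper's exact definitions may differ. The function name `truncated_laplacian_pe` and the 6-node cycle graph are illustrative assumptions.

```python
import numpy as np

def truncated_laplacian_pe(adjacency, k):
    """Return the k lowest nontrivial Laplacian eigenpairs as (values, vectors)."""
    degree = np.diag(adjacency.sum(axis=1))
    laplacian = degree - adjacency
    eigvals, eigvecs = np.linalg.eigh(laplacian)  # ascending eigenvalues
    # Skip the first (constant) eigenvector of a connected graph, keep the next k.
    return eigvals[1:k + 1], eigvecs[:, 1:k + 1]

# Hypothetical example graph: a cycle on 6 nodes.
n = 6
adjacency = np.zeros((n, n))
for i in range(n):
    adjacency[i, (i + 1) % n] = 1.0
    adjacency[(i + 1) % n, i] = 1.0

vals, pe = truncated_laplacian_pe(adjacency, k=2)
print("kept eigenvalues:", vals)
print("PE shape (nodes x k):", pe.shape)
```

Each row of `pe` is one node's encoding. Eigenvectors are defined only up to sign, and up to rotation within repeated eigenvalues, so the specific numbers can vary between solvers.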

Evaluation and Benchmarking · Understanding Truncated Positional Encodings for Graph Neural Networks

5 · arXiv · cs.AI · 11d ago

AdamO optimizer and dynamical isometry regularization preserve plasticity in continual learning

A new arXiv preprint connects plasticity loss in continual learning to the empirical Neural Tangent Kernel and identifies dynamical isometry—keeping layer-wise Jacobian singular values near one—as a key mechanism for maintaining learning capacity under non-stationarity. The authors propose an isometry-promoting regularization scheme that can reactivate dormant ReLU units and introduce AdamO, an Adam-style optimizer that decouples isometry regularization from gradient updates analogously to AdamW. The methods are evaluated on supervised and reinforcement-learning continual-learning benchmarks, consistently matching or outperforming prior approaches. The work also reinterprets existing plasticity-preserving methods as targeting only partial isometry measures.
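One simple way to promote singular values near one is to penalize the distance between W^T W and the identity and to apply that gradient as its own update, separate from the loss gradient, in the spirit of AdamW's decoupled weight decay. The sketch below assumes this penalty and a plain gradient step; it is not AdamO, whose actual update rule is not given in the summary. The names `isometry_penalty` and `penalty_gradient` are hypothetical.

```python
import numpy as np

def isometry_penalty(w):
    """Squared Frobenius distance between W^T W and the identity."""
    gram = w.T @ w
    return float(np.sum((gram - np.eye(w.shape[1])) ** 2))

def penalty_gradient(w):
    """Gradient of the penalty with respect to W: 4 W (W^T W - I)."""
    return 4.0 * w @ (w.T @ w - np.eye(w.shape[1]))

rng = np.random.default_rng(0)
w = rng.standard_normal((5, 5))
w = 1.5 * w / np.linalg.norm(w, 2)  # largest singular value starts at 1.5

penalty_before = isometry_penalty(w)
lr = 0.01  # assumed step size for the decoupled regularization update
for _ in range(1000):
    # Only the regularizer is applied here; a loss-gradient step would be separate.
    w = w - lr * penalty_gradient(w)
penalty_after = isometry_penalty(w)

print("penalty before:", penalty_before)
print("penalty after:", penalty_after)
print("singular values:", np.linalg.svd(w, compute_uv=False))
```

For a single linear layer the Jacobian is the weight matrix itself, so driving this penalty to zero places every layer-wise singular value at one.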

Alignment and RLHF · AdamW · Neural Tangent Kernel · AdamO · +1 more

5 · arXiv · cs.AI · 1mo ago

Survey: Approximation Theory for Neural Networks — Classical Results and New Directions Including KANs

This arXiv survey reviews four decades of universal approximation theory for feedforward neural networks, covering classical density results for single-hidden-layer networks and quantitative bounds relating approximation error to network size and target-function smoothness. It gives particular emphasis to depth-width trade-offs and the parameter-efficiency advantages of deeper architectures for structured function classes. The survey also covers recent theoretical developments on Kolmogorov-Arnold Networks (KANs) as an alternative architectural paradigm with emerging approximation-theoretic analysis.

Evaluation and Benchmarking · Feedforward Neural Networks · Universal Approximation Theorem · depth-width trade-offs · +2 more