4 · arXiv cs.AI (Artificial Intelligence) · 25d ago

Neuronal Stochastic Attention Circuit (NSAC) for Probabilistic Representation Learning

Researchers introduce NSAC, a biologically inspired continuous-time attention architecture that models attention logits as solutions to an Ornstein-Uhlenbeck stochastic differential equation. It draws on C. elegans Neuronal Circuit Policy wiring to induce Gaussian distributions over attention weights. The architecture quantifies aleatoric and epistemic uncertainty jointly through a two-term objective that combines Gaussian negative log-likelihood with an epistemic-separation regularizer. The empirical evaluation spans irregular time-series function approximation, multivariate regression, long-range forecasting, Industry 4.0 tasks, and autonomous vehicle lane-keeping, and reports competitive accuracy with well-calibrated uncertainty estimates.
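As a rough illustration of the idea described above, the sketch below simulates attention logits with an Ornstein-Uhlenbeck process, dz = θ(μ − z) dt + σ dW, using Euler-Maruyama steps, then passes them through a softmax. It is not the paper's implementation: the function names, the parameter values (`theta`, `sigma`, `dt`, `steps`), and the three-logit setup are all assumptions chosen for illustration, and the epistemic-separation regularizer is not shown because the summary does not specify its form. Gaussian logits pushed through a softmax yield logistic-normal weights, which is presumably why that distribution appears among the tags.

```python
import math
import random


def simulate_ou_logits(z0, mu, theta, sigma, dt, steps, rng):
    """Euler-Maruyama integration of dz = theta*(mu - z) dt + sigma dW, one SDE per logit.

    All parameters here are illustrative assumptions, not values from the paper.
    """
    z = list(z0)
    for _ in range(steps):
        z = [
            zi + theta * (mi - zi) * dt + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
            for zi, mi in zip(z, mu)
        ]
    return z


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gaussian_nll(y, mean, var):
    """Negative log-likelihood of y under N(mean, var), the first term of the objective."""
    return 0.5 * (math.log(2.0 * math.pi * var) + (y - mean) ** 2 / var)


rng = random.Random(0)
mu = [1.0, 0.0, -1.0]  # hypothetical long-run logit means

# Each sample is one stochastic draw of the attention weights.
samples = [
    softmax(simulate_ou_logits([0.0, 0.0, 0.0], mu, theta=2.0, sigma=0.3,
                               dt=0.01, steps=500, rng=rng))
    for _ in range(200)
]
mean_weights = [sum(s[i] for s in samples) / len(samples) for i in range(3)]
print("mean attention weights:", mean_weights)
```

The spread of `samples` around `mean_weights` is what gives a distribution over attention weights rather than a single point estimate; a model built this way could propagate that spread to a predictive mean and variance scored with `gaussian_nll`.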

AI Safety Research · Neuronal Stochastic Attention Circuit (NSAC) · Neuronal Circuit Policies (NCPs) · logistic-normal distribution · C. elegans · Ornstein-Uhlenbeck stochastic differential equation · epistemic-separation regularizer

Related guides (1)

AI Safety Research · Topic guide

AI Safety Research: From Lab Policies to Real-World Flashpoints


Related events (8)

5 · arXiv · cs.LG · 19d ago

Functional Attention: Reinterpreting Attention as Functional Correspondences for Operator Learning

This paper introduces Functional Attention, a novel attention mechanism for operator learning that replaces standard softmax token-wise affinities with structured linear operators inspired by geometric functional maps. The approach treats attention as a correspondence between adaptive bases rather than discrete tokens, yielding a resolution-invariant, globally-aware representation. Experiments show competitive or state-of-the-art performance on PDE solving, 3D segmentation, and regression tasks, with robustness to varying discretizations.

Long Context Evolution · Transformers · Functional Attention · Operator Learning

5 · arXiv · cs.CL · 3d ago

ConSA: Learned FA/SWA allocation for efficient hybrid attention in LLMs

ConSA is a framework that learns optimal assignments between full attention and sliding-window attention layers under a user-specified sparsity target, using L0 regularization and augmented Lagrangian constraints. Evaluated on 0.6B and 1.7B parameter models, learned allocations consistently outperform hand-crafted rule-based baselines, with KV-head-wise granularity outperforming layer-wise. A consistent structural pattern emerges: SWA concentrates in bottom layers while FA clusters in contiguous middle-layer blocks, diverging from the evenly interleaved patterns used in existing hybrid architectures.
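To make the combination of L0 regularization and an augmented Lagrangian constraint concrete, here is a minimal sketch. It assumes the commonly used hard-concrete relaxation for L0 gates, which the summary does not confirm ConSA uses. The gate convention (open means full attention, closed means sliding-window attention), the function names, and the constants are all hypothetical.

```python
import math

# Hard-concrete stretch and temperature constants (common defaults, assumed here).
GAMMA, ZETA, BETA = -0.1, 1.1, 2.0 / 3.0


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def prob_gate_open(log_alpha):
    """Probability that a hard-concrete gate is non-zero (the expected L0 term)."""
    return sigmoid(log_alpha - BETA * math.log(-GAMMA / ZETA))


def constraint_penalty(log_alphas, target_sparsity, lam, rho):
    """Augmented Lagrangian term lam*gap + (rho/2)*gap^2 for a sparsity target.

    Assumption: an open gate keeps full attention, so sparsity is the
    expected fraction of gates that are closed (sliding-window attention).
    """
    expected_fa = sum(prob_gate_open(a) for a in log_alphas) / len(log_alphas)
    gap = (1.0 - expected_fa) - target_sparsity
    return lam * gap + 0.5 * rho * gap * gap, gap


# Four hypothetical gates (one per layer or KV head), target of 50% sliding-window.
penalty, gap = constraint_penalty([3.0, 0.5, -2.0, -4.0],
                                  target_sparsity=0.5, lam=1.0, rho=2.0)
print("sparsity gap:", gap, "penalty:", penalty)
```

In a training loop the gate parameters would be updated to reduce the task loss plus this penalty, while `lam` is adjusted in the opposite direction so that the gap is driven toward zero.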

Long Context Evolution · Inference Economics · L0 regularization · ConSA

4 · arXiv · cs.LG · 10d ago

COGENT: Continuous graph emulator with Neural ODEs for long-term physical forecasting on irregular meshes

COGENT is a new architecture combining graph neural networks with Neural Ordinary Differential Equations for continuous-time physical forecasting on irregular geospatial meshes. The model encodes historical system states and forcings into latent dynamics that can be queried at arbitrary future times, avoiding the error accumulation of autoregressive rollout. Evaluated on ice-sheet simulations from the Ice-sheet and Sea-level System Model, COGENT shows improved long-range stability over autoregressive graph baselines. The work introduces training stabilization strategies including rollout-horizon sampling and progressive scheduling.

Neural Ordinary Differential Equations · Ice-sheet and Sea-level System Model · COGENT

6 · OpenAI Blog · 1mo ago

Understanding Neural Networks Through Sparse Circuits

OpenAI has published work on mechanistic interpretability using a sparse model approach aimed at understanding how neural networks reason internally. The research seeks to make AI systems more transparent by identifying sparse circuits within neural networks. This work is positioned as supporting safer and more reliable AI behavior through improved interpretability.

Evaluation and Benchmarking · AI Safety Research · Sparse Circuits · mechanistic interpretability · OpenAI

7 · arXiv · cs.AI · 29d ago

Gated DeltaNet-2: Decoupling Erase and Write Gates in Linear Attention

Gated DeltaNet-2 is a new linear attention architecture from NVIDIA Labs that separates the erase and write operations in the delta-rule update into independent channel-wise gates, generalizing both Gated DeltaNet and Kimi Delta Attention (KDA). The model introduces a chunkwise WY algorithm with channel-wise decay and a gate-aware backward pass for efficient parallel training. At 1.3B parameters trained on 100B FineWeb-Edu tokens, it outperforms Mamba-2, Gated DeltaNet, KDA, and Mamba-3 variants on language modeling, commonsense reasoning, and long-context RULER needle-in-a-haystack retrieval benchmarks. Code is publicly released via NVlabs on GitHub.

Training Infrastructure · Long Context Evolution · NVIDIA Labs · Mamba · WY Algorithm

6 · arXiv · cs.AI · 1mo ago

DashAttention: Differentiable and Adaptive Sparse Hierarchical Attention for Long-Context LLMs

DashAttention introduces a two-stage hierarchical sparse attention mechanism that replaces the fixed top-k block selection used in methods like NSA and InfLLMv2 with an adaptive α-entmax transformation, allowing a variable number of KV blocks to be selected per query. The approach keeps the full hierarchy differentiable by using the first-stage selection as a prior for second-stage softmax attention. Experiments show comparable accuracy to full attention at 75% sparsity with a better Pareto frontier than competing methods, and a Triton GPU implementation achieves meaningful speedup over FlashAttention-3 at inference time.
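The contrast between fixed top-k selection and an adaptive α-entmax selection can be illustrated with sparsemax, which is the α = 2 member of the entmax family (α = 1 recovers softmax). The sketch below is not DashAttention's implementation, and the block scores are made-up numbers. It only shows how the number of blocks with non-zero weight varies with the score profile.

```python
def sparsemax(scores):
    """Euclidean projection of scores onto the probability simplex (entmax with alpha = 2)."""
    z = sorted(scores, reverse=True)
    cumsum = 0.0
    tau = 0.0
    for k, zk in enumerate(z, start=1):
        cumsum += zk
        # The support condition holds for a prefix of the sorted scores;
        # the last k that satisfies it determines the threshold tau.
        if 1.0 + k * zk > cumsum:
            tau = (cumsum - 1.0) / k
    return [max(s - tau, 0.0) for s in scores]


def num_selected(weights):
    """Count blocks that receive non-zero weight."""
    return sum(1 for w in weights if w > 0.0)


# Hypothetical per-block scores for three different queries.
peaked = sparsemax([2.0, 1.0, 0.1])    # one dominant block
mixed = sparsemax([1.0, 0.8, 0.1])     # two competitive blocks
flat = sparsemax([0.5, 0.4, 0.3])      # all blocks comparable

for name, w in [("peaked", peaked), ("mixed", mixed), ("flat", flat)]:
    print(name, w, "selected:", num_selected(w))
```

A fixed top-k rule would pick the same number of blocks for all three queries, whereas the sparse transformation selects one, two, and three blocks respectively, and it remains differentiable almost everywhere.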

Training Infrastructure · Long Context Evolution · Triton · InfLLMv2 · FlashAttention-3

5 · arXiv · cs.CL · 9d ago

AGDO: Attention-guided denoising and optimization framework improves diffusion language model reasoning

Researchers propose AGDO, a framework that replaces random masking in diffusion large language models (dLLMs) with attention-guided denoising order and token weighting during fine-tuning and reinforcement learning. The work is motivated by an empirical finding that tokens with stronger attention to unmasked context are more stable and critical for reasoning. Experiments on math and coding benchmarks show AGDO outperforms existing post-training methods for dLLMs, advancing the case for attention-aware training in parallel-decoding language models.

Alignment and RLHF · AGDO · Beyond Fully Random Masking: Attention-Guided Denoising and Optimization for Diffusion Language Models

4 · Hugging Face Blog · 1mo ago

Nyströmformer: Approximating Self-Attention in Linear Time and Memory via the Nyström Method

This Hugging Face blog post covers Nyströmformer, a transformer variant that approximates standard self-attention using the Nyström method to achieve linear time and memory complexity. The approach addresses the quadratic scaling bottleneck of standard attention, enabling processing of longer sequences at reduced computational cost. The post likely covers the model's integration into the Hugging Face ecosystem and its practical use cases.

Long Context Evolution · Inference Economics · Nyströmformer · Nyström method · Hugging Face