Lie-Algebra Attention: tokens as bare matrix Lie group elements with closed-form geometric scores
A new arXiv preprint introduces Lie-Algebra Attention, an attention mechanism in which tokens are elements of a matrix Lie group rather than feature vectors. Pairwise attention scores are computed in closed form as the Lie-algebra norm of the relative pose, that is, the norm of the matrix logarithm of g_i^{-1} g_j for tokens g_i and g_j. Because the relative pose is unchanged when every token is multiplied on the left by the same group element, the construction achieves equivariance tautologically, without representation-theoretic machinery such as irreps, spherical harmonics, or Clebsch-Gordan products, and it extends to non-compact affine groups that existing methods cannot handle. Experiments on SE(2), SO(3), and Aff(2) sequence-completion tasks show the closed-form score matches or outperforms learned MLP kernels while using 50–80x fewer score parameters.
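As a rough illustration of the scoring idea, the sketch below works on SO(3), where the norm of the log of a relative rotation is simply its rotation angle. The negative sign, the temperature, the softmax normalization, and the helper names (`rot_from_axis_angle`, `relative_angle`, `attention_weights`) are assumptions made for this example; the summary does not state the paper's exact formula.

```python
import numpy as np

def rot_from_axis_angle(axis, angle):
    """Rodrigues' formula: rotation matrix for a given axis and angle."""
    axis = np.asarray(axis, dtype=float)
    axis = axis / np.linalg.norm(axis)
    K = np.array([[0.0, -axis[2], axis[1]],
                  [axis[2], 0.0, -axis[0]],
                  [-axis[1], axis[0], 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

def relative_angle(Ri, Rj):
    """Norm of log(Ri^{-1} Rj) as a rotation vector, i.e. the geodesic angle."""
    R_rel = Ri.T @ Rj  # the inverse of a rotation matrix is its transpose
    cos_theta = np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0)
    return np.arccos(cos_theta)

def attention_weights(tokens, temperature=1.0):
    """Softmax over negative relative angles (assumed form of the score)."""
    n = len(tokens)
    scores = np.array([[-relative_angle(tokens[i], tokens[j]) / temperature
                        for j in range(n)] for i in range(n)])
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    return w / w.sum(axis=1, keepdims=True)

# Three tokens: rotations about the z-axis by increasing angles.
tokens = [rot_from_axis_angle([0, 0, 1], a) for a in (0.0, 0.3, 1.2)]
weights = attention_weights(tokens)

# Left-multiplying every token by the same rotation h leaves the weights unchanged,
# because (h Ri)^{-1} (h Rj) = Ri^{-1} Rj.
h = rot_from_axis_angle([1, 2, 3], 0.7)
weights_moved = attention_weights([h @ R for R in tokens])
print(np.allclose(weights, weights_moved))
```

The score here has no learned parameters at all, which is the kind of saving the summary contrasts with learned MLP kernels.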
Related events (8)
Functional Attention: Reinterpreting Attention as Functional Correspondences for Operator Learning
This paper introduces Functional Attention, a novel attention mechanism for operator learning that replaces standard softmax token-wise affinities with structured linear operators inspired by geometric functional maps. The approach treats attention as a correspondence between adaptive bases rather than discrete tokens, yielding a resolution-invariant, globally-aware representation. Experiments show competitive or state-of-the-art performance on PDE solving, 3D segmentation, and regression tasks, with robustness to varying discretizations.
Positional vs. Symbolic Attention Heads: Learning Dynamics, RoPE Geometry, and Length Generalization
Researchers train a decoder-only Transformer (GPT-J) on two structurally equivalent multi-hop reasoning tasks to study how attention heads specialize into positional or symbolic roles during learning. They find that successful task learning correlates with the emergence of 'pure' heads—exclusively positional or symbolic—and provide theoretical constructions showing how single-layer RoPE-based attention realizes these functions geometrically. A novel 'discrepancy' metric formalizes the robustness difference between the two head types, with symbolic mechanisms shown to extrapolate more reliably to longer sequences than positional ones. The findings have implications for understanding length generalization failures in RoPE-based models.
ATLAS: Unified Agentic and Latent Visual Reasoning via Functional Tokens
ATLAS proposes a framework where a single discrete 'functional token' serves dual roles as both an agentic operation trigger and a latent visual reasoning unit in multimodal models. This design avoids the computational cost of generating intermediate images while sidestepping the context-switching latency of external tool calls and the generalization limitations of pure latent methods. The framework is compatible with standard SFT and RL training pipelines without architectural changes, and introduces Latent-Anchored GRPO (LA-GRPO) to stabilize reinforcement learning when functional tokens are sparse. Experiments show strong performance on visual reasoning benchmarks with maintained interpretability.
VideoMLA: Low-Rank Latent KV Cache for Minute-Scale Autoregressive Video Diffusion
VideoMLA applies Multi-Head Latent Attention (MLA) to causal video diffusion, replacing per-head keys and values with a shared low-rank content latent and decoupled 3D-RoPE positional key, achieving 92.7% reduction in per-token KV memory. The paper investigates why MLA works despite pretrained video attention not being low-rank (unlike the spectral assumption motivating MLA in LLMs), finding that the MLA bottleneck itself determines effective rank rather than the pretrained spectrum. On VBench, VideoMLA matches short-horizon baselines, achieves best overall score at long horizons, and delivers 1.23x throughput improvement on a single NVIDIA B200 GPU.
Nyströmformer: Approximating Self-Attention in Linear Time and Memory via the Nyström Method
This Hugging Face blog post covers Nyströmformer, a transformer variant that approximates standard self-attention using the Nyström method to achieve linear time and memory complexity. The approach addresses the quadratic scaling bottleneck of standard attention, enabling processing of longer sequences at reduced computational cost. The post likely covers the model's integration into the Hugging Face ecosystem and its practical use cases.
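A minimal sketch of a Nyström-style attention approximation follows, under stated assumptions: landmarks are taken as segment means of the queries and keys, the sequence length is divisible by the number of landmarks, and `np.linalg.pinv` is swapped in for whatever pseudoinverse scheme the actual model uses. The function name `nystrom_attention` is hypothetical and is not the Hugging Face API.

```python
import numpy as np

def softmax(x):
    """Row-wise softmax with max subtraction for stability."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def nystrom_attention(Q, K, V, num_landmarks):
    """Approximate softmax(QK^T / sqrt(d)) V using num_landmarks landmarks."""
    n, d = Q.shape
    seg = n // num_landmarks  # assumes n is divisible by num_landmarks
    Q_l = Q.reshape(num_landmarks, seg, d).mean(axis=1)  # landmark queries
    K_l = K.reshape(num_landmarks, seg, d).mean(axis=1)  # landmark keys
    scale = 1.0 / np.sqrt(d)
    F = softmax(Q @ K_l.T * scale)    # (n, m): queries vs landmark keys
    A = softmax(Q_l @ K_l.T * scale)  # (m, m): landmarks vs landmarks
    B = softmax(Q_l @ K.T * scale)    # (m, n): landmark queries vs keys
    # Multiplying right-to-left never materializes an n-by-n matrix.
    return F @ (np.linalg.pinv(A) @ (B @ V))

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((8, 4)) for _ in range(3))
exact = softmax(Q @ K.T / np.sqrt(4)) @ V
approx = nystrom_attention(Q, K, V, num_landmarks=4)
print(np.abs(exact - approx).max())
```

With a fixed number of landmarks m, each of the three factors costs O(n·m), which is where the linear scaling in sequence length comes from.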
Local linear structures in LLM weights and activations are dynamic, not fixed global directions
A new arXiv paper investigates the nature of linear structures in transformer weights and activations, finding strong local low-rank task-gradient structure but rejecting the hypothesis that fixed task planes exist. The authors show that useful bases drift substantially within 100 optimization steps, yet early recovery updates form a trajectory-prefix basis capturing 77% of LoRA recovery displacement. They also establish a formal connection between parameter perturbations and activation steering, finding a 0.58 cosine similarity between gradient-step-induced activation shifts and CAA steering vectors, suggesting linear structures are evolving local geometries rather than stable global task directions.
Understanding BigBird's Block Sparse Attention
This Hugging Face blog post provides a technical explanation of BigBird's block sparse attention mechanism, which extends transformer models to handle longer sequences by replacing dense quadratic attention with a combination of local, global, and random sparse attention patterns. The post covers the theoretical underpinnings and implementation details of how BigBird achieves linear complexity with respect to sequence length. It serves as educational commentary on a published research architecture that enables efficient processing of sequences of 4096 tokens or more.
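The combination of local, global, and random patterns can be pictured as a block-level boolean mask. The sketch below is a simplified illustration, not BigBird's actual implementation: the function name `bigbird_block_mask`, its parameters, and the way random blocks are drawn are all assumptions made for this example.

```python
import numpy as np

def bigbird_block_mask(num_blocks, window=3, num_global=1, num_random=1, seed=0):
    """Boolean mask where entry (i, j) says whether query block i sees key block j."""
    rng = np.random.default_rng(seed)
    mask = np.zeros((num_blocks, num_blocks), dtype=bool)
    half = window // 2
    for i in range(num_blocks):
        # Local (sliding-window) blocks around the diagonal.
        lo, hi = max(0, i - half), min(num_blocks, i + half + 1)
        mask[i, lo:hi] = True
        # Random blocks drawn from the positions not yet attended.
        candidates = np.flatnonzero(~mask[i])
        if len(candidates) > 0:
            picks = rng.choice(candidates, size=min(num_random, len(candidates)),
                               replace=False)
            mask[i, picks] = True
    # Global blocks attend to everything and are attended to by everything.
    mask[:num_global, :] = True
    mask[:, :num_global] = True
    return mask

mask = bigbird_block_mask(num_blocks=16)
print(mask.astype(int))
```

Each non-global row attends to at most `window + num_global + num_random` blocks regardless of sequence length, so the total number of attended blocks grows linearly with the number of blocks.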
ConSA: Learned FA/SWA allocation for efficient hybrid attention in LLMs
ConSA is a framework that learns optimal assignments between full attention and sliding-window attention layers under a user-specified sparsity target, using L0 regularization and augmented Lagrangian constraints. Evaluated on 0.6B and 1.7B parameter models, learned allocations consistently outperform hand-crafted rule-based baselines, with KV-head-wise granularity outperforming layer-wise. A consistent structural pattern emerges: SWA concentrates in bottom layers while FA clusters in contiguous middle-layer blocks, diverging from the evenly interleaved patterns used in existing hybrid architectures.
