CARVE: Content-aware gating for linear attention recurrent models improves efficiency and quality over GDN-2
CARVE (Content-Aware Recurrent with Value Efficiency) is a new linear attention architecture that addresses three coupled defects in the GDN-2 delta-rule architecture by restricting erasure to the key axis rather than the value axis. This design choice is proven necessary and sufficient for the WY-form triangular chunk solver, which in turn gives training throughput competitive with Transformers. At 1.3B parameters trained on 100B tokens, CARVE achieves lower perplexity than GDN-2, leads recurrent baselines on nine commonsense reasoning benchmarks, and sets a new state of the art on RULER retrieval probes, while using 13% less peak memory and 19% fewer parameters at a 0.4% throughput overhead.
Related events (8)
Gated DeltaNet-2: Decoupling Erase and Write Gates in Linear Attention
Gated DeltaNet-2 is a new linear attention architecture from NVIDIA Labs that separates the erase and write operations in the delta-rule update into independent channel-wise gates, generalizing both Gated DeltaNet and Kimi Delta Attention (KDA). The model introduces a chunkwise WY algorithm with channel-wise decay and a gate-aware backward pass for efficient parallel training. At 1.3B parameters trained on 100B FineWeb-Edu tokens, it outperforms Mamba-2, Gated DeltaNet, KDA, and Mamba-3 variants on language modeling, commonsense reasoning, and long-context RULER needle-in-a-haystack retrieval benchmarks. Code is publicly released via NVlabs on GitHub.
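For readers unfamiliar with the delta rule that this family of models builds on, the following is a minimal sketch of the plain, ungated update only. It is not Gated DeltaNet-2's gated formulation or its chunkwise WY algorithm, which the summary above does not spell out. The function names `delta_rule_step` and `read` are illustrative, and the state is assumed to be a small value-by-key matrix stored as nested lists.

```python
def delta_rule_step(S, k, v, beta):
    """One plain delta-rule update: S <- S - beta * (S k - v) k^T.

    S is a (d_v x d_k) state matrix as nested lists, k a key of length d_k,
    v a value of length d_v, and beta a scalar write strength.
    """
    # What the current state predicts for this key.
    pred = [sum(S[i][j] * k[j] for j in range(len(k))) for i in range(len(v))]
    # Move the prediction toward v along the direction of k.
    return [
        [S[i][j] - beta * (pred[i] - v[i]) * k[j] for j in range(len(k))]
        for i in range(len(v))
    ]


def read(S, q):
    """Read from the state with a query: o = S q."""
    return [sum(row[j] * q[j] for j in range(len(q))) for row in S]


S = [[0.0, 0.0], [0.0, 0.0]]
S = delta_rule_step(S, [1.0, 0.0], [3.0, 4.0], beta=1.0)  # write v under key k
S = delta_rule_step(S, [1.0, 0.0], [5.0, 6.0], beta=1.0)  # overwrite the same key
S = delta_rule_step(S, [0.0, 1.0], [7.0, 8.0], beta=1.0)  # write an orthogonal key
```

With unit-norm keys and `beta=1.0`, a second write to the same key replaces the stored value rather than adding to it, and a write to an orthogonal key leaves the first association untouched. Gated variants modify how much of the old association is erased and how much of the new value is written.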
Positional vs. Symbolic Attention Heads: Learning Dynamics, RoPE Geometry, and Length Generalization
Researchers train a decoder-only Transformer (GPT-J) on two structurally equivalent multi-hop reasoning tasks to study how attention heads specialize into positional or symbolic roles during learning. They find that successful task learning correlates with the emergence of 'pure' heads—exclusively positional or symbolic—and provide theoretical constructions showing how single-layer RoPE-based attention realizes these functions geometrically. A novel 'discrepancy' metric formalizes the robustness difference between the two head types, with symbolic mechanisms shown to extrapolate more reliably to longer sequences than positional ones. The findings have implications for understanding length generalization failures in RoPE-based models.
CARV: Compute-Aware Variance Reduction for Diffusion Teacher Gradient Estimation
CARV is a hierarchical Monte Carlo estimation framework that reduces gradient variance when using frozen pretrained diffusion models as teachers in downstream pipelines such as text-to-3D distillation and data attribution. The approach amortizes expensive upstream computation (rendering, simulation, encoding) over cheap diffusion-noise resamples, augmented by timestep importance sampling and stratified-inverse-CDF construction. In text-to-3D experiments, CARV delivers 2–3× effective compute multipliers; in single-step distillation, it cuts gradient variance by an order of magnitude but does not improve FID, revealing that MC variance is not the bottleneck in that regime.
STARE: Token-level advantage reweighting to prevent entropy collapse in GRPO-style RL training
Researchers introduce STARE, a method addressing policy entropy collapse in GRPO-style reinforcement learning from verifiable rewards (RLVR) for LLM post-training. Through first-order gradient analysis, they identify a token-level credit assignment mismatch and propose selectively reweighting advantages for entropy-critical tokens using batch-internal surprisal quantiles plus a closed-loop entropy gate. Evaluated across 1.5B–32B models on short/long chain-of-thought and multi-turn tool use tasks, STARE outperforms DAPO and other baselines by 4–8% on AIME24/25 while sustaining stable training over thousands of steps.
Dynamic short convolutions yield 1.33–1.60× compute advantage over standard Transformers
A new arXiv preprint introduces dynamic short convolutions as an architectural primitive for Transformers, using input-dependent filters to combine locality bias with increased expressivity. Experiments across 150M–2B parameter language models show consistent perplexity improvements over standard Transformers and static convolution variants, with scaling-law fits indicating a 1.33× compute advantage when applied to key/query/value vectors and 1.60× when added after every linear layer. The technique also improves linear RNNs (Mamba-2, Gated DeltaNet) and mixture-of-experts architectures, with custom Triton kernels making training practical.
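As a rough illustration of what "input-dependent filters" means for a short causal convolution, here is a single-channel sketch. The filter generator used here (each tap is `base[j] + gain[j] * x[t]`) is a hypothetical stand-in chosen for simplicity. The preprint's actual parameterization and its Triton kernels are not described in the summary above.

```python
def dynamic_short_conv(x, base, gain):
    """Causal short convolution whose taps depend on the current input.

    x: list of floats (one channel); base and gain: lists of length K.
    Hypothetical filter generator: w_t[j] = base[j] + gain[j] * x[t].
    Output: y[t] = sum_j w_t[j] * x[t - j], treating x[t - j] as 0 when t - j < 0.
    """
    K = len(base)
    y = []
    for t in range(len(x)):
        # Build this position's filter from the current input value.
        w = [base[j] + gain[j] * x[t] for j in range(K)]
        # Apply it to the current and previous K - 1 inputs only (causal).
        y.append(sum(w[j] * x[t - j] for j in range(K) if t - j >= 0))
    return y


x = [1.0, 2.0, 3.0]
static_out = dynamic_short_conv(x, base=[0.5, 0.5], gain=[0.0, 0.0])
dynamic_out = dynamic_short_conv(x, base=[0.5, 0.5], gain=[1.0, 0.0])
```

Setting `gain` to zeros recovers an ordinary static short convolution, which makes the difference between the two variants easy to see: in the dynamic case the same window of inputs is weighted differently at each position.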
Reroute: Training-free recoverable visual token routing for vision-language models
A new arXiv preprint proposes Reroute, a training-free plug-in that replaces the standard rank-and-remove visual token pruning paradigm in VLMs with a recoverable routing mechanism. Instead of permanently discarding low-ranked tokens, Reroute defers them to re-enter the candidate pool at later decoder stages, addressing the problem that token importance shifts across decoder depth. Evaluated on LLaVA-1.5 and Qwen backbones augmented with FastV, PDrop, and Nüwa pruning methods, Reroute improves grounding performance under aggressive token reduction without sacrificing general VQA accuracy. The approach preserves the theoretical compute and KV-cache budget of the underlying pruning method.
VaSE: Value-Aware Stochastic KV Cache Eviction improves reasoning model efficiency
A new arXiv preprint introduces Value-aware Stochastic KV Cache Eviction (VaSE), a training-free method for compressing KV caches in long-chain-of-thought reasoning models. The authors identify two key failure modes in prior eviction approaches (catastrophic repetition loops caused by evicting high-magnitude value states, and low cache diversity) and address both with targeted protections and stochastic eviction. On six reasoning tasks with Qwen3 models at 4× compression, VaSE outperforms the current best selection-based sparse attention method and exceeds the strongest eviction baseline by over 4%, while supporting FlashAttention2 and maintaining a static memory footprint.
Introducing RWKV - An RNN with the advantages of a transformer
A Hugging Face blog post introduces RWKV, a recurrent neural network architecture designed to combine the parallelizable training of transformers with the efficient linear-time inference of RNNs. The model avoids the quadratic attention bottleneck of standard transformers while maintaining competitive performance. RWKV represents an alternative architectural direction to the dominant transformer paradigm for language modeling.

