Entity · benchmark

RULER

benchmark · active · ruler-7969fe50 · 7 events · first seen May 22, 2026

Aliases: RULER

Co-occurring entities

More like this (12)

RULER-CWE · RULER QA-2 · QUBRIC · RASER · RMM · ULLER · Delta Rule · ERUnderstand · RL² · RACES · RiVER · RLOO

Recent events (7)

6 · arXiv · cs.CL · 3d ago

PIVOT: Training-free sparse attention indexer cuts DeepSeek-V3.2 latency by up to 1.6x

PIVOT (Proxy Indexing Via One full-prefix Traversal) is a training-free drop-in replacement for the DeepSeek Sparse Attention (DSA) indexer. Because every query scans its full prefix, the indexer's scan cost grows as O(L²) in the context length L; PIVOT reduces it by grouping nearby queries and sharing a single prefix scan across each group. Two variants (PIVOT-Reuse and PIVOT-Refine) trade speed for fidelity, with PIVOT-Refine matching dense indexer accuracy. Evaluated on DeepSeek-V3.2 and GLM-5.1 across LongBench and RULER, PIVOT accelerates the indexer by up to 4x and reduces end-to-end latency by up to 1.6x at long context.

Long Context Evolution · Inference Economics · DeepSeek V4 · GLM-5.1 · DeepSeek Sparse Attention · +3 more
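
The gap between the 4x indexer speedup and the 1.6x end-to-end figure reported for PIVOT is what Amdahl's law would predict when only part of the runtime is accelerated. The sketch below is a back-of-envelope check, not a calculation from the paper: the function names are hypothetical, and it assumes both "up to" figures come from the same setting, which the summary does not confirm.

```python
def overall_speedup(fraction, component_speedup):
    """Amdahl's law: end-to-end speedup when a `fraction` of the runtime
    is accelerated by `component_speedup` and the rest is unchanged."""
    return 1.0 / ((1.0 - fraction) + fraction / component_speedup)


def implied_fraction(overall, component_speedup):
    """Invert Amdahl's law: the runtime fraction the accelerated component
    must have occupied to produce the observed `overall` speedup."""
    return (1.0 - 1.0 / overall) / (1.0 - 1.0 / component_speedup)


# Assumption: the 4x and 1.6x figures describe the same configuration.
indexer_share = implied_fraction(overall=1.6, component_speedup=4.0)
print(f"implied indexer share of latency: {indexer_share:.2f}")
```

Under that assumption the indexer would account for roughly half of end-to-end latency at long context, which fits the summary's framing of the indexer scan as a dominant cost.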

6 · arXiv · cs.CL · Jul 15, 2026

JoLT: Near-lossless KV cache compression via joint Tucker decomposition and JL-residual allocation

Researchers introduce JoLT, a KV cache compression method that treats the cache as a third-order tensor and applies a partial Tucker decomposition on the token and feature axes, then recovers truncation error with a Johnson-Lindenstrauss rotated low-bit residual. A Lagrangian dual jointly allocates Tucker ranks and residual bit-widths per layer group under a single byte budget. The method achieves 2-3x near-lossless compression on Mistral-7B-v0.3 and LLaMA-2-13B, with Frobenius reconstruction error roughly an order of magnitude below cross-layer SVD and 4-bit quantization. A randomized-SVD variant, FlashJoLT, delivers 5-13x compression-time speedup at matched quality.

Long Context Evolution · Inference Economics · FlashJoLT · Mistral-7B-v0.3 · Tucker decomposition · +4 more

6 · arXiv · cs.AI · Jul 3, 2026

HOLA adds hippocampal exact KV cache to linear attention, closing gap with full-attention Transformers

HOLA (Hippocampal Linear Attention) augments linear-attention and state-space models with a bounded exact key-value cache inspired by Complementary Learning Systems theory, addressing the lossy compression that causes earlier facts to be overwritten in recurrent states. The cache uses a residual-based eviction criterion (large beta * ||e||) without a learned eviction module, and a decoupled RMSNorm-gamma read for sharp retrieval. At 340M parameters trained on 15B SlimPajama tokens, HOLA reduces Wikitext perplexity from 27.32 to 22.92, below that of a full-attention Transformer++ baseline, and shows strong needle-in-a-haystack recall out to 32k tokens despite training at only 2k. The work is directly relevant to the open question of whether linear-attention models can match full-attention models on long-context retrieval tasks.

Long Context Evolution · WikiText-2 · LAMBADA · SlimPajama · +3 more
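
The HOLA summary does not spell out how the beta * ||e|| score is used. One plausible reading is that entries with a large residual score are the ones worth keeping exactly, so the lowest-scoring entry is evicted when the cache is full. The sketch below illustrates only that reading with a generic bounded priority cache; the class, method names, and tie-breaking are hypothetical and are not taken from the paper.

```python
import heapq
import itertools
import math


def residual_score(beta, residual):
    """Score an entry as beta * ||e|| (assumed reading of the criterion)."""
    return beta * math.sqrt(sum(x * x for x in residual))


class BoundedExactCache:
    """Keeps at most `capacity` (key, value) pairs, preferring high scores."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._heap = []  # min-heap of (score, tiebreak, key, value)
        self._tiebreak = itertools.count()  # avoids comparing keys/values

    def offer(self, key, value, score):
        """Try to store an entry; return the (key, value) that was dropped,
        or None if nothing had to be dropped."""
        entry = (score, next(self._tiebreak), key, value)
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, entry)
            return None
        if score <= self._heap[0][0]:
            return (key, value)  # new entry scores lowest, so it is dropped
        evicted = heapq.heapreplace(self._heap, entry)
        return (evicted[2], evicted[3])

    def keys(self):
        return sorted(entry[2] for entry in self._heap)


cache = BoundedExactCache(capacity=2)
cache.offer("a", "fact A", residual_score(0.5, [0.1, 0.1]))
cache.offer("b", "fact B", residual_score(0.9, [1.0, 2.0]))
dropped = cache.offer("c", "fact C", residual_score(0.7, [1.0, 0.0]))
print(dropped, cache.keys())
```

A min-heap keeps the lowest-scoring entry at the root, so both the admission check and the eviction cost O(log n) without any learned module, which matches the spirit of the "no learned eviction module" claim.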

6 · arXiv · cs.CL · Jun 26, 2026

CARVE: Content-aware gating for linear attention recurrent models improves efficiency and quality over GDN-2

CARVE (Content-Aware Recurrent with Value Efficiency) is a new linear attention architecture that addresses three coupled defects in the GDN-2 delta-rule architecture by restricting erasure to the key axis rather than the value axis. This design choice is proven necessary and sufficient for the WY-form triangular chunk solver, which gives training throughput competitive with Transformers. At 1.3B parameters trained on 100B tokens, CARVE achieves lower perplexity than GDN-2, leads recurrent baselines on nine commonsense reasoning benchmarks, and sets a new state of the art on RULER retrieval probes, while using 13% less peak memory and 19% fewer parameters at 0.4% throughput overhead.

Training Infrastructure · Long Context Evolution · WikiText-2 · CARVE · GDN-2 · +2 more

7 · The Batch · Jun 19, 2026

Nvidia Nemotron 3 Ultra: hybrid Mamba-transformer open-weights model targeting agentic workloads

Nvidia released Nemotron 3 Ultra, a 550B-parameter (55B active) hybrid Mamba-transformer mixture-of-experts model with a 1M-token context window, publishing weights, training data, and RL environments under an open license. The model ranks as the highest-scoring U.S. open-weights model on the Artificial Analysis Intelligence Index (47.7-48.2) and is approximately three times faster than comparable open-weights rivals, though it trails leading Chinese models such as Kimi K2.6 and DeepSeek V4 Pro on intelligence benchmarks. Nvidia used a novel Multi-Teacher On-Policy Distillation approach with 10+ specialized teacher models and trained using NVFP4 quantization. The release is strategically motivated by Nvidia's interest in a healthy open-weights ecosystem that drives AI semiconductor adoption.

Frontier Model Releases · Open Weights Progress · Mamba · IFBench · Artificial Analysis Intelligence Index · +17 more

7 · The Batch · Jun 2, 2026

Nvidia releases Nemotron 3 Super 120B-A12B open-weights model with hybrid Mamba-2/MoE architecture

Nvidia released Nemotron 3 Super 120B-A12B, an open-weights LLM with a hybrid Mamba-2/transformer/MoE architecture that activates only 12B parameters per token and supports a context of up to 1 million tokens. It is claimed to have the fastest inference speed in its size class, at 442 tokens/second, and leads open-weights models on the PinchBench agentic task evaluation, outperforming larger models including Kimi K2.5 (1T parameters). Nvidia is releasing weights, training data, and recipes under a permissive commercial license, and plans a $26B five-year investment in open-weights models, framed partly as a strategic response to Chinese labs building capable open-weights models on non-Nvidia hardware.

Frontier Model Releases · Open Weights Progress · Nemotron 3 Super 120B-A12B · Nemotron 3 Ultra-500B-A50B · PivotRL · +18 more

7 · arXiv · cs.AI · May 22, 2026

Gated DeltaNet-2: Decoupling Erase and Write Gates in Linear Attention

Gated DeltaNet-2 is a new linear attention architecture from NVIDIA Labs that separates the erase and write operations in the delta-rule update into independent channel-wise gates, generalizing both Gated DeltaNet and Kimi Delta Attention (KDA). The model introduces a chunkwise WY algorithm with channel-wise decay and a gate-aware backward pass for efficient parallel training. At 1.3B parameters trained on 100B FineWeb-Edu tokens, it outperforms Mamba-2, Gated DeltaNet, KDA, and Mamba-3 variants on language modeling, commonsense reasoning, and long-context RULER needle-in-a-haystack retrieval benchmarks. Code is publicly released via NVlabs on GitHub.

Training Infrastructure · Long Context Evolution · NVIDIA Labs · Mamba · WY Algorithm · +7 more
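
For context on the delta-rule update that the Gated DeltaNet-2 entry refers to, here is a minimal pure-Python sketch of the classic, ungated delta rule, S <- S - beta * (S k - v) k^T. It is not the Gated DeltaNet-2 update: the separate channel-wise erase and write gates described in the summary are omitted, and the function names are hypothetical.

```python
def read(S, q):
    """Query the memory: return S @ q for a d_v x d_k matrix S (list of rows)."""
    return [sum(s_ij * q_j for s_ij, q_j in zip(row, q)) for row in S]


def delta_rule_step(S, k, v, beta):
    """One classic delta-rule update: S <- S - beta * (S k - v) k^T.

    The old value stored under key k is partially erased and the new value v
    is written in its place, in a single rank-one correction.
    """
    # Error between what the memory currently returns for k and the target v.
    err = [p - t for p, t in zip(read(S, k), v)]
    # Rank-one correction along the key direction.
    return [
        [s_ij - beta * e_i * k_j for s_ij, k_j in zip(row, k)]
        for row, e_i in zip(S, err)
    ]


# Toy example: a unit-norm key with beta = 1 overwrites the old value exactly.
S = [[0.0, 0.0], [0.0, 0.0]]
key = [1.0, 0.0]
S = delta_rule_step(S, key, [3.0, 4.0], beta=1.0)
S = delta_rule_step(S, key, [5.0, 6.0], beta=1.0)
print(read(S, key))  # -> [5.0, 6.0]
```

In this classic form a single scalar beta controls both how much of the old value is erased and how much of the new value is written; per the summary, Gated DeltaNet-2 decouples those two roles into independent channel-wise gates.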