VaSE: Value-Aware Stochastic KV Cache Eviction improves reasoning model efficiency
A new arXiv preprint introduces Value-aware Stochastic KV Cache Eviction (VaSE), a training-free method for compressing KV caches in long-chain-of-thought reasoning models. The authors identify two failure modes in prior eviction approaches: catastrophic repetition loops caused by evicting high-magnitude value states, and low cache diversity. They address both with targeted protections and stochastic eviction. On six reasoning tasks with Qwen3 models at 4x compression, VaSE outperforms the current best selection-based sparse attention method and exceeds the strongest eviction baseline by over 4%, while supporting FlashAttention2 and maintaining a static memory footprint.
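To make the two ideas in the summary concrete, here is a minimal sketch of a single eviction step. It is an illustration under stated assumptions, not the preprint's algorithm: the function name `select_tokens_to_keep`, the use of the L2 norm to find "high-magnitude" value states, the per-token `scores` input, and the weighted-sampling rule are all hypothetical choices made for this example.

```python
import math
import random


def select_tokens_to_keep(scores, values, budget, n_protected, rng):
    """Return sorted indices of cached tokens to keep (hypothetical sketch).

    scores:      non-negative importance per token (assumed to come from
                 some attention-based heuristic; not specified by the source)
    values:      one value vector (list of floats) per token
    budget:      fixed number of cache slots, so memory stays static
    n_protected: how many high-magnitude value states are never evicted
    rng:         a random.Random instance, passed in for reproducibility
    """
    num_tokens = len(scores)
    if num_tokens <= budget:
        return list(range(num_tokens))

    n_protected = min(n_protected, budget)

    # Step 1 (assumed protection rule): always keep the tokens whose value
    # vectors have the largest L2 norm.
    norms = [math.sqrt(sum(x * x for x in v)) for v in values]
    by_norm = sorted(range(num_tokens), key=lambda i: norms[i], reverse=True)
    protected = set(by_norm[:n_protected])

    # Step 2 (assumed stochastic rule): fill the remaining slots by weighted
    # sampling without replacement, using Efraimidis-Spirakis keys
    # u ** (1 / w). Larger keys win, so high-score tokens are favoured but
    # low-score tokens still have a chance, which adds diversity.
    eps = 1e-6
    candidates = [i for i in range(num_tokens) if i not in protected]
    keyed = [(rng.random() ** (1.0 / (scores[i] + eps)), i) for i in candidates]
    keyed.sort(reverse=True)
    sampled = [i for _, i in keyed[: budget - n_protected]]

    return sorted(protected | set(sampled))


if __name__ == "__main__":
    rng = random.Random(0)
    values = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(32)]
    values[5] = [100.0] * 4  # one token with an unusually large value state
    scores = [rng.random() for _ in range(32)]
    keep = select_tokens_to_keep(scores, values, budget=8, n_protected=2, rng=rng)
    print(len(keep), 5 in keep)
```

Because the function always returns exactly `budget` indices once the cache is over budget, the kept set has a fixed size, which mirrors the static memory footprint mentioned in the summary. A deterministic top-k rule would replace Step 2 with a plain sort on `scores`.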
Related events (8)
ReasonAlloc: Hierarchical KV Cache Budget Allocation for Long-CoT Reasoning Models
ReasonAlloc is a training-free framework that reframes decoding-time KV cache compression as a hierarchical budget allocation problem, operating at both layer-wise (offline) and head-wise (online) levels. The method identifies an architecture-driven pattern called the 'Reasoning Wave' to guide layer preallocation, then dynamically reallocates to information-rich heads during decoding. Evaluated on MATH-500 and AIME 2024 using DeepSeek-R1-Distill and AceReason models, it outperforms uniform-budget baselines (R-KV, SnapKV, Pyramid-RKV) especially at small budgets of 128–512 tokens, with negligible overhead.
Unlocking Longer Generation with Key-Value Cache Quantization
This Hugging Face blog post covers KV cache quantization as a technique to reduce memory consumption during LLM inference, enabling longer context generation without proportional VRAM increases. The post likely explains how quantizing the key-value cache (e.g., to INT8 or lower precision) trades minimal accuracy for significant memory savings. This is directly relevant to inference efficiency and long-context deployment patterns.
KVEraser: Learned KV cache editing for efficient localized context erasing in LLMs
KVEraser is a learned method for efficiently erasing specific spans from an LLM's KV cache without full recomputation of subsequent tokens. The approach replaces only the KV states of the erased interval with learned steering states, using a two-stage training pipeline of generic pre-training followed by task-specific fine-tuning. On contexts from 1K–32K tokens, KVEraser nearly matches full recomputation quality while incurring only 24% latency overhead versus a 17.6x increase for exact recomputation, with demonstrated generalization to long-document QA with harmful factual distractors.
COMPACT-VA: Planning-aligned token compression for long-context autonomous driving
Researchers introduce COMPACT-VA, a working memory framework using conditional VQ-VAE to compress extended temporal context in vision-action autonomous driving models. Compression is conditioned on historical trajectory and a learned planning intent derived from future trajectories during training, enabling end-to-end optimization without backbone modifications. On high-signal dynamic scenarios, the method achieves 68.3% success rate (>6% improvement) with 3.3x speedup and 2.7x memory reduction over uncompressed processing.
KV Cache from scratch in nanoVLM
This Hugging Face blog post walks through implementing a key-value (KV) cache from scratch within the nanoVLM framework, a minimal vision-language model codebase. The post serves as a technical tutorial explaining how KV caching works in transformer-based multimodal models and how to integrate it for inference efficiency. It targets practitioners seeking to understand the mechanics of KV caching in the context of VLMs rather than just using it as a black box.
Mastering Long Contexts in LLMs with KVPress
NVIDIA and Hugging Face present KVPress, a library for compressing the KV cache in large language models to enable more efficient long-context inference. The tool implements multiple KV cache compression ("pressing") algorithms that reduce memory footprint and latency without retraining models. KVPress is positioned as a practical toolkit for deploying LLMs in long-context scenarios where KV cache size becomes a bottleneck.
VideoMLA: Low-Rank Latent KV Cache for Minute-Scale Autoregressive Video Diffusion
VideoMLA applies Multi-Head Latent Attention (MLA) to causal video diffusion, replacing per-head keys and values with a shared low-rank content latent and decoupled 3D-RoPE positional key, achieving 92.7% reduction in per-token KV memory. The paper investigates why MLA works despite pretrained video attention not being low-rank (unlike the spectral assumption motivating MLA in LLMs), finding that the MLA bottleneck itself determines effective rank rather than the pretrained spectrum. On VBench, VideoMLA matches short-horizon baselines, achieves best overall score at long horizons, and delivers 1.23x throughput improvement on a single NVIDIA B200 GPU.
LCGuard: Adversarial Training Framework for Safe KV Cache Sharing in Multi-Agent LLM Systems
LCGuard introduces a framework for preventing sensitive information leakage when multi-agent LLM systems share KV caches as a latent communication channel. The approach formalizes leakage operationally via reconstruction: a shared cache artifact is deemed unsafe if an adversarial decoder can recover sensitive inputs from it. An adversarial training loop pits a reconstructor against LCGuard's representation-level transformations, which aim to preserve task-relevant semantics while suppressing recoverable sensitive content. Empirical results across multiple model families and multi-agent benchmarks show reduced reconstruction-based leakage and attack success rates with competitive task performance.

