Almanac
← Events
5Hugging Face Blog·1mo ago

Mastering Long Contexts in LLMs with KVPress

NVIDIA and Hugging Face present KVPress, a library for compressing the KV cache in large language models to enable more efficient long-context inference. The tool implements multiple KV cache compression ("pressing") algorithms that reduce memory footprint and latency without retraining models. KVPress is positioned as a practical toolkit for deploying LLMs in long-context scenarios where KV cache size becomes a bottleneck.

Related guides (3)

Related events (8)

7arXiv · cs.CL·11d ago·source ↗

Latent Context Language Models (LCLMs) achieve competitive encoder-decoder KV cache compression at scale

Researchers introduce Latent Context Language Models (LCLMs), a family of encoder-decoder compressors that map long token sequences to shorter latent embeddings consumed by a decoder, targeting the KV cache memory bottleneck in long-context inference. The authors conduct architecture search and continually pre-train 0.6B-encoder/4B-decoder models on over 350B tokens at compression ratios of 1:4, 1:8, and 1:16. LCLMs improve the Pareto frontier across general-task performance, compression speed, and peak memory, and are demonstrated as efficient backbones for long-horizon agents that can skim compressed context and expand relevant segments on demand. The work closes a previously noted gap between encoder-decoder approaches and KV cache compression methods on the accuracy-efficiency frontier.

5Hugging Face Blog·1mo ago·source ↗

Unlocking Longer Generation with Key-Value Cache Quantization

This Hugging Face blog post covers KV cache quantization as a technique to reduce memory consumption during LLM inference, enabling longer context generation without proportional VRAM increases. The post likely explains how quantizing the key-value cache (e.g., to INT8 or lower precision) trades minimal accuracy for significant memory savings. This is directly relevant to inference efficiency and long-context deployment patterns.

4Github Trending·8d ago·source ↗

LMCache: KV cache layer for LLM inference acceleration

LMCache is an open-source Python library providing a KV cache layer designed to accelerate LLM inference. The project has accumulated 8,613 GitHub stars with modest daily growth (+17). It targets inference efficiency by offloading or sharing KV cache state across requests.

4Hugging Face Blog·1mo ago·source ↗

KV Cache from scratch in nanoVLM

This Hugging Face blog post walks through implementing a key-value (KV) cache from scratch within the nanoVLM framework, a minimal vision-language model codebase. The post serves as a technical tutorial explaining how KV caching works in transformer-based multimodal models and how to integrate it for inference efficiency. It targets practitioners seeking to understand the mechanics of KV caching in the context of VLMs rather than just using it as a black box.

5arXiv · cs.CL·4d ago·source ↗

KVEraser: Learned KV cache editing for efficient localized context erasing in LLMs

KVEraser is a learned method for efficiently erasing specific spans from an LLM's KV cache without full recomputation of subsequent tokens. The approach replaces only the KV states of the erased interval with learned steering states, using a two-stage training pipeline of generic pre-training followed by task-specific fine-tuning. On contexts from 1K–32K tokens, KVEraser nearly matches full recomputation quality while incurring only 24% latency overhead versus a 17.6x increase for exact recomputation, with demonstrated generalization to long-document QA with harmful factual distractors.

4Hugging Face Blog·1mo ago·source ↗

Optimizing your LLM in production

A Hugging Face blog post covering practical techniques for optimizing large language models in production environments. The post likely addresses inference efficiency methods such as quantization, batching, caching, and hardware utilization strategies. It serves as a practitioner-oriented guide for deploying LLMs at scale.

4arXiv · cs.AI·12d ago·source ↗

COMPACT-VA: Planning-aligned token compression for long-context autonomous driving

Researchers introduce COMPACT-VA, a working memory framework using conditional VQ-VAE to compress extended temporal context in vision-action autonomous driving models. Compression is conditioned on historical trajectory and a learned planning intent derived from future trajectories during training, enabling end-to-end optimization without backbone modifications. On high-signal dynamic scenarios, the method achieves 68.3% success rate (>6% improvement) with 3.3x speedup and 2.7x memory reduction over uncompressed processing.

6arXiv · cs.CL·17d ago·source ↗

VaSE: Value-Aware Stochastic KV Cache Eviction improves reasoning model efficiency

A new arXiv preprint introduces Value-aware Stochastic KV Cache Eviction (VaSE), a training-free method for compressing KV caches in long-chain-of-thought reasoning models. The authors identify two key failure modes in prior eviction approaches — catastrophic repetition loops caused by evicting high-magnitude value states, and low cache diversity — and address both with targeted protections and stochastic eviction. On six reasoning tasks with Qwen3 models at 4x compression, VaSE outperforms the current best selection-based sparse attention method and exceeds the strongest eviction baseline by over 4%, while supporting FlashAttention2 and maintaining a static memory footprint.