Execution-State Capsules: Graph-bound checkpoint/restore for low-latency on-device LLM serving
Researchers introduce execution-state capsules, a checkpoint-and-restore mechanism that snapshots the complete execution state (KV cache, recurrent state, convolution state, MTP state, and metadata) at graph boundaries rather than managing only KV fragments. The FlashRT runtime implements the mechanism on NVIDIA CUDA with sub-millisecond GPU-resident snapshot and restore, achieving TTFT speedups of 3.9x at 2k tokens and 27x at 16k tokens over cold prefill on an RTX 5090. The work targets low-latency, small-batch, on-device physical-AI scenarios (interactive agents, speech systems, robot policies) where branching, rollback, and re-entry are common. The authors position it as a complement to high-throughput KV-cache serving, not a replacement for it.
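To make the idea concrete, here is a minimal CPU-side sketch of whole-state snapshot and restore. It is not FlashRT's API: `ExecutionState`, `CapsuleStore`, and every field name are hypothetical, and plain Python lists stand in for GPU-resident buffers. The point it illustrates is that a capsule captures every piece of state a decode step depends on, so a rollback does not need a fresh prefill.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class ExecutionState:
    """Hypothetical container for everything a decode step reads or writes."""
    kv_cache: list = field(default_factory=list)         # per-token (key, value) pairs
    recurrent_state: list = field(default_factory=list)  # e.g. linear-attention state
    conv_state: list = field(default_factory=list)       # e.g. short-convolution window
    mtp_state: list = field(default_factory=list)        # multi-token-prediction state
    metadata: dict = field(default_factory=dict)         # positions, sampler state, ...


class CapsuleStore:
    """Keeps whole-state snapshots keyed by a caller-chosen name (assumption)."""

    def __init__(self):
        self._capsules = {}

    def snapshot(self, name, state):
        # Deep copy so later decode steps cannot mutate the stored capsule.
        self._capsules[name] = copy.deepcopy(state)

    def restore(self, name):
        # Return a fresh copy so one capsule can seed many branches.
        return copy.deepcopy(self._capsules[name])


# Simulate a two-token prefill.
state = ExecutionState(metadata={"position": 0})
for tok in ["sys", "prompt"]:
    state.kv_cache.append((f"k_{tok}", f"v_{tok}"))
    state.metadata["position"] += 1

store = CapsuleStore()
store.snapshot("after_prompt", state)

# A speculative branch runs one step ahead and is then abandoned.
state.kv_cache.append(("k_branch", "v_branch"))
state.metadata["position"] += 1

# Roll back to the capsule instead of re-running prefill.
state = store.restore("after_prompt")
print(len(state.kv_cache), state.metadata["position"])
```

A real runtime would copy device buffers rather than deep-copy Python objects; the sketch only shows the bookkeeping.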
Related events (8)
DeltaBox: Millisecond-Level Sandbox Checkpoint/Rollback for Stateful AI Agents
DeltaBox introduces a new OS-level abstraction called DeltaState that enables change-based (delta) checkpoint and rollback for AI agent sandboxes, rather than duplicating full state on each operation. Two co-designed OS mechanisms—DeltaFS for filesystem state and DeltaCR for process state—reduce checkpoint latency to ~14ms and rollback to ~5ms, orders of magnitude faster than existing approaches. Evaluations on SWE-bench and RL micro-benchmarks demonstrate that agents can explore substantially more nodes under fixed time budgets, directly enabling deeper test-time tree search and large-scale RL fan-outs.
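The difference between change-based and full-copy checkpointing can be sketched with a toy in-memory store. This is only an illustration of the delta principle, not DeltaFS or DeltaCR: the `DeltaStore` class and its methods are hypothetical, and a dictionary stands in for filesystem and process state. A checkpoint here copies nothing; it opens a new undo log, and rollback replays the logs in reverse.

```python
class DeltaStore:
    """Key-value state with change-based checkpoints (illustrative only)."""

    _MISSING = object()  # marks keys that did not exist before a write

    def __init__(self):
        self._data = {}
        self._undo_logs = [{}]  # one undo log per checkpoint interval

    def set(self, key, value):
        log = self._undo_logs[-1]
        if key not in log:
            # Record the old value only on the first write since the last checkpoint.
            log[key] = self._data.get(key, self._MISSING)
        self._data[key] = value

    def get(self, key):
        return self._data[key]

    def items(self):
        return dict(self._data)

    def checkpoint(self):
        # Constant time: start a new undo log, copy no state.
        self._undo_logs.append({})
        return len(self._undo_logs) - 1

    def rollback(self, checkpoint_id):
        # Undo logs newest-first until the state matches the checkpoint.
        while len(self._undo_logs) > checkpoint_id:
            log = self._undo_logs.pop()
            for key, old in log.items():
                if old is self._MISSING:
                    self._data.pop(key, None)
                else:
                    self._data[key] = old
        self._undo_logs.append({})  # reopen the interval after the checkpoint


store = DeltaStore()
store.set("main.py", "v1")
cp = store.checkpoint()
store.set("main.py", "v2")   # agent edits a file
store.set("test.py", "new")  # agent creates a file
store.rollback(cp)           # abandon this search branch
print(store.get("main.py"), "test.py" in store.items())
```

Rollback cost scales with the number of keys changed since the checkpoint, not with total state size, which is the property that makes deep tree search affordable.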
VaSE: Value-Aware Stochastic KV Cache Eviction improves reasoning model efficiency
A new arXiv preprint introduces Value-aware Stochastic KV Cache Eviction (VaSE), a training-free method for compressing KV caches in long-chain-of-thought reasoning models. The authors identify two key failure modes in prior eviction approaches — catastrophic repetition loops caused by evicting high-magnitude value states, and low cache diversity — and address both with targeted protections and stochastic eviction. On six reasoning tasks with Qwen3 models at 4x compression, VaSE outperforms the current best selection-based sparse attention method and exceeds the strongest eviction baseline by over 4%, while supporting FlashAttention2 and maintaining a static memory footprint.
Unlocking Longer Generation with Key-Value Cache Quantization
This Hugging Face blog post covers KV cache quantization as a technique to reduce memory consumption during LLM inference, enabling longer context generation without proportional VRAM increases. The post likely explains how quantizing the key-value cache (e.g., to INT8 or lower precision) trades minimal accuracy for significant memory savings. This is directly relevant to inference efficiency and long-context deployment patterns.
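As a rough illustration of the trade-off, the sketch below applies symmetric INT8 quantization to a single cached key vector. It is not the code from the blog post: the function names and the per-vector scaling scheme are assumptions chosen for brevity. Each value is stored as an 8-bit integer code plus one shared scale, which is about half the storage of 16-bit floats, at the cost of a bounded rounding error.

```python
def quantize_int8(values):
    """Symmetric per-vector INT8 quantization: returns (integer codes, scale)."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest code and clamp to the signed 8-bit range used here.
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize_int8(codes, scale):
    """Recover approximate floating-point values from the codes."""
    return [c * scale for c in codes]


key_vector = [0.12, -1.50, 0.73, 0.02]  # toy cached key, values are made up
codes, scale = quantize_int8(key_vector)
restored = dequantize_int8(codes, scale)

# The rounding error per element is at most half a quantization step.
max_error = max(abs(a - b) for a, b in zip(key_vector, restored))
print(codes, max_error <= scale / 2 + 1e-12)
```

Lower-precision formats follow the same pattern with a smaller code range and a correspondingly larger step.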
QK-Restore: Fixing long-context recall degradation caused by CoT fine-tuning in hybrid LLMs
Researchers find that chain-of-thought supervised fine-tuning systematically degrades long-context recall in hybrid linear-attention models (HypeNet, Jet-Nemotron), with Needle-In-A-Haystack performance collapsing dramatically—e.g., HypeNet-9B dropping from 67.2% to 9.4% at 256K context. The root cause is identified as CoT-SFT biasing attention gradients toward short-range patterns, corrupting the query-key projections responsible for long-range routing. The paper proposes QK-Restore, a training-free fix that restores only W_Q and W_K from the pre-SFT checkpoint, recovering long-context capability while preserving reasoning gains.
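The restore step described above amounts to a selective weight merge, sketched below on plain dictionaries. The function name `qk_restore` and the parameter-name suffixes `q_proj.weight` and `k_proj.weight` are assumptions for illustration; the actual parameter names depend on the model implementation, and real checkpoints hold tensors rather than lists.

```python
def qk_restore(sft_weights, pre_sft_weights,
               restore_suffixes=("q_proj.weight", "k_proj.weight")):
    """Return the fine-tuned weights with query/key projections reverted.

    Only parameters whose names end with one of `restore_suffixes`
    (hypothetical naming) are taken from the pre-SFT checkpoint.
    """
    merged = dict(sft_weights)
    for name, tensor in pre_sft_weights.items():
        if name in merged and name.endswith(restore_suffixes):
            merged[name] = tensor
    return merged


# Toy checkpoints: the values are placeholders, not real weights.
pre_sft = {
    "layers.0.attn.q_proj.weight": [1.0, 1.0],
    "layers.0.attn.k_proj.weight": [2.0, 2.0],
    "layers.0.attn.v_proj.weight": [3.0, 3.0],
    "layers.0.mlp.up_proj.weight": [4.0, 4.0],
}
post_sft = {name: [x + 0.5 for x in w] for name, w in pre_sft.items()}

restored = qk_restore(post_sft, pre_sft)
for name, w in restored.items():
    print(name, w)
```

Everything other than the query and key projections keeps its fine-tuned value, which is how the method aims to preserve the reasoning gains.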
COMPACT-VA: Planning-aligned token compression for long-context autonomous driving
Researchers introduce COMPACT-VA, a working memory framework using conditional VQ-VAE to compress extended temporal context in vision-action autonomous driving models. Compression is conditioned on historical trajectory and a learned planning intent derived from future trajectories during training, enabling end-to-end optimization without backbone modifications. On high-signal dynamic scenarios, the method achieves 68.3% success rate (>6% improvement) with 3.3x speedup and 2.7x memory reduction over uncompressed processing.
MemOS: Self-Evolving Memory OS for LLM Agents with Hybrid Retrieval and Token Savings
MemOS is an open-source TypeScript project providing a memory operating system layer for LLM and AI agents, featuring ultra-persistent memory, hybrid retrieval, and cross-task skill reuse. The project claims 35.24% token savings through its memory management approach. It has accumulated 9,329 GitHub stars, with moderate daily growth (+67). The system targets agent memory persistence and efficiency as a foundational infrastructure component.
KV Cache from scratch in nanoVLM
This Hugging Face blog post walks through implementing a key-value (KV) cache from scratch within the nanoVLM framework, a minimal vision-language model codebase. The post serves as a technical tutorial explaining how KV caching works in transformer-based multimodal models and how to integrate it for inference efficiency. It targets practitioners seeking to understand the mechanics of KV caching in the context of VLMs rather than just using it as a black box.
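For readers who want the core mechanic in a few lines, here is a single-head sketch in plain Python. It is not nanoVLM's code: `KVCache`, `attend`, and the toy `project` function (a fixed scaling standing in for learned projection matrices) are assumptions. The sketch shows that caching keys and values lets each decode step compute projections for the new token only, while producing the same attention output as recomputing everything.

```python
import math


def attend(query, keys, values):
    """Scaled dot-product attention for one query over all cached positions."""
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(len(query))
              for key in keys]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]


def project(token_vec, scale):
    """Toy stand-in for a learned projection (assumption: simple scaling)."""
    return [scale * x for x in token_vec]


class KVCache:
    """Stores keys and values of past tokens so they are computed once."""

    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, key, value):
        self.keys.append(key)
        self.values.append(value)


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Cached decoding: one key/value projection per new token.
cache = KVCache()
for tok in tokens:
    cache.append(project(tok, 0.5), project(tok, 2.0))
    cached_out = attend(tok, cache.keys, cache.values)

# Reference: recompute every key and value for the final step.
all_keys = [project(t, 0.5) for t in tokens]
all_values = [project(t, 2.0) for t in tokens]
full_out = attend(tokens[-1], all_keys, all_values)

print(all(math.isclose(a, b) for a, b in zip(cached_out, full_out)))
```

Without the cache, step t recomputes t key and value projections; with it, each step computes one of each, at the cost of memory that grows with sequence length.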
Express: Efficient causal attention approximation with formal guarantees and FlashAttention 2 speedups
A new tool called Express converts non-causal attention approximations into causal ones with matching theoretical guarantees, achieving log^(3/2)(n)/s approximation error with O(s) memory. Combined with the Thinformer approximation and an I/O-aware Triton implementation, it demonstrates substantial speedups over FlashAttention 2. The work targets four practical bottlenecks: long-context prefill, KV cache compression, and both memory- and compute-constrained long-form decoding.

