TokenPilot: Dual-granularity context management cuts LLM agent inference costs by up to 87%
TokenPilot is a cache-efficient context management framework for LLM agents that addresses the trade-off between token sparsity and prompt cache continuity. It combines Ingestion-Aware Compaction (global prefix stabilization) with Lifecycle-Aware Eviction (local segment offloading) to reduce inference costs by 56–87% across benchmarks while maintaining competitive task performance. The system is evaluated on PinchBench and Claw-Eval and has been integrated into the open-source LightMem2 library.
Related guides (2)
Related events (8)
CLP: Lightweight collocation-length predictor achieves zero-loss multi-token inference speedup
Researchers propose CLP (Collocation-Length Predictor), a span-level decision layer for accelerating LLM inference via multi-token prediction without quality degradation. The key insight is 'Backbone-as-Architect': the backbone LM head always generates the first token while MTP heads handle only subsequent tokens, eliminating head-backbone competition that causes repetitive outputs in prior methods. CLP uses a single linear layer (~4.6K–7.7K parameters) versus 1M-parameter gate networks in prior work, achieving 1.14x–1.29x speedup on Qwen2.5 models with near-zero repetition ratio. The paper also establishes that shorter prediction horizons improve MTP head accuracy on larger models, offering a scaling-aware design principle.
Latent Context Language Models (LCLMs) achieve competitive encoder-decoder KV cache compression at scale
Researchers introduce Latent Context Language Models (LCLMs), a family of encoder-decoder compressors that map long token sequences to shorter latent embeddings consumed by a decoder, targeting the KV cache memory bottleneck in long-context inference. The authors conduct architecture search and continually pre-train 0.6B-encoder/4B-decoder models on over 350B tokens at compression ratios of 1:4, 1:8, and 1:16. LCLMs improve the Pareto frontier across general-task performance, compression speed, and peak memory, and are demonstrated as efficient backbones for long-horizon agents that can skim compressed context and expand relevant segments on demand. The work closes a previously noted gap between encoder-decoder approaches and KV cache compression methods on the accuracy-efficiency frontier.
Pair-In, Pair-Out (PIPO): Unified Latent Compression and Multi-Token Prediction for Efficient LLM Inference
PIPO is a new inference efficiency framework that unifies input-side latent compression with output-side multi-token prediction (MTP) by treating them as mirror operations: a compressor folds two input tokens into one latent, while an MTP head unfolds one hidden state into an additional output token. To avoid the expensive verifier pass typically required by speculative decoding, PIPO trains a lightweight confidence head using On-Policy Distillation (OPD), which naturally aligns with rejection-sampling criteria. Experiments on Qwen3.5-4B and 9B backbones across AIME 2025, GPQA-Diamond, LiveCodeBench v6, and LongBench v2 show up to 2.64× first-token-latency speedup and +7.15 pass@4 improvement over regular decoding.
MemOS: Self-Evolving Memory OS for LLM Agents with Hybrid Retrieval and Token Savings
MemOS is an open-source TypeScript project providing a memory operating system layer for LLM and AI agents, featuring ultra-persistent memory, hybrid retrieval, and cross-task skill reuse. The project claims 35.24% token savings through its memory management approach. It has accumulated 9,329 GitHub stars with moderate daily momentum (+67). The system targets agent memory persistence and efficiency as a foundational infrastructure component.
LongMINT: Benchmark for Evaluating Memory Under Multi-Target Interference in Long-Horizon Agent Systems
LongMINT is a new benchmark designed to evaluate memory-augmented agents in realistic long-horizon settings where information is repeatedly updated and interferes across memories. It contains 15.6k QA pairs over contexts averaging 138.8k tokens (up to 1.8M tokens), spanning domains including state tracking, multi-turn dialogue, Wikipedia revisions, and GitHub commits. Evaluation of 7 representative systems—including vanilla long-context LLMs, RAG, and memory-augmented agent frameworks—reveals consistently low average accuracy of 27.9%, with performance particularly degraded on multi-target aggregation tasks and when earlier facts are revised by subsequent context. The analysis identifies retrieval and memory construction as the primary bottlenecks.
Mem-π: Adaptive Memory for LLM Agents via On-Demand Generation and Decoupled RL
Mem-π introduces a framework where a dedicated language or vision-language model generates context-specific guidance for LLM agents on demand, rather than retrieving static entries from episodic memory banks. The system is trained with a decision-content decoupled reinforcement learning objective that jointly learns when to generate guidance and what to generate, enabling abstention when generation would not help. Evaluated across web navigation, terminal-based tool use, and text-based embodied interaction benchmarks, Mem-π achieves over 30% relative improvement on web navigation tasks compared to retrieval-based and prior RL-optimized memory baselines.
ContextRL: Context-aware reinforcement learning improves grounding in agentic and multimodal LLMs
Researchers introduce ContextRL, a reinforcement learning method that trains LLMs to select the context that supports a given query-answer pair from two highly similar candidates, rather than supervising only final answers. The approach constructs contrastive context pairs in two domains: coding agent trajectories (1k pairs) and multimodal image pairs (7k pairs). ContextRL achieves +2.2% average gains over standard GRPO on 5 long-horizon benchmarks and +1.8% across 12 visual QA benchmarks, with ablations showing the gains stem from the context-selection objective rather than the contrastive data alone.
Infini Memory: Topic-structured persistent memory architecture for long-term LLM agents
Researchers propose Infini Memory, a persistent memory architecture for LLM agents that organizes memory as topic-structured documents rather than isolated records or summaries. New observations are staged in a buffer and periodically consolidated, while retrieval uses iterative agentic tool calls instead of a single lookup step. The system achieves 64.7% on MemoryAgentBench, with ablations showing complementary gains from topic-structured maintenance and iterative evidence inspection.

