6 · arXiv cs.CL (Computation and Language) · 2d ago

SelfCompact: Model-driven adaptive context compaction for long agent traces

Researchers propose SelfCompact, a scaffold that lets language models decide when and how to compact their own accumulated context during long agentic runs, rather than relying on fixed token-threshold triggers. The system pairs a compaction tool with a lightweight rubric specifying when to invoke or suppress compaction based on trajectory structure (e.g., sub-task completion vs. mid-derivation). Evaluated across six benchmarks and seven models, SelfCompact matches or exceeds fixed-interval summarization while reducing per-question token cost by 30–70%, with gains of up to 18.1 points on math tasks and 5–9 points on agentic search. The work identifies a 'meta-cognitive gap' in unprompted models and shows it can be closed via scaffolding without fine-tuning.
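
To make the contrast with fixed-threshold triggers concrete, here is a minimal Python sketch of a rubric-gated compaction loop. It is an illustration only, not the authors' implementation: in SelfCompact the model itself decides when to call the compaction tool, whereas here the rubric is hard-coded, and `Step`, `should_compact`, `compact`, `run`, and `join_summary` are all hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    """One entry in an agent trace (hypothetical structure)."""
    text: str
    subtask_done: bool = False    # this step closes out a sub-task
    mid_derivation: bool = False  # this step belongs to an unfinished derivation


def should_compact(trace: List[Step], min_steps: int = 4) -> bool:
    """Toy rubric: compact at a sub-task boundary, never mid-derivation."""
    if len(trace) < min_steps:
        return False
    last = trace[-1]
    if last.mid_derivation:
        return False  # suppress: intermediate work is still needed
    return last.subtask_done


def compact(trace: List[Step], summarize: Callable[[List[Step]], str]) -> List[Step]:
    """Replace the whole trace with a single summary step."""
    return [Step(text=summarize(trace))]


def run(steps: List[Step], summarize: Callable[[List[Step]], str]) -> List[Step]:
    """Feed steps one at a time, compacting whenever the rubric allows it."""
    trace: List[Step] = []
    for step in steps:
        trace.append(step)
        if should_compact(trace):
            trace = compact(trace, summarize)
    return trace


def join_summary(trace: List[Step]) -> str:
    # Stand-in summarizer; a real system would ask the model to write this.
    return "summary(" + ", ".join(s.text for s in trace) + ")"


steps = [
    Step("search A"),
    Step("read A"),
    Step("derive x", mid_derivation=True),
    Step("answer sub-task 1", subtask_done=True),
    Step("search B"),
]
final_trace = run(steps, join_summary)
```

In this toy run the trace is compacted once, right after the sub-task closes, so `final_trace` holds one summary step followed by `search B`. A fixed token threshold could instead have fired during `derive x`, which is the failure mode the rubric is meant to avoid.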

Long Context Evolution · Inference Economics · Agent and Tool Ecosystem · SelfCompact · Self-Compacting Language Model Agents

Related guides (3)

Long Context Evolution · Topic guide

Long Context Evolution: From Bigger Windows to Smarter Memory


Agent and Tool Ecosystem · Topic guide

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer


Inference Economics · Topic guide

Inference Economics: The Cost of Running AI in Production


Related events (8)

4 · arXiv · cs.AI · 17d ago

COMPACT-VA: Planning-aligned token compression for long-context autonomous driving

Researchers introduce COMPACT-VA, a working memory framework using conditional VQ-VAE to compress extended temporal context in vision-action autonomous driving models. Compression is conditioned on historical trajectory and a learned planning intent derived from future trajectories during training, enabling end-to-end optimization without backbone modifications. On high-signal dynamic scenarios, the method achieves 68.3% success rate (>6% improvement) with 3.3x speedup and 2.7x memory reduction over uncompressed processing.

Long Context Evolution · Inference Economics · conditional VQ-VAE · Planning-aligned Token Compression for Long-Context Autonomous Driving · COMPACT-VA

6 · arXiv · cs.AI · 9d ago

TokenPilot: Dual-granularity context management cuts LLM agent inference costs by up to 87%

TokenPilot is a cache-efficient context management framework for LLM agents that addresses the trade-off between token sparsity and prompt cache continuity. It combines Ingestion-Aware Compaction (global prefix stabilization) with Lifecycle-Aware Eviction (local segment offloading) to reduce inference costs by 56–87% across benchmarks while maintaining competitive task performance. The system is evaluated on PinchBench and Claw-Eval and has been integrated into the open-source LightMem2 library.
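
The tension between token sparsity and prompt cache continuity can be illustrated with a small Python sketch. It assumes a prefix cache that reuses only the longest run of leading segments shared with the previous prompt. The segment names and the `shared_prefix_len` helper are hypothetical, and this is not TokenPilot's algorithm, only the effect it is designed around.

```python
from typing import List


def shared_prefix_len(prev: List[str], curr: List[str]) -> int:
    """Count leading segments that are identical in both prompts."""
    n = 0
    for a, b in zip(prev, curr):
        if a != b:
            break
        n += 1
    return n


prev_prompt = ["sys", "tool_spec", "obs1", "obs2", "obs3"]

# Dropping an early segment shifts everything after it,
# so the reusable prefix ends right where the edit happened.
evict_early = ["sys", "tool_spec", "obs2", "obs3", "obs4"]

# Dropping a later segment leaves a longer prefix untouched.
evict_late = ["sys", "tool_spec", "obs1", "obs3", "obs4"]

reuse_early = shared_prefix_len(prev_prompt, evict_early)
reuse_late = shared_prefix_len(prev_prompt, evict_late)
```

Both edited prompts have the same length, but the second keeps three cached segments instead of two. Under this assumption, where context is removed matters as much as how much is removed.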

Inference Economics · Agent and Tool Ecosystem · PinchBench · Claw-Eval · LightMem · +2 more

6 · arXiv · cs.CL · 23d ago

AgentCL: A Rigorous Evaluation Framework for Continual Learning in Language Agents

AgentCL is a new benchmark and evaluation framework designed to rigorously assess continual learning in language agents, addressing gaps in existing benchmarks that focus on retrieval over long-context documents or use naive task streams with limited cross-task analysis. The framework constructs compositional task streams where earlier sub-solutions, evidence, or workflows are intentionally reusable in later tasks, contrasting them with naive streams to measure transfer gains. The authors also introduce MemProbe, a probing method that stores interactions, insights, and skills while filtering unreliable experiences during consolidation. Empirical results across coding, deep research, and language understanding tasks show that controlled streams better distinguish memory design quality, and that naive streams can mask memory-induced degradation.

Long Context Evolution · Evaluation and Benchmarking · AgentCL · MemProbe · Continual Learning · +3 more

7 · The Batch · 23d ago

Recursive Language Models Offer Path To Dramatically Expand Beyond the Context Window

MIT researchers Alex L. Zhang, Tim Kraska, and Omar Khattab propose Recursive Language Models (RLMs), a framework that offloads long-context processing to an external Python REPL environment, allowing models to programmatically fetch and manage text chunks via code generation. The root model spawns submodel instances to handle subtasks and recursively aggregates their outputs. Evaluated on benchmarks requiring reasoning over documents up to 11 million tokens, RLMs substantially outperform both base models and competing agentic strategies such as retrieval and summarization agents. For example, RLM-GPT-5 achieved 91.3% on BrowseComp+, whereas GPT-5 on its own failed to produce an answer, and reached ~50% accuracy on OOLONG-PAIRS at 1 million tokens versus near-zero for baseline approaches.
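
The recursive spawn-and-aggregate pattern can be sketched in a few lines of Python. This is a deliberately simplified stand-in: in the RLM framework the root model writes its own code in a REPL to decide how to slice the text, whereas here the split is a fixed word-count rule, and `recursive_call`, `split`, and `count_needles` are hypothetical names, with a word counter standing in for the submodel.

```python
from typing import Callable, List


def split(text: str, parts: int) -> List[str]:
    """Cut text into roughly equal, word-aligned pieces."""
    words = text.split()
    size = max(1, -(-len(words) // parts))  # ceiling division
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def recursive_call(
    text: str,
    leaf_model: Callable[[str], int],
    aggregate: Callable[[List[int]], int],
    max_words: int = 8,
    fanout: int = 2,
) -> int:
    """Answer directly if the text fits the budget, else recurse on chunks."""
    if len(text.split()) <= max_words:
        return leaf_model(text)
    partials = [
        recursive_call(chunk, leaf_model, aggregate, max_words, fanout)
        for chunk in split(text, fanout)
    ]
    return aggregate(partials)


def count_needles(chunk: str) -> int:
    # Stand-in for a submodel call: counts a target word in a small chunk.
    return chunk.split().count("needle")


document = " ".join(
    ["hay"] * 10 + ["needle"] + ["hay"] * 20 + ["needle"] + ["hay"] * 5
)
total = recursive_call(document, count_needles, sum)
```

No single call ever sees more than `max_words` words, yet the aggregated answer covers the whole document, which is the property that lets the approach scale past a fixed context window.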

Long Context Evolution · Evaluation and Benchmarking · MIT · OOLONG-PAIRS · Tim Kraska · +9 more

5 · arXiv · cs.CL · 13d ago

Adaptive asymmetric token compression accelerates time series language models up to 7.68×

A new arXiv preprint proposes an adaptive token budgeting framework for time series (TS) language models that compresses TS tokens using frequency-domain structure and progressively prunes prompt tokens across model layers. The authors demonstrate up to 7.68× inference acceleration with performance improvements in 78% of evaluated settings across forecasting, classification, imputation, and anomaly detection tasks. The work is motivated by the observation that TS tokens have uneven spectral contributions and prompt-token influence attenuates with model depth, making uniform token processing wasteful.

Long Context Evolution · Inference Economics · Beyond Uniform Tokens: Adaptive Compression for Time Series Language Models

4 · arXiv · cs.CL · 14d ago

C-DIC: Context-Driven Incremental Compression for efficient long-horizon multi-turn dialogue

A new arXiv preprint introduces Context-Driven Incremental Compression (C-DIC), a method for managing growing dialogue history in conversational agents by treating conversations as interleaved contextual threads with revisable per-thread compression states stored in a compact dialogue memory. A retrieve-revise-write-back loop shares information across turns and updates stale memories, while truncated backpropagation-through-time (TBPTT) is adapted to learn cross-turn dependencies. Experiments on long-form dialogue benchmarks show stable inference latency and perplexity over hundreds of turns, addressing compounding errors seen in existing context compressors.
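
The retrieve-revise-write-back loop can be pictured with a plain-Python stand-in. C-DIC's per-thread states are learned compressed representations. The sketch below swaps them for strings in a dictionary purely to show the control flow, and `process_turn`, `keep_latest_fact`, and the thread names are hypothetical.

```python
from typing import Callable, Dict

Memory = Dict[str, str]  # thread id -> compact state for that thread


def process_turn(
    memory: Memory,
    thread_id: str,
    utterance: str,
    revise: Callable[[str, str], str],
) -> str:
    """Run one retrieve-revise-write-back cycle for a single dialogue turn."""
    old_state = memory.get(thread_id, "")     # retrieve the thread's state
    new_state = revise(old_state, utterance)  # revise it with the new turn
    memory[thread_id] = new_state             # write the result back
    return new_state


def keep_latest_fact(old_state: str, utterance: str) -> str:
    # Stand-in reviser: the newer statement simply replaces the stale one.
    return utterance


memory: Memory = {}
process_turn(memory, "travel", "flight is on Monday", keep_latest_fact)
process_turn(memory, "dinner", "table for four", keep_latest_fact)
process_turn(memory, "travel", "flight moved to Tuesday", keep_latest_fact)
```

After the third turn the `travel` entry reflects the update while `dinner` is untouched, and the memory stays at one state per thread regardless of how many turns have passed.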

Long Context Evolution · Inference Economics · Context-Driven Incremental Compression · Context-Driven Incremental Compression for Multi-Turn Dialogue Generation

6 · arXiv · cs.CL · 1mo ago

LongMINT: Benchmark for Evaluating Memory Under Multi-Target Interference in Long-Horizon Agent Systems

LongMINT is a new benchmark designed to evaluate memory-augmented agents in realistic long-horizon settings where information is repeatedly updated and interferes across memories. It contains 15.6k QA pairs over contexts averaging 138.8k tokens (up to 1.8M tokens), spanning domains including state tracking, multi-turn dialogue, Wikipedia revisions, and GitHub commits. Evaluation of 7 representative systems—including vanilla long-context LLMs, RAG, and memory-augmented agent frameworks—reveals consistently low average accuracy of 27.9%, with performance particularly degraded on multi-target aggregation tasks and when earlier facts are revised by subsequent context. The analysis identifies retrieval and memory construction as the primary bottlenecks.

Long Context Evolution · Evaluation and Benchmarking · LongMINT · Retrieval-Augmented Generation · long-context LLMs · +2 more

7 · arXiv · cs.CL · 16d ago

Latent Context Language Models (LCLMs) achieve competitive encoder-decoder KV cache compression at scale

Researchers introduce Latent Context Language Models (LCLMs), a family of encoder-decoder compressors that map long token sequences to shorter latent embeddings consumed by a decoder, targeting the KV cache memory bottleneck in long-context inference. The authors conduct architecture search and continually pre-train 0.6B-encoder/4B-decoder models on over 350B tokens at compression ratios of 1:4, 1:8, and 1:16. LCLMs improve the Pareto frontier across general-task performance, compression speed, and peak memory, and are demonstrated as efficient backbones for long-horizon agents that can skim compressed context and expand relevant segments on demand. The work closes a previously noted gap between encoder-decoder approaches and KV cache compression methods on the accuracy-efficiency frontier.
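
As a rough sense of scale for those ratios, the Python snippet below computes how many latent positions a decoder would see for a given input length. The 128,000-token context and the `latent_length` helper are illustrative assumptions, and the calculation ignores encoder cost and any uncompressed tokens the decoder also attends to.

```python
import math


def latent_length(num_tokens: int, ratio: int) -> int:
    """Latent positions left after compressing num_tokens at 1:ratio."""
    return math.ceil(num_tokens / ratio)


context_tokens = 128_000  # assumed input length, for illustration only
lengths = {f"1:{r}": latent_length(context_tokens, r) for r in (4, 8, 16)}
# lengths == {"1:4": 32000, "1:8": 16000, "1:16": 8000}
```

If decoder KV cache size grows linearly with the number of positions, the cache for the compressed portion shrinks by the same factor as the ratio.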

Long Context Evolution · Inference Economics · End-to-End Context Compression at Scale · Latent Context Language Models · +1 more