4 · arXiv cs.AI (Artificial Intelligence) · 12d ago

COMPACT-VA: Planning-aligned token compression for long-context autonomous driving

Researchers introduce COMPACT-VA, a working-memory framework that uses a conditional VQ-VAE to compress extended temporal context in vision-action autonomous driving models. Compression is conditioned on the historical trajectory and on a learned planning intent that is derived from future trajectories during training, which enables end-to-end optimization without backbone modifications. On high-signal dynamic scenarios, the method achieves a 68.3% success rate (an improvement of more than 6%) with a 3.3x speedup and a 2.7x memory reduction over uncompressed processing.
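
For readers unfamiliar with VQ-VAEs, the sketch below shows only the generic vector-quantization step that such models build on: each continuous token is replaced by the index of its nearest codebook entry. It is not the COMPACT-VA implementation. The conditioning on trajectory and planning intent is omitted, and the names `nearest_code`, `quantize_tokens`, `dequantize`, the toy codebook, and the toy tokens are all hypothetical.

```python
import math


def nearest_code(vector, codebook):
    """Return the index of the codebook entry closest to `vector` (squared L2)."""
    best_index, best_dist = 0, math.inf
    for index, code in enumerate(codebook):
        dist = sum((v - c) ** 2 for v, c in zip(vector, code))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def quantize_tokens(tokens, codebook):
    """Map each continuous token to a discrete codebook index."""
    return [nearest_code(token, codebook) for token in tokens]


def dequantize(indices, codebook):
    """Look the indices back up to recover approximate token vectors."""
    return [codebook[i] for i in indices]


# Toy data (assumed for illustration): a 3-entry codebook of 2-D vectors.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
tokens = [[0.9, 1.2], [0.1, -0.2], [-0.8, 0.7]]

indices = quantize_tokens(tokens, codebook)
approx = dequantize(indices, codebook)
print(indices)
```

Storing one small integer per token instead of a full feature vector is where the memory saving of a quantized representation comes from. A real system would learn the codebook jointly with the encoder rather than fix it by hand.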

Long Context Evolution · Inference Economics · conditional VQ-VAE · Planning-aligned Token Compression for Long-Context Autonomous Driving · COMPACT-VA

Related guides (2)

Long Context Evolution · Topic guide

Long Context Evolution: From Bigger Windows to Smarter Memory

Inference Economics · Topic guide

Inference Economics: The Cost of Running AI in Production

Related events (8)

6 · arXiv · cs.CL · 17d ago

VaSE: Value-Aware Stochastic KV Cache Eviction improves reasoning model efficiency

A new arXiv preprint introduces Value-aware Stochastic KV Cache Eviction (VaSE), a training-free method for compressing KV caches in long-chain-of-thought reasoning models. The authors identify two key failure modes in prior eviction approaches (catastrophic repetition loops caused by evicting high-magnitude value states, and low cache diversity) and address both with targeted protections and stochastic eviction. On six reasoning tasks with Qwen3 models at 4x compression, VaSE outperforms the current best selection-based sparse attention method and exceeds the strongest eviction baseline by over 4%, while supporting FlashAttention-2 and maintaining a static memory footprint.
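
The two ideas named in the summary, protecting high-magnitude value states and evicting stochastically, can be illustrated with a minimal sketch. This is not the VaSE algorithm. The uniform sampling rule, the function name `evict_stochastic`, its parameters, and the toy norms are all assumptions made for illustration.

```python
import random


def evict_stochastic(value_norms, budget, num_protected, seed=0):
    """Choose which cache positions to keep (hypothetical scheme).

    value_norms:   one value-state magnitude per cached token
    budget:        total number of positions to keep
    num_protected: highest-magnitude positions that are never evicted
    """
    if not 0 <= num_protected <= budget <= len(value_norms):
        raise ValueError("need 0 <= num_protected <= budget <= cache size")
    rng = random.Random(seed)
    positions = list(range(len(value_norms)))
    # Rank positions by value magnitude, largest first.
    by_norm = sorted(positions, key=lambda i: value_norms[i], reverse=True)
    protected = by_norm[:num_protected]
    candidates = by_norm[num_protected:]
    # Assumption: the remaining slots are filled by uniform random sampling,
    # so the kept set is not dominated by one kind of token.
    survivors = rng.sample(candidates, budget - num_protected)
    return sorted(protected + survivors)


# Toy cache of 12 tokens compressed to 3 kept positions (4x compression).
norms = [0.2, 5.0, 0.3, 0.1, 0.6, 0.4, 0.2, 0.3, 0.5, 0.1, 0.7, 0.2]
kept = evict_stochastic(norms, budget=3, num_protected=1)
print(kept)
```

Because the budget is fixed up front, the kept set never grows, which matches the static memory footprint the summary mentions. Position 1, the token with the largest value magnitude, is always retained.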

Frontier Model Releases · Inference Economics · FlashAttention-3 · Qwen3 · Value-aware Stochastic KV Cache Eviction

6 · arXiv · cs.AI · 1mo ago

Lost in Fog: Sensor Perturbations Expose Reasoning Fragility in Driving VLAs

This paper presents a controlled robustness study of Vision-Language-Action (VLA) models in autonomous driving, evaluating Alpamayo R1 (10B parameters) across ~18,000 inference trials under eight sensor perturbation types, including noise, lighting extremes, and fog. The key finding is that Chain-of-Causation (CoC) reasoning consistency is a high-fidelity proxy for trajectory reliability: when CoC explanations change post-perturbation, trajectory deviation spikes 5.3x (r=0.99 across attack types). Enabling CoC generation is associated with an 11.8% average improvement in trajectory accuracy, and degradation under noise is approximately linear (R²=0.957), while standard preprocessing defenses offer only marginal benefit.

Evaluation and Benchmarking · AI Safety Research · Vision-Language-Action model · Chain-of-Causation · autonomous driving

7 · arXiv · cs.CL · 11d ago

Latent Context Language Models (LCLMs) achieve competitive encoder-decoder KV cache compression at scale

Researchers introduce Latent Context Language Models (LCLMs), a family of encoder-decoder compressors that map long token sequences to shorter latent embeddings consumed by a decoder, targeting the KV cache memory bottleneck in long-context inference. The authors conduct architecture search and continually pre-train 0.6B-encoder/4B-decoder models on over 350B tokens at compression ratios of 1:4, 1:8, and 1:16. LCLMs improve the Pareto frontier across general-task performance, compression speed, and peak memory, and are demonstrated as efficient backbones for long-horizon agents that can skim compressed context and expand relevant segments on demand. The work closes a previously noted gap between encoder-decoder approaches and KV cache compression methods on the accuracy-efficiency frontier.

Long Context Evolution · Inference Economics · End-to-End Context Compression at Scale · Latent Context Language Models

5 · Hugging Face Blog · 1mo ago

Unlocking Longer Generation with Key-Value Cache Quantization

This Hugging Face blog post covers KV cache quantization as a technique to reduce memory consumption during LLM inference, enabling longer context generation without proportional VRAM increases. The post likely explains how quantizing the key-value cache (e.g., to INT8 or lower precision) trades minimal accuracy for significant memory savings. This is directly relevant to inference efficiency and long-context deployment patterns.
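
As a rough illustration of the trade-off described above, the following sketch applies symmetric absmax INT8 quantization to a handful of cached values and measures the round-trip error. It is a generic textbook scheme, not the method or API from the blog post. The helper names `quantize_int8` and `dequantize_int8` and the sample values are hypothetical.

```python
def quantize_int8(values):
    """Symmetric absmax quantization of floats to integer codes in [-127, 127]."""
    # Fall back to a scale of 1.0 if every value is zero.
    scale = max(abs(v) for v in values) / 127.0 or 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize_int8(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


# Toy slice of a key or value tensor (assumed values).
keys = [0.12, -1.5, 0.75, 3.0, -2.25]

codes, scale = quantize_int8(keys)
restored = dequantize_int8(codes, scale)
max_error = max(abs(a - b) for a, b in zip(keys, restored))
print(codes, scale, max_error)
```

Each code fits in one byte, versus two bytes for a float16 entry or four for float32, so the cache shrinks by 2x or 4x respectively, plus a small overhead for the stored scale. The rounding error per entry is bounded by half the scale.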

Training Infrastructure · Long Context Evolution · Transformers · KV Cache Quantization · Hugging Face

5 · Hugging Face Blog · 1mo ago

Mastering Long Contexts in LLMs with KVPress

NVIDIA and Hugging Face present KVPress, a library for compressing the KV cache in large language models to enable more efficient long-context inference. The tool implements multiple KV cache compression ("pressing") algorithms that reduce memory footprint and latency without retraining models. KVPress is positioned as a practical toolkit for deploying LLMs in long-context scenarios where KV cache size becomes a bottleneck.

Long Context Evolution · Inference Economics · KV Cache · KVPress · NVIDIA

6 · arXiv · cs.CL · 23d ago

VisualMem: Personal Visual Memory Benchmark and Architecture for Personalized AI Agents

This paper introduces a benchmark and hybrid architecture (VisualMem) for personal visual memory in long-term AI agent memory systems. The work addresses a gap in existing text-centric memory systems by capturing both explicit evidence (recurring user-associated entities) and implicit evidence (latent user facts from visual/multimodal cues) from images. VisualMem augments a text-memory backend with a structured personal visual memory module that uses conversational context to resolve identity, ownership, and durable user facts. Experiments show VisualMem substantially outperforms prior memory systems on the new benchmark while remaining competitive on standard text-memory benchmarks.

Long Context Evolution · Evaluation and Benchmarking · VisualMem · long-term memory · Personal Visual Memory Benchmark

7 · arXiv · cs.CL · 18d ago

AdaCodec: Predictive Visual Coding for Efficient Video MLLMs

AdaCodec introduces a predictive visual code interface for video multimodal large language models that exploits temporal redundancy in video. Instead of encoding every sampled frame as an independent RGB image, it sends full visual tokens only for reference frames with high conditional predictive cost, and encodes inter-frame changes as compact P-tokens. Evaluated against a Qwen3-VL-8B per-frame baseline across eleven benchmarks, AdaCodec at 1/7 the token budget (32k vs 224k tokens) surpasses the baseline on all long-video benchmarks while reducing time-to-first-token from 9.26 s to 1.62 s, roughly a 5.7x reduction.

Long Context Evolution · Frontier Model Releases · Multimodal Large Language Models · Qwen3-4B · predictive visual code

5 · arXiv · cs.AI · 9d ago

DIRECT: Adaptive test-time compute routing for embodied VLM planners

Researchers introduce DIRECT, a routing framework that dynamically allocates test-time compute for Vision-Language Models acting as embodied planners, using multimodal scene context to decide per-prompt how much compute to spend. Experiments on VLABench and RoboMME benchmarks show that different scaling axes (chain-of-thought depth, model size, memory history) yield qualitatively distinct gains, and that naive uniform scaling is wasteful. On a physical Franka arm, DIRECT matches or exceeds a stronger model's success rate at up to 65% lower average latency, improving the success-cost Pareto frontier.

Inference Economics · Agent and Tool Ecosystem · RoboMME · Franka · DROID