5arXiv cs.AI (Artificial Intelligence)·10d ago

DIRECT: Adaptive test-time compute routing for embodied VLM planners

Researchers introduce DIRECT, a routing framework that dynamically allocates test-time compute for Vision-Language Models acting as embodied planners, using multimodal scene context to decide per-prompt how much compute to spend. Experiments on VLABench and RoboMME benchmarks show that different scaling axes (chain-of-thought depth, model size, memory history) yield qualitatively distinct gains, and that naive uniform scaling is wasteful. On a physical Franka arm, DIRECT matches or exceeds a stronger model's success rate at up to 65% lower average latency, improving the success-cost Pareto frontier.

Inference Economics Agent and Tool Ecosystem RoboMME Franka DROID DIRECT VLABench

Related guides (2)

Agent and Tool EcosystemTopic guide

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer

Read asBeginner In-depth

Inference EconomicsTopic guide

Inference Economics: The Cost of Running AI in Production

Read asBeginner In-depth

Related events (8)

5arXiv · cs.AI·10d ago·source ↗

Reroute: Training-free recoverable visual token routing for vision-language models

A new arXiv preprint proposes Reroute, a training-free plug-in that replaces the standard rank-and-remove visual token pruning paradigm in VLMs with a recoverable routing mechanism. Instead of permanently discarding low-ranked tokens, Reroute defers them to re-enter the candidate pool at later decoder stages, addressing the problem that token importance shifts across decoder depth. Evaluated on LLaVA-1.5 and Qwen backbones augmented with FastV, PDrop, and Nüwa pruning methods, Reroute improves grounding performance under aggressive token reduction without sacrificing general VQA accuracy. The approach preserves the theoretical compute and KV-cache budget of the underlying pruning method.

Inference Economics Multimodal Progress FastV PDrop Qwen +4 more

4arXiv · cs.AI·13d ago·source ↗

COMPACT-VA: Planning-aligned token compression for long-context autonomous driving

Researchers introduce COMPACT-VA, a working memory framework using conditional VQ-VAE to compress extended temporal context in vision-action autonomous driving models. Compression is conditioned on historical trajectory and a learned planning intent derived from future trajectories during training, enabling end-to-end optimization without backbone modifications. On high-signal dynamic scenarios, the method achieves 68.3% success rate (>6% improvement) with 3.3x speedup and 2.7x memory reduction over uncompressed processing.

Long Context Evolution Inference Economics conditional VQ-VAE Planning-aligned Token Compression for Long-Context Autonomous Driving COMPACT-VA

5arXiv · cs.AI·16d ago·source ↗

TempoVLA: Speed-Controllable Vision-Language-Action Policy for Robot Manipulation

Researchers introduce TempoVLA, a Vision-Language-Action model that enables explicit speed control during robot manipulation by conditioning on a speed signal rather than inheriting a fixed speed from training data. The system pairs Variable-Speed Trajectory Augmentation (VSTA), which re-times demonstrations by merging or splitting actions, with a model-side conditioning mechanism. Experiments in simulation and real-world tasks show flexible bidirectional speed control, with dynamic adaptation—accelerating in low-risk transit phases and decelerating for high-risk contact stages—achieved by coupling with a large multimodal model.

Agent and Tool Ecosystem Multimodal Progress Variable-Speed Trajectory Augmentation TempoVLA TempoVLA: Learning Speed-Controllable Vision-Language-Action Policies

6arXiv · cs.CL·1mo ago·source ↗

Mem-π: Adaptive Memory for LLM Agents via On-Demand Generation and Decoupled RL

Mem-π introduces a framework where a dedicated language or vision-language model generates context-specific guidance for LLM agents on demand, rather than retrieving static entries from episodic memory banks. The system is trained with a decision-content decoupled reinforcement learning objective that jointly learns when to generate guidance and what to generate, enabling abstention when generation would not help. Evaluated across web navigation, terminal-based tool use, and text-based embodied interaction benchmarks, Mem-π achieves over 30% relative improvement on web navigation tasks compared to retrieval-based and prior RL-optimized memory baselines.

Evaluation and Benchmarking Agent and Tool Ecosystem web navigation benchmark Mem-π large language model agents +3 more

4arXiv · cs.AI·5d ago·source ↗

PACT: Hybrid SLM deliberation architecture improves reactive RL policies in unfamiliar environments

Researchers propose PACT (Plan, Align, Commit, Think), a hybrid architecture pairing a fast reactive RL policy with an asynchronous small language model planner for deliberation. The SLM generates and validates candidate action plans via simulation before committing to execution, bypassing the RL policy without retraining. Evaluated on FrozenLake configurations of increasing difficulty, PACT outperforms baselines using only a 2B-parameter SLM, suggesting complementary strengths between deliberative planning and reactive execution.

Agent and Tool Ecosystem PACT

4arXiv · cs.LG·6d ago·source ↗

Dual-adapter routing system improves knowledge editing precision in LLMs

A new arXiv paper introduces a route-specialized dual-adapter architecture for knowledge editing in LLMs, separating the concerns of writing edits (edit adapter) and suppressing them when irrelevant (locality adapter). A relevance router gates which adapter is applied, addressing the locality problem in memory-assisted editing. Evaluated on CounterFact, zsRE, and MQuAKE benchmarks using Llama-3.1-8B-Instruct and Qwen3-8B, the method achieves best-in-class probability-preference accuracy across all three datasets. Ablations show the gain comes from the architectural separation rather than increased parameter capacity.

Evaluation and Benchmarking Alignment and RLHF BGE Llama3-8B-Instruct Qwen3-4B +4 more

5arXiv · cs.AI·11d ago·source ↗

EEVEE: Multi-dataset test-time prompt learning framework for self-improving LLM agents

EEVEE is a new framework enabling LLM agents to perform test-time prompt learning across heterogeneous multi-dataset task streams, addressing a gap where prior methods only handled single-dataset settings. The system uses a router to partition inputs into task clusters and assigns them to suitable prompt configurations, optimized via a router-prompt co-evolution strategy. Experiments show improvements of 10.38 and 24.32 average points over Qwen3-4B-Instruct and DeepSeek-V3.2 respectively, outperforming prior SOTA methods GEPA and ACE by up to 48.2%.

Evaluation and Benchmarking Agent and Tool Ecosystem ACE DeepSeek V4 Qwen3-4B-Instruct +2 more

5arXiv · cs.CL·19d ago·source ↗

CRAM: Centroid-Routing and Adaptive MoE for Multimodal Continual Instruction Tuning

CRAM is a new method for Multimodal Continual Instruction Tuning (MCIT) that addresses the tension between catastrophic forgetting and parameter efficiency in MLLMs. It combines adaptive-rank instantiation to dynamically allocate parameters based on capability gaps, centroid-guided routing to reuse existing expert knowledge, and an orthogonality penalty to confine new updates to task-specific directions. The approach uses a Mixture-of-Experts architecture where task-specific patterns are isolated into independent modules, avoiding both the interference of shared updates and the parameter bloat of fully isolated expansion. Experiments across diverse benchmarks show consistent improvements over existing MCIT methods.

Enterprise Deployment Patterns Agent and Tool Ecosystem Multimodal Large Language Models CRAM centroid-guided routing +4 more