6 · arXiv cs.AI (Artificial Intelligence) · 5d ago

Parallel-Synthesis framework enables LLM agents to consume KV caches directly, cutting synthesis latency 2.5x–11x

Researchers introduce Parallel-Synthesis, a plug-and-play framework that allows a synthesizer LLM to directly consume KV caches produced by parallel worker agents instead of concatenating their textual outputs. The system combines a cache mapper for calibrating independently generated branch caches with a fine-tuned synthesizer adapter, trained via distillation from standard text-concatenation synthesis. Evaluated across nine datasets spanning math, science QA, code generation, GAIA, and multi-agent database diagnosis, it matches or outperforms text-based synthesis on seven datasets while reducing time-to-first-token by 2.5x–11x. The work proposes a fundamentally different interface for multi-agent synthesis that avoids redundant prefill computation inherent in sequential text merging.

Inference Economics · Agent and Tool Ecosystem · Parallel-Synthesis · GAIA
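To make the prefill saving described in the summary concrete, here is a toy token-count model. It is a sketch, not the paper's implementation: it assumes prefill cost scales with the number of tokens the synthesizer must process, and it ignores the cost of the cache mapper. The function names and token counts are hypothetical.

```python
def prefill_tokens_text_concat(prompt_tokens, branch_tokens):
    """Text-based synthesis: the synthesizer re-reads its own prompt
    plus the concatenated text output of every worker."""
    return prompt_tokens + sum(branch_tokens)


def prefill_tokens_cache_reuse(prompt_tokens, branch_tokens):
    """Cache-based synthesis: worker KV caches are consumed directly,
    so only the synthesizer prompt is prefilled.

    Assumption: mapping/calibrating the branch caches is free here,
    which a real system would not get.
    """
    return prompt_tokens


branches = [1200, 900, 1500]  # hypothetical output lengths of three workers
text_cost = prefill_tokens_text_concat(200, branches)
cache_cost = prefill_tokens_cache_reuse(200, branches)
print(text_cost, cache_cost)
```

The gap between the two counts grows with the number and length of worker outputs. The measured speedup in any real system also depends on the mapping overhead and on attention over the reused caches, which this sketch leaves out.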

Related guides (2)

Agent and Tool Ecosystem · Topic guide

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer

Inference Economics · Topic guide

Inference Economics: The Cost of Running AI in Production

Related events (8)

7 · arXiv · cs.CL · 11d ago

Latent Context Language Models (LCLMs) achieve competitive encoder-decoder KV cache compression at scale

Researchers introduce Latent Context Language Models (LCLMs), a family of encoder-decoder compressors that map long token sequences to shorter latent embeddings consumed by a decoder, targeting the KV cache memory bottleneck in long-context inference. The authors conduct architecture search and continually pre-train 0.6B-encoder/4B-decoder models on over 350B tokens at compression ratios of 1:4, 1:8, and 1:16. LCLMs improve the Pareto frontier across general-task performance, compression speed, and peak memory, and are demonstrated as efficient backbones for long-horizon agents that can skim compressed context and expand relevant segments on demand. The work closes a previously noted gap between encoder-decoder approaches and KV cache compression methods on the accuracy-efficiency frontier.

Long Context Evolution · Inference Economics · End-to-End Context Compression at Scale · Latent Context Language Models
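As a rough illustration of what the compression ratios above mean for sequence length, the sketch below converts a token count into a latent count at 1:4, 1:8, and 1:16. The rounding rule and the 32,000-token input are assumptions for illustration, not details from the paper.

```python
import math


def latent_length(num_tokens, ratio):
    """Number of latent embeddings a 1:ratio compressor would emit.

    Assumption: a partial final chunk still yields one latent (round up).
    """
    return math.ceil(num_tokens / ratio)


context = 32_000  # hypothetical long-context input, in tokens
lengths = {ratio: latent_length(context, ratio) for ratio in (4, 8, 16)}
for ratio, n in lengths.items():
    print(f"1:{ratio} -> {n} latents")
```

Because a decoder's KV cache grows linearly with the number of positions it attends over, shortening the sequence it sees shrinks the cache by roughly the same factor.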

5 · arXiv · cs.CL · 11d ago

AGENTSERVESIM: Hardware-aware simulator for multi-turn LLM agent serving policies

Researchers introduce AGENTSERVESIM, a simulation framework designed to evaluate serving policies for multi-turn LLM agents without requiring dedicated accelerator hardware. The simulator models program-level execution including turn dependencies, tool-induced gaps, and KV-cache residency across HBM, host DRAM, and CXL memory hierarchies. It reproduces real-system behavior within 6% error on key performance metrics while running on commodity CPUs, enabling cost-effective exploration of scheduling, routing, and cache management policies for agentic workloads.

Training Infrastructure · Inference Economics · AGENTSERVESIM

4 · arXiv · cs.CL · 5d ago

Synthetic data generation method enables small LLMs to match large models on Text-To-Cypher tasks

A new arXiv paper presents an automatic synthetic data generation method for fine-tuning small LLMs on Text-To-Cypher (Text2Cypher) parsing, enabling natural language interfaces to property graph databases. Experiments across major Text-To-Cypher benchmarks show that small fine-tuned models can compete with much larger proprietary models. The approach is positioned as a solution for local deployment scenarios requiring data sovereignty without expensive annotation.

Evaluation and Benchmarking · Enterprise Deployment Patterns · Cypher · Achieving Precise Text-To-Cypher Via Grounded Knowledge Graph Data Generation

5 · arXiv · cs.CL · 17d ago

Synthetic LLM-generated conversations improve ASR training for low-resource languages

Researchers propose a pipeline that uses LLMs to generate scenario-level dialogues and TTS to synthesize multi-speaker audio, creating simulated conversational training data for ASR systems. Evaluated on the Hungarian BEA-Dialogue benchmark, a model trained on 67 hours of real plus 636 hours of synthetic data outperforms a zero-shot model trained on 2,700 hours of real Hungarian speech. The study tests five LLM families under multiple budget and mixing configurations using a FastConformer-Large backbone, finding that generator choice and data composition significantly affect gains.

Evaluation and Benchmarking · FastConformer-Large · Efficient ASR Training with Conversations that Never Happened · BEA-Dialogue

7 · arXiv · cs.AI · 1mo ago

Agent JIT Compilation for Latency-Optimizing Web Agent Planning and Scheduling

This paper introduces agent just-in-time (JIT) compilation as an alternative to the sequential fetch-screenshot-execute loop used by current computer-use agents. The approach compiles natural language task descriptions directly into executable code that can include LLM calls, tool calls, and parallelization, using three components: JIT-Planner, JIT-Scheduler, and an invariant-enforcing tool protocol. Across five web applications, JIT-Planner achieves 10.4× speedup and +28% accuracy over Browser-Use, while JIT-Scheduler achieves 2.4× speedup and +9% accuracy over OpenAI CUA.

Frontier Model Releases · Evaluation and Benchmarking · JIT-Scheduler · OpenAI CUA · Browser-Use

6 · arXiv · cs.AI · 29d ago

LCGuard: Adversarial Training Framework for Safe KV Cache Sharing in Multi-Agent LLM Systems

LCGuard introduces a framework for preventing sensitive information leakage when multi-agent LLM systems share KV caches as a latent communication channel. The approach formalizes leakage operationally via reconstruction: a shared cache artifact is deemed unsafe if an adversarial decoder can recover sensitive inputs from it. An adversarial training loop pits a reconstructor against LCGuard's representation-level transformations, which aim to preserve task-relevant semantics while suppressing recoverable sensitive content. Empirical results across multiple model families and multi-agent benchmarks show reduced reconstruction-based leakage and attack success rates with competitive task performance.

Inference Economics · AI Safety Research · KV Cache · representation-level sensitive information leakage · LCGuard

6 · arXiv · cs.AI · 4d ago

TokenPilot: Dual-granularity context management cuts LLM agent inference costs by up to 87%

TokenPilot is a cache-efficient context management framework for LLM agents that addresses the trade-off between token sparsity and prompt cache continuity. It combines Ingestion-Aware Compaction (global prefix stabilization) with Lifecycle-Aware Eviction (local segment offloading) to reduce inference costs by 56–87% across benchmarks while maintaining competitive task performance. The system is evaluated on PinchBench and Claw-Eval and has been integrated into the open-source LightMem2 library.

Inference Economics · Agent and Tool Ecosystem · PinchBench · Claw-Eval · LightMem
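The trade-off between token sparsity and prompt cache continuity mentioned above can be seen in a small sketch. It assumes a prompt cache that reuses only the longest exact prefix shared with the previous request, which is how prefix caching generally behaves; the token lists and function name are hypothetical and this is not TokenPilot's algorithm.

```python
def cached_prefix_len(previous, current):
    """Length of the longest shared prefix between two token sequences."""
    n = 0
    for a, b in zip(previous, current):
        if a != b:
            break
        n += 1
    return n


prev = ["sys", "t1", "t2", "t3", "t4"]

# Evicting a segment from the middle shortens the prompt but shifts
# everything after it, so the cache is only valid up to the edit point.
drop_middle = ["sys", "t1", "t3", "t4", "t5"]

# Appending keeps the whole previous prompt as a reusable prefix.
append_only = ["sys", "t1", "t2", "t3", "t4", "t5"]

hit_after_eviction = cached_prefix_len(prev, drop_middle)
hit_after_append = cached_prefix_len(prev, append_only)
print(hit_after_eviction, hit_after_append)
```

Under this assumption, removing tokens early in the context saves input tokens but invalidates most of the cache, which is the tension a cache-aware context manager has to balance.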

4 · Hugging Face Blog · 1mo ago

KV Cache from scratch in nanoVLM

This Hugging Face blog post walks through implementing a key-value (KV) cache from scratch within the nanoVLM framework, a minimal vision-language model codebase. The post serves as a technical tutorial explaining how KV caching works in transformer-based multimodal models and how to integrate it for inference efficiency. It targets practitioners seeking to understand the mechanics of KV caching in the context of VLMs rather than just using it as a black box.

Inference Economics · Multimodal Progress · KV Cache · nanoVLM · Vision-Language Models
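For readers who want the core idea before opening the tutorial, below is a minimal single-head KV cache in plain Python. It is a generic sketch, not nanoVLM's code: the `KVCache` class, the `attend` helper, and the toy vectors are all made up for illustration, and a real model would derive queries, keys, and values from learned projections.

```python
import math


def attend(query, keys, values):
    """Scaled dot-product attention of one query over all keys/values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]


class KVCache:
    """Keeps keys and values of past tokens so each is computed only once."""

    def __init__(self):
        self.keys, self.values = [], []

    def step(self, query, key, value):
        # Append the new token's key/value, then attend over everything so far.
        self.keys.append(key)
        self.values.append(value)
        return attend(query, self.keys, self.values)


# Toy per-token (query, key, value) vectors, hypothetical values.
tokens = [([1.0, 0.0], [1.0, 0.0], [1.0, 2.0]),
          ([0.0, 1.0], [0.0, 1.0], [3.0, 4.0]),
          ([1.0, 1.0], [1.0, 1.0], [5.0, 6.0])]

cache = KVCache()
cached_outputs = [cache.step(q, k, v) for q, k, v in tokens]

# Reference: rebuild the key/value lists for the whole prefix at every step.
full_outputs = [attend(tokens[t][0],
                       [k for _, k, _ in tokens[:t + 1]],
                       [v for _, _, v in tokens[:t + 1]])
                for t in range(len(tokens))]
```

Both paths produce the same attention outputs; the cached path simply avoids rebuilding keys and values for earlier tokens at each decoding step, at the price of memory that grows with sequence length.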