4arXiv cs.CL (Computation and Language)·2d ago

Tool-intent stabilization analysis quantifies when streaming RAG latency hiding is possible

A new arXiv paper introduces 'tool-intent stabilization' — the point in a streaming input at which a speculative retrieval query converges to the correct result — and measures its distribution on the CRAG benchmark (1,371 questions). The authors derive a model-agnostic bound on how much tool latency can be hidden behind remaining user input, finding that at realistic operating parameters 73.9% of queries admit substantial latency hiding. The study requires no model training and validates the bound against a working streaming pipeline, also identifying query properties that predict early versus late stabilization.

Inference Economics Agent and Tool Ecosystem CRAG BM25 When Does Streaming Tool Use Help? Characterizing Tool-Intent Stabilization in Streaming Retrieval-Augmented Generation

Related guides (2)

Agent and Tool EcosystemTopic guide

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer

Read asBeginner In-depth

Inference EconomicsTopic guide

Inference Economics: The Cost of Running AI in Production

Read asBeginner In-depth

Related events (8)

6arXiv · cs.CL·19d ago·source ↗

Ghost Tool Calls: Issue-Time Privacy for Speculative Agent Tools

This paper identifies a privacy vulnerability in tool-augmented language agents that speculatively issue future tool calls to reduce latency: these 'ghost tool calls' leak inferred user intent to external services before the agent commits to a branch, and cannot be unsent after the fact. The authors argue that timing—not authorization—is the core issue, and propose Speculative Tool Privacy Contracts, a runtime abstraction treating pre-commitment observation as a distinct first-class effect. A prototype runtime is implemented and twelve policies are evaluated across three corpora, finding that only issue-time argument or destination suppression/modification actually reduces inference leakage.

Inference Economics AI Safety Research tool-augmented language agents Speculative Tool Privacy Contracts Ghost Tool Calls +2 more

6arXiv · cs.CL·25d ago·source ↗

Coverage Illusion: Post-Retrieval Cascade Design Reduces LLM Augmentation Overhead in Production RAG

A case study on the Danish National Encyclopedia's RAG system evaluates five retrieval workflows across 20,000 query-workflow pairs, revealing a 'Coverage Illusion' where synthetic queries overestimate the need for LLM augmentation (90%+) versus real production traffic (27.8%). Pre-retrieval routing cannot detect this gap because augmentation necessity is only revealed after index search. A post-retrieval cascade running workflows cheapest-first and escalating to LLM augmentation only on empty results improves quality by +0.140 Composite Overall points over Always-HyDE, reduces latency by 31.8%, and eliminates LLM augmentation for 72.2% of real queries. The work highlights a structural mismatch between synthetic and real query distributions that affects RAG system design assumptions.

Evaluation and Benchmarking Inference Economics RAG Coverage Illusion HyDE +5 more

5arXiv · cs.CL·11d ago·source ↗

DocTrace: Structure-Aware On-Demand Hypergraph Memory for Long-Document QA

Researchers introduce DocTrace, a multi-agent RAG framework for long-document question answering that uses query-triggered knowledge organization rather than costly query-agnostic preprocessing. The system combines a lightweight document structural tree index, on-demand hypergraph working memory, and a graph-structured experience memory that stores successful reasoning plans for reuse. Evaluated on four long-document QA datasets, DocTrace outperforms the strongest baseline (ComoRAG) by up to 8.85% F1 and 4.40% EM while reducing computational cost by 53.32%.

Long Context Evolution Agent and Tool Ecosystem ComoRAG DocTrace Trace Only What You Need: Structure-Aware On-Demand Hypergraph Memory for Long-Document Question Answering

4arXiv · cs.CL·9d ago·source ↗

UMG-RAG: Training-free hybrid retrieval with uncertainty-aware granularity fusion for long-document RAG

Researchers propose Uncertainty-aware Multi-Granularity RAG (UMG-RAG), a training-free hybrid retrieval framework that addresses the tension between large and fine-grained retrieval chunks in RAG pipelines. The system converts dense and sparse retriever scores across multiple chunk granularities into evidence distributions, estimates reliability via entropy, and fuses candidates using query-specific confidence signals. A variant called UMGP-RAG uses fine-grained hits to locate evidence while returning broader parent chunks for coherence. Experiments on QA benchmarks show improved generation quality with no changes to the underlying retriever or generator.

Long Context Evolution Evaluation and Benchmarking Uncertainty-Aware Hybrid Retrieval for Long-Document RAG Uncertainty-aware Multi-Granularity RAG

6arXiv · cs.CL·20d ago·source ↗

Question-Answering as Hidden State Probing for Test-Time Reasoning Intervention

This paper proposes using question-asking as an inference-time intervention to surface information about an LLM's hidden state during chain-of-thought reasoning. The authors train a probe on a student model's hidden states before and after question generation, finding it predictive of final answer correctness even before the teacher responds—suggesting self-diagnosis during question generation carries meaningful signal. They frame question-asking as a sequential decision problem with a gating policy, but find a gap between detection and recovery: interventions are as likely to harm correct trajectories as to fix incorrect ones. The results have implications for the limits of LLM self-refinement under uncertainty.

Evaluation and Benchmarking Agent and Tool Ecosystem student-teacher prompting Chain-of-Thought Reasoning inference-time intervention +4 more

4arXiv · cs.CL·13d ago·source ↗

HKVM-RAG: Hypergraph key-value separation improves multi-hop retrieval-augmented generation

A new arXiv preprint introduces HKVM-RAG, an evidence-organization layer for multi-hop RAG that uses weighted hyperedges as retrieval keys while retaining passage text as answer values. Under a fixed-substrate protocol controlling for tuple cache, reader, and evaluation budget, the hypergraph key-value approach improves over KG-PPR by +3.4 F1 on 2WikiMultiHopQA and +3.6 F1 on MuSiQue. A dense-aware controller combining frozen ColBERTv2 with HKVM features reaches 88.8, 65.1, and 85.8 F1 on three benchmarks, outperforming ColBERTv2 alone by 5–11 F1 points. The work positions hypergraph organization as a reusable evidence-control mechanism rather than a dense-retrieval replacement.

Evaluation and Benchmarking Agent and Tool Ecosystem ColBERTv2 MuSiQue 2WikiMultiHopQA +2 more

5arXiv · cs.LG·2d ago·source ↗

Probe-and-Refine Tuning improves coding agent performance via iterative repository guidance refinement

A new arXiv paper introduces probe-and-refine tuning, a procedure that uses synthetic bug-fix probes to iteratively improve AGENTS.md repository guidance files for LLM-based coding agents without requiring an agent loop during tuning. Evaluated on SWE-bench Verified with Qwen3.5-35B-A3B, the method achieves 33.0% mean resolve rate versus 28.3% for a static knowledge base baseline and 25.5% for an unguided baseline. The improvement is attributed to coverage gains—refined guidance helps agents locate the correct files rather than improving patch quality—and a step-budget experiment shows guidance is necessary for agents to productively use larger compute budgets.

Evaluation and Benchmarking Agent and Tool Ecosystem Qwen3.5-35B-A3B SWE-Bench Verified NVIDIA Nemotron-3-Nano-30B-A3B +2 more

6Anthropic News·17d ago·source ↗

Anthropic introduces Contextual Retrieval to reduce RAG retrieval failures by up to 67%

Anthropic published a technical method called Contextual Retrieval that combines Contextual Embeddings and Contextual BM25 to address the context-loss problem in traditional RAG pipelines. The approach prepends chunk-level context before encoding, reducing failed retrievals by 49% standalone and 67% when combined with reranking. The post also highlights prompt caching as a simpler alternative for knowledge bases under 200K tokens, and provides a cookbook for deployment with Claude.

Enterprise Deployment Patterns Agent and Tool Ecosystem Claude BM25 Contextual Retrieval +1 more