6arXiv cs.CL (Computation and Language)·24d ago

Coverage Illusion: Post-Retrieval Cascade Design Reduces LLM Augmentation Overhead in Production RAG

A case study on the Danish National Encyclopedia's RAG system evaluates five retrieval workflows across 20,000 query-workflow pairs, revealing a 'Coverage Illusion' where synthetic queries overestimate the need for LLM augmentation (90%+) versus real production traffic (27.8%). Pre-retrieval routing cannot detect this gap because augmentation necessity is only revealed after index search. A post-retrieval cascade running workflows cheapest-first and escalating to LLM augmentation only on empty results improves quality by +0.140 Composite Overall points over Always-HyDE, reduces latency by 31.8%, and eliminates LLM augmentation for 72.2% of real queries. The work highlights a structural mismatch between synthetic and real query distributions that affects RAG system design assumptions.

Evaluation and Benchmarking Inference Economics Enterprise Deployment Patterns Agent and Tool Ecosystem RAG Coverage Illusion HyDE Query Expansion Post-Retrieval Cascade Danish National Encyclopedia

Related guides (4)

Enterprise Deployment PatternsTopic guide

Enterprise Deployment Patterns: From AI Demo to Production Reality

Read asBeginner In-depth

Agent and Tool EcosystemTopic guide

Agent and Tool Ecosystem: How the Infrastructure Layer Around LLMs Is Consolidating

Read asIn-depth

Inference EconomicsTopic guide

Inference Economics: The Cost Structure of Running AI Models in Production

Read asIn-depth

Evaluation and BenchmarkingTopic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Read asBeginner In-depth

Related events (8)

4Hugging Face Blog·1mo ago·source ↗

Expert Support Case Study: Bolstering a RAG App with LLM-as-a-Judge

Hugging Face published a case study describing how Digital Green used an LLM-as-a-Judge approach to evaluate and improve a retrieval-augmented generation (RAG) application. The post covers the methodology for using LLMs to score and validate RAG outputs, providing a practical deployment pattern for quality assurance in production AI systems. It serves as a concrete example of enterprise-grade evaluation pipelines built on top of RAG architectures.

Evaluation and Benchmarking Enterprise Deployment Patterns LLM-as-a-Judge Digital Green Hugging Face +2 more

4arXiv · cs.CL·8d ago·source ↗

UMG-RAG: Training-free hybrid retrieval with uncertainty-aware granularity fusion for long-document RAG

Researchers propose Uncertainty-aware Multi-Granularity RAG (UMG-RAG), a training-free hybrid retrieval framework that addresses the tension between large and fine-grained retrieval chunks in RAG pipelines. The system converts dense and sparse retriever scores across multiple chunk granularities into evidence distributions, estimates reliability via entropy, and fuses candidates using query-specific confidence signals. A variant called UMGP-RAG uses fine-grained hits to locate evidence while returning broader parent chunks for coherence. Experiments on QA benchmarks show improved generation quality with no changes to the underlying retriever or generator.

Long Context Evolution Evaluation and Benchmarking Uncertainty-Aware Hybrid Retrieval for Long-Document RAG Uncertainty-aware Multi-Granularity RAG

6Anthropic News·17d ago·source ↗

Anthropic introduces Contextual Retrieval to reduce RAG retrieval failures by up to 67%

Anthropic published a technical method called Contextual Retrieval that combines Contextual Embeddings and Contextual BM25 to address the context-loss problem in traditional RAG pipelines. The approach prepends chunk-level context before encoding, reducing failed retrievals by 49% standalone and 67% when combined with reranking. The post also highlights prompt caching as a simpler alternative for knowledge bases under 200K tokens, and provides a cookbook for deployment with Claude.

Enterprise Deployment Patterns Agent and Tool Ecosystem Claude BM25 Contextual Retrieval +1 more

4arXiv · cs.CL·12d ago·source ↗

HKVM-RAG: Hypergraph key-value separation improves multi-hop retrieval-augmented generation

A new arXiv preprint introduces HKVM-RAG, an evidence-organization layer for multi-hop RAG that uses weighted hyperedges as retrieval keys while retaining passage text as answer values. Under a fixed-substrate protocol controlling for tuple cache, reader, and evaluation budget, the hypergraph key-value approach improves over KG-PPR by +3.4 F1 on 2WikiMultiHopQA and +3.6 F1 on MuSiQue. A dense-aware controller combining frozen ColBERTv2 with HKVM features reaches 88.8, 65.1, and 85.8 F1 on three benchmarks, outperforming ColBERTv2 alone by 5–11 F1 points. The work positions hypergraph organization as a reusable evidence-control mechanism rather than a dense-retrieval replacement.

Evaluation and Benchmarking Agent and Tool Ecosystem ColBERTv2 MuSiQue 2WikiMultiHopQA +2 more

5arXiv · cs.CL·4d ago·source ↗

RL-trained LLMs learn retriever-specific query formulation strategies for RAG

A new arXiv paper presents the first systematic study of using reinforcement learning to teach LLMs to adapt query formulation strategies to different retrieval backends. The authors find that different retrievers have surprisingly distinct optimal query styles (e.g., descriptive vs. question-like), making cross-retriever strategy transfer ineffective. They introduce a branching-based rollout technique to stabilize training over multi-step retrieval trajectories and show gains from retriever-specific human guidance and model scaling.

Evaluation and Benchmarking Agent and Tool Ecosystem Understanding the Behaviors of Environment-aware Information Retrieval LCO-Embedding

5arXiv · cs.AI·18d ago·source ↗

RASER: Recoverability-Aware Selective Escalation Router for Multi-Hop Question Answering

RASER introduces a family of lightweight routers that decide whether to escalate retrieval complexity for multi-hop QA without making additional LLM calls. Built on top of one-shot RAG using six derived features, RASER-2 and RASER-3 route queries to progressively more expensive retrieval strategies (PRUNE, IRCoT) only when needed. Across six LLMs and three benchmarks, the routers match SOTA F1 while consuming only 41-49% of the tokens required by always-escalating baselines.

Inference Economics Agent and Tool Ecosystem PRUNE IRCoT RASER +1 more

5Github Trending·1mo ago·source ↗

LEANN: RAG System with 97% Storage Savings for On-Device Private Retrieval

LEANN is an open-source retrieval-augmented generation (RAG) system targeting personal device deployment with claimed 97% storage reduction compared to conventional vector index approaches. The project is associated with MLsys 2026, suggesting an upcoming systems research paper. It emphasizes privacy through fully local execution and aims to maintain retrieval accuracy despite aggressive compression. The repository has accumulated over 11,000 stars with strong recent momentum.

Inference Economics Enterprise Deployment Patterns LEANN yichuan-w MLsys 2026 +2 more

6arXiv · cs.AI·8d ago·source ↗

RA-RFT: Retrieval-Augmented Reinforcement Fine-Tuning teaches LLMs to reason by analogy

Researchers propose Retrieval-Augmented Reinforcement Fine-Tuning (RA-RFT), a post-training framework that trains a retriever to rank contexts by expected reasoning benefit rather than semantic similarity, then fine-tunes a policy model via reinforcement learning using retrieved analogous demonstrations. The key insight is that reasoning-relevant retrieval surfaces complementary solution strategies rather than superficially similar problems. On mathematical reasoning benchmarks, RA-RFT improves AIME 2025 average@32 accuracy by 7.1 and 2.8 points over GRPO for Qwen3-1.7B and Qwen3-4B respectively, suggesting reasoning-aware retrieval is orthogonal to reward design and training curriculum improvements.

Evaluation and Benchmarking Alignment and RLHF RA-RFT Learning to Reason by Analogy via Retrieval-Augmented Reinforcement Fine-Tuning GRPO +3 more