5arXiv cs.CL (Computation and Language)·24d ago

Separating Semantic Competition from Context Length in RAG Reading

This paper introduces a matched-control protocol to isolate whether RAG reader failures stem from context length or semantic competition among retrieved passages. By replacing hard-competitor passages with less competitive ones while holding passage count and length fixed, the authors demonstrate a measurable competition effect on SQuAD using Phi-2 and Qwen2.5-1.5B. Phi-2 recovers +6.0 EM and +7.0 answer-inclusion points; Qwen2.5-1.5B recovers +4.5 EM and +9.0 answer-inclusion points. The study also introduces retention curves and a right-censored half-life metric to track performance degradation as competitors accumulate.

Evaluation and Benchmarking Agent and Tool Ecosystem matched-control protocol retention curve Qwen2.5-1.5B SQuAD Retrieval-Augmented Generation Phi-2

Related guides (3)

Agent and Tool EcosystemTopic guide

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer

Read asBeginner In-depth

Evaluation and BenchmarkingTopic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Read asBeginner In-depth

Retrieval-Augmented GenerationConcept

Retrieval-Augmented Generation (RAG): Giving AI a Library Card

Read asBeginner In-depth

Related events (8)

6Anthropic News·17d ago·source ↗

Anthropic introduces Contextual Retrieval to reduce RAG retrieval failures by up to 67%

Anthropic published a technical method called Contextual Retrieval that combines Contextual Embeddings and Contextual BM25 to address the context-loss problem in traditional RAG pipelines. The approach prepends chunk-level context before encoding, reducing failed retrievals by 49% standalone and 67% when combined with reranking. The post also highlights prompt caching as a simpler alternative for knowledge bases under 200K tokens, and provides a cookbook for deployment with Claude.

Enterprise Deployment Patterns Agent and Tool Ecosystem Claude BM25 Contextual Retrieval +1 more

4arXiv · cs.CL·12d ago·source ↗

HKVM-RAG: Hypergraph key-value separation improves multi-hop retrieval-augmented generation

A new arXiv preprint introduces HKVM-RAG, an evidence-organization layer for multi-hop RAG that uses weighted hyperedges as retrieval keys while retaining passage text as answer values. Under a fixed-substrate protocol controlling for tuple cache, reader, and evaluation budget, the hypergraph key-value approach improves over KG-PPR by +3.4 F1 on 2WikiMultiHopQA and +3.6 F1 on MuSiQue. A dense-aware controller combining frozen ColBERTv2 with HKVM features reaches 88.8, 65.1, and 85.8 F1 on three benchmarks, outperforming ColBERTv2 alone by 5–11 F1 points. The work positions hypergraph organization as a reusable evidence-control mechanism rather than a dense-retrieval replacement.

Evaluation and Benchmarking Agent and Tool Ecosystem ColBERTv2 MuSiQue 2WikiMultiHopQA +2 more

6arXiv · cs.CL·11d ago·source ↗

ParaEval framework reduces MCQA benchmark sensitivity to answer phrasing

A new arXiv preprint identifies a systematic flaw in multiple-choice QA benchmarks: log-likelihood scoring conflates surface-form familiarity with actual capability, producing false performance gaps exceeding 2 points between models trained on identical knowledge. The authors propose ParaEval, which queries models with multiple paraphrases per answer option and scores on the most favorable phrasing, reducing the false gap to below 1 point. The effect is confirmed on frontier 70B and 120B open-source models, suggesting widespread benchmark inflation in standard MCQA evaluations.

Evaluation and Benchmarking ParaEval

6arXiv · cs.CL·25d ago·source ↗

Semantic vs. Surface Noise in LLM Agents: 68-Cell Measurement Study with Held-Out Validation

This paper documents an empirical phenomenon across 10 LLMs from 7 architecture families: meaning-bearing perturbations (paraphrase, synonym substitution) cause final-answer inconsistency ~19.69 percentage points more often than presentation-level perturbations (formatting, reordering) of comparable severity, across GSM8K, MATH, and HotpotQA benchmarks. The effect is validated on a held-out 11th model (qwen2.5-14B-Instruct) with 1,800 trajectories. Trace-level analysis supports a 'stealth-divergence' picture where semantic perturbations preserve the first action but induce divergence in intermediate reasoning steps, while two prior mechanism claims are explicitly retracted. The study is notable for its honest reporting of stress-test failures and pre-registered replication.

Evaluation and Benchmarking AI Safety Research Qwen2.5-7B-Instruct-1M ReAct stealth-divergence +5 more

5arXiv · cs.CL·2d ago·source ↗

RECOM benchmark reveals validity-discrimination tradeoff in automatic metrics for open-ended QA

Researchers introduce RECOM, a contamination-free evaluation dataset of 15,000 r/AskReddit questions paired with authentic community replies postdating all evaluated models' training cutoffs. Testing five open-source 7–10B LLMs, the paper finds that no standard automatic metric (cosine similarity, BERTScore, LLM judges) simultaneously achieves both validity (distinguishing real from random answers) and discriminative power (ranking models against each other). Cosine similarity is valid but cannot rank models; BERTScore's apparent ranking collapses when response length is controlled. The authors argue this tradeoff is a structural property of metric representation design and recommend reporting metrics on both axes with an explicit random-baseline floor.

Evaluation and Benchmarking BERTScore RECOM r/AskReddit

5arXiv · cs.AI·18d ago·source ↗

RASER: Recoverability-Aware Selective Escalation Router for Multi-Hop Question Answering

RASER introduces a family of lightweight routers that decide whether to escalate retrieval complexity for multi-hop QA without making additional LLM calls. Built on top of one-shot RAG using six derived features, RASER-2 and RASER-3 route queries to progressively more expensive retrieval strategies (PRUNE, IRCoT) only when needed. Across six LLMs and three benchmarks, the routers match SOTA F1 while consuming only 41-49% of the tokens required by always-escalating baselines.

Inference Economics Agent and Tool Ecosystem PRUNE IRCoT RASER +1 more

6arXiv · cs.CL·1mo ago·source ↗

TextReg: Regularization Framework for Mitigating Prompt Distributional Overfitting in LLM Optimization

TextReg addresses a failure mode in iterative prompt optimization where LLM-rewritten prompts grow longer, accumulate narrow rules, and generalize poorly—termed prompt distributional overfitting. The authors formalize this via 'representational inefficiency,' a dual-factor measure decomposing prompt inefficiency into capacity cost and scope narrowness. TextReg applies a soft-penalty regularization framework using Dual-Evidence Gradient Purification, Semantic Edit Regularization, and Regularization-Guided Prompt Update. On reasoning benchmarks, it achieves up to +11.8% OOD accuracy over TextGrad and +16.5% over REVOLVE.

Evaluation and Benchmarking Agent and Tool Ecosystem TextGrad REVOLVE Semantic Edit Regularization +4 more

4arXiv · cs.CL·8d ago·source ↗

UMG-RAG: Training-free hybrid retrieval with uncertainty-aware granularity fusion for long-document RAG

Researchers propose Uncertainty-aware Multi-Granularity RAG (UMG-RAG), a training-free hybrid retrieval framework that addresses the tension between large and fine-grained retrieval chunks in RAG pipelines. The system converts dense and sparse retriever scores across multiple chunk granularities into evidence distributions, estimates reliability via entropy, and fuses candidates using query-specific confidence signals. A variant called UMGP-RAG uses fine-grained hits to locate evidence while returning broader parent chunks for coherence. Experiments on QA benchmarks show improved generation quality with no changes to the underlying retriever or generator.

Long Context Evolution Evaluation and Benchmarking Uncertainty-Aware Hybrid Retrieval for Long-Document RAG Uncertainty-aware Multi-Granularity RAG