Entity · benchmark

HotpotQA

benchmark · active · hotpotqa-138f624c · 9 events · first seen May 26, 2026

Aliases: HotpotQA

More like this (12)

FreshQA · StrategyQA · TableQA · TriviaRoomQA · ResearchQA · GQA · MedQADE · SimpleQA · ChartQA · BBQ · HOTA · QVal

Recent events (9)

5 · arXiv · cs.CL · Jul 23, 2026

SelectBench and DAPO post-training for selective evidence adoption in RAG contexts

Researchers introduce SelectBench, a benchmark and training set for evaluating whether retrieval-augmented LLMs can selectively adopt valid evidence while rejecting misleading or injected content. They post-train Qwen3.5-4B using DAPO with rule-based and semantic judge rewards, raising strict success on SelectBench-v2 from 22.46% to 26.46%, a modest gain in the intended direction. The gains do not survive Holm multiple-comparison correction, and prompt-injection resistance does not improve, so statistical robustness and injection resistance remain open challenges. General capabilities on MMLU and HotpotQA are preserved.

Evaluation and Benchmarking · AI Safety Research · SelectBench · DAPO · DeepSeek V4 · +3 more
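
For readers unfamiliar with the Holm correction mentioned in the SelectBench summary, the following is a minimal sketch of the standard Holm step-down procedure. The p-values are illustrative placeholders, not figures from the paper, and the function name `holm_reject` is hypothetical. It shows how several results that each look significant at 0.05 on their own can all fail once corrected.

```python
def holm_reject(p_values, alpha=0.05):
    """Holm step-down procedure: True where a hypothesis is rejected."""
    m = len(p_values)
    # Visit hypotheses from the smallest p-value to the largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (0-based rank) is tested at alpha / (m - rank).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return reject


# Illustrative p-values only (assumed, not taken from the paper).
# Each is below 0.05 uncorrected, yet none survives the correction.
print(holm_reject([0.02, 0.03, 0.04]))  # [False, False, False]
```

The smallest p-value here (0.02) is compared against 0.05 / 3, fails, and the procedure stops without rejecting anything.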

6 · arXiv · cs.CL · Jul 17, 2026

Bridge Evidence: Static Retrieval Utility Does Not Predict Causal Utility in Multi-Step Agentic Search

A new arXiv paper shows empirically that static retrieval relevance scores are nearly statistically independent of causal utility in multi-step agentic search (Spearman rho = -0.026 across 23,322 document observations). Using a ReAct-style agent over HotpotQA with counterfactual trajectory replays, the authors show that roughly a third of documents are causally load-bearing ('bridge documents') while appearing useless to a static reader. The authors identify the mechanism: bridge documents supply discriminative entities that redirect subsequent queries, so optimizing static RAG utility does not deliver agentic utility.

Evaluation and Benchmarking · Agent and Tool Ecosystem · BM25 · HotpotQA · Observable Entity Relevance · +2 more
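
The Spearman rho reported in the Bridge Evidence summary is a rank correlation: it is the Pearson correlation computed on ranks rather than raw values. The sketch below is a plain-Python version with average ranks for ties. The two score lists are made-up placeholders, and the names `static_relevance` and `causal_utility` are hypothetical, not the paper's variables.

```python
from statistics import mean


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Pearson correlation of the rank-transformed inputs."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Placeholder data: a static relevance score per document and a 0/1 flag
# for whether removing the document changed the agent's trajectory.
static_relevance = [0.91, 0.15, 0.40, 0.77, 0.05, 0.62]
causal_utility = [0, 1, 0, 1, 0, 1]
print(round(spearman_rho(static_relevance, causal_utility), 3))
```

A value near zero, as in the paper's reported -0.026, means that ranking documents by static relevance says almost nothing about their ranking by causal effect.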

5 · arXiv · cs.CL · Jul 16, 2026

DeepStress: A stress-testing framework for search agent robustness to poor-quality evidence

DeepStress is a new evaluation framework that stress-tests search agents by replacing their retrieval module with a controlled synthetic environment, allowing systematic manipulation of document trustworthiness, relevance, and factuality. The authors test several search agents on HotpotQA and BrowseCompPlus, revealing substantial performance differences in handling unreliable information. The work introduces new metrics to capture system outcomes and conflicts between parametric and retrieved knowledge, addressing a gap in realistic benchmarks that rarely surface low-quality evidence scenarios.

Evaluation and Benchmarking · Agent and Tool Ecosystem · BrowseComp-Plus · DeepStress · HotpotQA

5 · arXiv · cs.CL · Jul 8, 2026

DynaKRAG: Learnable state-conditioned control policy for multi-hop RAG evidence acquisition

DynaKRAG is a new framework that formulates multi-hop retrieval-augmented generation as a state-conditioned control problem over atomic evidence operations (iterative retrieval, query reformulation, sufficiency judging, etc.), using a learned controller to select among valid operations at each step. Evaluated with Qwen2.5-7B-Instruct, it achieves F1 scores of 0.5998 on HotpotQA, 0.5340 on 2WikiMultiHopQA, and 0.3061 on MuSiQue, outperforming the strongest baselines on all three benchmarks. Ablations show that replacing the learned controller with a uniform policy costs 3.96–5.78 F1 points, and that additional retrieval is not uniformly beneficial.

Evaluation and Benchmarking · Agent and Tool Ecosystem · MuSiQue · 2WikiMultiHopQA · Qwen2.5-7B-Instruct-1M · +2 more
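
The F1 figures in the DynaKRAG summary are most likely answer-level token F1, the usual metric on HotpotQA-style benchmarks, although the summary does not say which scoring script was used. The sketch below follows the common SQuAD-style recipe (lowercase, strip punctuation and articles, then compare token bags). It is a simplification and omits the special handling of yes/no answers found in the official HotpotQA evaluation script.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse spaces."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def token_f1(prediction, gold):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(token_f1("The Eiffel Tower", "Eiffel Tower"))  # 1.0
print(token_f1("Paris", "London"))                   # 0.0
```

A benchmark-level score such as 0.5998 would then be the mean of this per-question value over the evaluation set.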

6 · arXiv · cs.CL · Jun 29, 2026

LLMs judge worse than they generate: empirical challenge to self-evaluation pipeline assumptions

A new arXiv preprint tests the implicit assumption that LLM evaluation is easier than generation, using a controlled in-context QA setup across four benchmarks (SQuAD 2.0, DROP, HotpotQA, MuSiQue) and two models. Results show generation accuracy exceeds self-evaluation accuracy on three of four benchmarks, with attention analysis revealing that evaluation attends to context 3–5x less than generation does. LoRA fine-tuning experiments confirm the asymmetry is not a training artifact, with cross-task interference observed in both directions. The findings directly challenge assumptions underlying LLM-as-a-Judge and self-evaluation pipelines widely used in RLHF and agentic systems.

Evaluation and Benchmarking · Alignment and RLHF · MuSiQue · Can LLMs Judge Better Than They Generate? Evaluating Task Asymmetry, Mechanistic Interpretability and Transferability for In-Context QA · LoRA · +3 more

6 · arXiv · cs.CL · Jun 18, 2026

Decoupled Search Grounding (DSG): vendor-agnostic MCP-compatible architecture for LLM agent retrieval

Researchers introduce Decoupled Search Grounding (DSG), an architecture that moves real-time search grounding outside the reasoning model via an MCP-compatible gateway, exposing provider routing, caching, and retrieval-depth as explicit controls. Evaluated across five frontier models on SimpleQA, FreshQA, and HotpotQA, DSG nearly matches native search accuracy on SimpleQA (86.1% vs. 87.7%) while achieving 91% lower search cost and 68% lower latency via a 99.4% warm-cache hit rate. In a production e-commerce deployment, DSG cuts search cost by over 98% while matching or slightly exceeding native-search accuracy. The work frames real-time grounding as an optimizable interface boundary rather than a fixed model feature, with direct relevance to MCP-based agent infrastructure.

Inference Economics · Enterprise Deployment Patterns · FreshQA · HotpotQA · Decoupled Search Grounding · +3 more
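
To make the caching control in the DSG summary concrete, here is a minimal sketch of a gateway that sits in front of a search provider, caches results with a time-to-live, and tracks its hit rate. Everything in it is an assumption for illustration: the class `CachedSearchGateway`, the `provider` callable, and the TTL policy are hypothetical and are not the paper's design or the MCP API.

```python
import time


class CachedSearchGateway:
    """Hypothetical gateway: caches provider results and counts hits."""

    def __init__(self, provider, ttl_seconds=300.0, clock=time.monotonic):
        self._provider = provider  # callable: query string -> results
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache = {}           # normalized query -> (timestamp, results)
        self.hits = 0
        self.misses = 0

    def search(self, query):
        # Normalize case and whitespace so trivially different queries share an entry.
        key = " ".join(query.lower().split())
        now = self._clock()
        entry = self._cache.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            self.hits += 1
            return entry[1]
        # Miss or expired entry: pay for a provider call and store the result.
        self.misses += 1
        results = self._provider(query)
        self._cache[key] = (now, results)
        return results

    @property
    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


provider_calls = []


def fake_provider(query):
    """Stand-in for a paid search API; records each billable call."""
    provider_calls.append(query)
    return f"results for {query}"


gateway = CachedSearchGateway(fake_provider)
gateway.search("HotpotQA benchmark")
gateway.search("hotpotqa  benchmark")  # served from cache
print(len(provider_calls), gateway.hit_rate)  # 1 0.5
```

Because only misses reach the provider, search cost in a setup like this scales with the miss rate, which is how a very high warm-cache hit rate can translate into a large cost reduction.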

6 · arXiv · cs.LG · Jun 12, 2026

Operadic consistency: a label-free signal for detecting compositional reasoning failures in LLMs

Researchers introduce operadic consistency (OC), a label-free inference-time signal that checks whether an LLM's direct answer to a compositional query agrees with the answer produced by composing its own stated decomposition of that query. Evaluated across 12 instruction-tuned LLMs (4B–671B parameters) on four multi-hop QA datasets, OC correlates with accuracy at Pearson r ∈ [0.86, 0.94] on every dataset, and it is more robust across datasets than self-consistency, semantic entropy, and P(True). At the per-question level, OC provides information beyond existing baselines and yields selective-prediction improvements (AUARC lifts +0.086–0.096, AUROC lifts +0.092–0.164) at equal sampling cost, with results extending to frontier thinking models using chain-of-thought decompositions.

Evaluation and Benchmarking · AI Safety Research · operadic consistency · Chain-of-Thought · Self-Consistency · MuSiQue · +6 more
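
The check described in the operadic consistency summary can be illustrated with a toy direct-versus-composed comparison. This is only a sketch of the general idea: the helper names, the `{prev}` placeholder convention, the exact-match comparison, and the lookup-table "model" are all assumptions, and the paper's actual decomposition and agreement measure may differ.

```python
def composed_answer(ask, sub_questions):
    """Answer sub-questions in order, feeding each answer into the next."""
    answer = ""
    for template in sub_questions:
        # "{prev}" (an assumed convention) is replaced by the previous answer.
        answer = ask(template.format(prev=answer))
    return answer


def is_consistent(ask, question, sub_questions):
    """True if the direct answer matches the composed one (case-insensitive)."""
    direct = ask(question)
    composed = composed_answer(ask, sub_questions)
    return direct.strip().lower() == composed.strip().lower()


# Toy stand-in for an LLM: a fixed lookup table of prompts to answers.
facts = {
    "Who directed Inception?": "Christopher Nolan",
    "Where was Christopher Nolan born?": "London",
    "Where was the director of Inception born?": "London",
}


def toy_ask(prompt):
    return facts.get(prompt, "unknown")


print(is_consistent(
    toy_ask,
    "Where was the director of Inception born?",
    ["Who directed Inception?", "Where was {prev} born?"],
))  # True
```

No gold label is needed: the signal comes entirely from whether the model agrees with its own decomposition, which is what makes a check of this kind usable at inference time.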

4 · arXiv · cs.CL · Jun 8, 2026

HKVM-RAG: Hypergraph key-value separation improves multi-hop retrieval-augmented generation

A new arXiv preprint introduces HKVM-RAG, an evidence-organization layer for multi-hop RAG that uses weighted hyperedges as retrieval keys while retaining passage text as answer values. Under a fixed-substrate protocol controlling for tuple cache, reader, and evaluation budget, the hypergraph key-value approach improves over KG-PPR by +3.4 F1 on 2WikiMultiHopQA and +3.6 F1 on MuSiQue. A dense-aware controller combining frozen ColBERTv2 with HKVM features reaches 88.8, 65.1, and 85.8 F1 on three benchmarks, outperforming ColBERTv2 alone by 5–11 F1 points. The work positions hypergraph organization as a reusable evidence-control mechanism rather than a dense-retrieval replacement.

Evaluation and Benchmarking · Agent and Tool Ecosystem · ColBERTv2 · MuSiQue · 2WikiMultiHopQA · +2 more

6 · arXiv · cs.CL · May 26, 2026

Semantic vs. Surface Noise in LLM Agents: 68-Cell Measurement Study with Held-Out Validation

This paper documents an empirical phenomenon across 10 LLMs from 7 architecture families: on the GSM8K, MATH, and HotpotQA benchmarks, meaning-bearing perturbations (paraphrase, synonym substitution) cause final-answer inconsistency ~19.69 percentage points more often than presentation-level perturbations (formatting, reordering) of comparable severity. The effect is validated on a held-out 11th model (Qwen2.5-14B-Instruct) with 1,800 trajectories. Trace-level analysis supports a 'stealth-divergence' picture in which semantic perturbations preserve the first action but induce divergence in intermediate reasoning steps, while two prior mechanism claims are explicitly retracted. The study is notable for its honest reporting of stress-test failures and its pre-registered replication.

Evaluation and Benchmarking · AI Safety Research · Qwen2.5-7B-Instruct-1M · ReAct · stealth-divergence · +5 more