Almanac
benchmark

HotpotQA

benchmark · active · provisional · hotpotqa-138f624c · 3 events · first seen 22d ago

Aliases: HotpotQA

Recent events (3)

6 · arXiv · cs.CL · 22d ago

Semantic vs. Surface Noise in LLM Agents: 68-Cell Measurement Study with Held-Out Validation

This paper reports an empirical effect across 10 LLMs from 7 architecture families: on the GSM8K, MATH, and HotpotQA benchmarks, meaning-bearing perturbations (paraphrase, synonym substitution) cause final-answer inconsistency about 19.69 percentage points more often than presentation-level perturbations (formatting, reordering) of comparable severity. The effect is validated on a held-out 11th model (Qwen2.5-14B-Instruct) with 1,800 trajectories. Trace-level analysis supports a 'stealth-divergence' picture in which semantic perturbations preserve the first action but induce divergence in intermediate reasoning steps; two prior mechanism claims are explicitly retracted. The study also reports its stress-test failures and includes a pre-registered replication.
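
The headline number is a difference between two inconsistency rates, expressed in percentage points. The sketch below shows that arithmetic on toy data. The record layout, the class labels `"semantic"` and `"surface"`, and the exact-match comparison are assumptions made for illustration, not the paper's protocol.

```python
from collections import defaultdict


def inconsistency_rate(pairs):
    """Fraction of (baseline_answer, perturbed_answer) pairs that disagree."""
    if not pairs:
        raise ValueError("need at least one answer pair")
    return sum(1 for base, pert in pairs if base != pert) / len(pairs)


def semantic_surface_gap(records):
    """Gap in percentage points between semantic and surface inconsistency.

    `records` is an iterable of (perturbation_class, baseline_answer,
    perturbed_answer) tuples; this layout is hypothetical.
    """
    by_class = defaultdict(list)
    for cls, base, pert in records:
        by_class[cls].append((base, pert))
    semantic = inconsistency_rate(by_class["semantic"])
    surface = inconsistency_rate(by_class["surface"])
    return 100.0 * (semantic - surface)


# Toy data: 2 of 4 semantic runs flip the answer, 1 of 4 surface runs does.
records = [
    ("semantic", "42", "42"),
    ("semantic", "42", "41"),
    ("semantic", "7", "9"),
    ("semantic", "Paris", "Paris"),
    ("surface", "42", "42"),
    ("surface", "42", "42"),
    ("surface", "7", "9"),
    ("surface", "Paris", "Paris"),
]
gap = semantic_surface_gap(records)
print(f"gap: {gap:.2f} percentage points")  # gap: 25.00 percentage points
```

A gap in percentage points is an absolute difference between rates, so 50% versus 25% is a 25-point gap, not a 100% relative increase.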

4 · arXiv · cs.CL · 9d ago

HKVM-RAG: Hypergraph key-value separation improves multi-hop retrieval-augmented generation

A new arXiv preprint introduces HKVM-RAG, an evidence-organization layer for multi-hop RAG that uses weighted hyperedges as retrieval keys while retaining passage text as answer values. Under a fixed-substrate protocol controlling for tuple cache, reader, and evaluation budget, the hypergraph key-value approach improves over KG-PPR by +3.4 F1 on 2WikiMultiHopQA and +3.6 F1 on MuSiQue. A dense-aware controller combining frozen ColBERTv2 with HKVM features reaches 88.8, 65.1, and 85.8 F1 on three benchmarks, outperforming ColBERTv2 alone by 5–11 F1 points. The work positions hypergraph organization as a reusable evidence-control mechanism rather than a dense-retrieval replacement.
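
The key-value separation described above can be pictured as matching a query against weighted hyperedges (the keys) and returning the passages they point to (the values). The sketch below is a toy illustration only: the `Hyperedge` fields, the overlap-times-weight score, and the `retrieve` helper are all assumptions, not the HKVM-RAG implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hyperedge:
    """Retrieval key: a weighted set of entities linked to one passage."""
    entities: frozenset
    weight: float
    passage_id: str


def retrieve(query_entities, hyperedges, passages, top_k=1):
    """Score hyperedges against the query, then return passage text.

    Scoring (weight * number of shared entities) is an assumed stand-in
    for whatever scoring the preprint actually uses.
    """
    query = set(query_entities)
    scored = []
    for edge in hyperedges:
        overlap = len(query & edge.entities)
        if overlap:
            scored.append((edge.weight * overlap, edge.passage_id))
    # Highest score first; passage id breaks ties deterministically.
    scored.sort(key=lambda item: (-item[0], item[1]))

    results, seen = [], set()
    for _, passage_id in scored:
        if passage_id not in seen:
            seen.add(passage_id)
            results.append(passages[passage_id])  # value lookup
        if len(results) == top_k:
            break
    return results


passages = {
    "p1": "Passage text about entities A, B and C.",
    "p2": "Passage text about entities A and D.",
}
hyperedges = [
    Hyperedge(frozenset({"A", "B", "C"}), 1.0, "p1"),
    Hyperedge(frozenset({"A", "D"}), 0.5, "p2"),
]
top = retrieve({"A", "B"}, hyperedges, passages, top_k=1)
both = retrieve({"A", "B"}, hyperedges, passages, top_k=2)
```

Matching happens only on the hyperedge keys; the passage text is never scored directly and is fetched only once a key has been selected.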

6 · arXiv · cs.LG · 5d ago

Operadic consistency: a label-free signal for detecting compositional reasoning failures in LLMs

Researchers introduce operadic consistency (OC), a label-free inference-time signal that checks whether an LLM's direct answer to a compositional query agrees with the answer produced by composing its own stated decomposition of that query. Evaluated across 12 instruction-tuned LLMs (4B–671B parameters) on four multi-hop QA datasets, OC correlates with accuracy at Pearson r ∈ [0.86, 0.94] on every dataset and is more robust across datasets than self-consistency, semantic entropy, and P(True). At the per-question level, OC provides information beyond existing baselines and yields selective-prediction improvements (AUARC lifts of +0.086–0.096, AUROC lifts of +0.092–0.164) at equal sampling cost. The results extend to frontier thinking models that use chain-of-thought decompositions.
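
The check described above compares two routes to the same answer: asking the question directly, and chaining the answers to its sub-questions. A minimal sketch follows, assuming the decomposition has already been elicited from the model as a list of templates. The `ask` callable, the `{prev}` placeholder convention, and the exact-match comparison after normalization are all hypothetical choices, not the paper's procedure.

```python
def normalize(answer):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(answer.lower().split())


def operadic_consistency(ask, question, sub_questions):
    """Return True if the direct answer matches the composed answer.

    `ask` is any callable mapping a question string to an answer string.
    Each sub-question may contain `{prev}`, which is filled with the
    answer to the previous sub-question (an assumed convention).
    """
    direct = ask(question)
    composed = ""
    for template in sub_questions:
        composed = ask(template.format(prev=composed))
    return normalize(direct) == normalize(composed)


# Canned lookups standing in for a real model, so the example runs offline.
consistent_model = {
    "What is the capital of the country where the Eiffel Tower is?": "Paris",
    "Which country is the Eiffel Tower in?": "France",
    "What is the capital of France?": "Paris",
}.get
inconsistent_model = {
    "What is the capital of the country where the Eiffel Tower is?": "Lyon",
    "Which country is the Eiffel Tower in?": "France",
    "What is the capital of France?": "Paris",
}.get

question = "What is the capital of the country where the Eiffel Tower is?"
steps = ["Which country is the Eiffel Tower in?", "What is the capital of {prev}?"]

ok = operadic_consistency(consistent_model, question, steps)
bad = operadic_consistency(inconsistent_model, question, steps)
```

The check needs no gold label: it flags the second model only because its two routes disagree, which is what makes the signal usable at inference time.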