MuSiQue
musique-3e68c393 · 3 events · first seen 21d ago
Aliases: MuSiQue
Recent events (3)
BRANE: Natural Language Query-to-Configuration Selection for Retrieval Agents
BRANE is a system that dynamically selects retrieval agent pipeline configurations (LLM, retriever, number of hops, synthesis strategy) at inference time based on per-query characteristics and a cost-quality target. It uses an LLM to extract workload features from each query, then applies lightweight per-configuration predictors to estimate correctness, selecting the configuration that maximizes predicted accuracy penalized by cost. Evaluated on MuSiQue, BrowseComp-Plus, and FinanceBench, BRANE matches best-fixed-configuration accuracy at up to 89% lower cost and outperforms LLM-routing and fine-tuned Qwen3-4B baselines. The work frames per-query pipeline configuration as a practical alternative to static workload-level tuning.
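The selection rule described above (predicted accuracy penalized by cost) can be illustrated with a minimal sketch. This is not BRANE's implementation: the `Config` type, the `cost_weight` parameter, the linear penalty, and the toy predictors are all assumptions made for illustration, and the feature dictionary stands in for whatever the LLM-based feature extractor would return.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Features = Dict[str, float]


@dataclass(frozen=True)
class Config:
    """A hypothetical pipeline configuration with an assumed per-query cost."""
    name: str
    cost: float


def select_config(
    features: Features,
    configs: List[Config],
    predictors: Dict[str, Callable[[Features], float]],
    cost_weight: float,
) -> Config:
    """Pick the config with the highest (predicted correctness - cost_weight * cost).

    The linear penalty is an assumption; the summary only says accuracy is
    'penalized by cost'.
    """
    def score(cfg: Config) -> float:
        return predictors[cfg.name](features) - cost_weight * cfg.cost

    return max(configs, key=score)


# Toy setup: two configurations and hand-written stand-in predictors.
CONFIGS = [Config("cheap", cost=0.1), Config("expensive", cost=1.0)]
PREDICTORS = {
    "cheap": lambda f: 0.9 if f["hops"] <= 1 else 0.4,
    "expensive": lambda f: 0.95 if f["hops"] <= 1 else 0.85,
}

easy = select_config({"hops": 1}, CONFIGS, PREDICTORS, cost_weight=0.2)
hard = select_config({"hops": 3}, CONFIGS, PREDICTORS, cost_weight=0.2)
```

In this toy setup the single-hop query is routed to the cheap configuration and the three-hop query to the expensive one; raising `cost_weight` shifts more queries toward cheaper configurations.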
HKVM-RAG: Hypergraph key-value separation improves multi-hop retrieval-augmented generation
A new arXiv preprint introduces HKVM-RAG, an evidence-organization layer for multi-hop RAG that uses weighted hyperedges as retrieval keys while retaining passage text as answer values. Under a fixed-substrate protocol controlling for tuple cache, reader, and evaluation budget, the hypergraph key-value approach improves over KG-PPR by +3.4 F1 on 2WikiMultiHopQA and +3.6 F1 on MuSiQue. A dense-aware controller combining frozen ColBERTv2 with HKVM features reaches 88.8, 65.1, and 85.8 F1 on three benchmarks, outperforming ColBERTv2 alone by 5–11 F1 points. The work positions hypergraph organization as a reusable evidence-control mechanism rather than a dense-retrieval replacement.
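The key-value separation can be pictured with a small data-structure sketch: hyperedges act as retrieval keys, and each key points at the passage text that is eventually handed to the reader. Everything below is hypothetical. The `HyperedgeKey` type, the overlap-times-weight scoring, and the `retrieve` function are illustrative assumptions, not the preprint's actual method.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class HyperedgeKey:
    """Retrieval key: a weighted hyperedge over entities (hypothetical layout)."""
    entities: FrozenSet[str]
    weight: float
    passage_id: int  # index of the passage text stored as the value


def retrieve(
    query_entities: Iterable[str],
    keys: List[HyperedgeKey],
    passages: List[str],
    top_k: int = 2,
) -> List[str]:
    """Score keys by weight * entity overlap (an assumed scoring rule),
    then return the passage text attached to the best-scoring keys."""
    query = set(query_entities)
    scored = []
    for key in keys:
        overlap = len(query & key.entities)
        if overlap:
            scored.append((key.weight * overlap, key.passage_id))
    # Highest score first; break ties by passage id for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))

    seen, results = set(), []
    for _, pid in scored:
        if pid not in seen:
            seen.add(pid)
            results.append(passages[pid])
        if len(results) == top_k:
            break
    return results


PASSAGES = [
    "A was born in B.",
    "B is the capital of C.",
    "D is unrelated.",
]
KEYS = [
    HyperedgeKey(frozenset({"A", "B"}), 1.0, 0),
    HyperedgeKey(frozenset({"B", "C"}), 0.8, 1),
    HyperedgeKey(frozenset({"D"}), 0.5, 2),
]

hits = retrieve({"A", "B"}, KEYS, PASSAGES)
```

The point of the separation is that matching happens over the compact hyperedge keys, while the reader still receives the full passage text rather than extracted tuples.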
Operadic consistency: a label-free signal for detecting compositional reasoning failures in LLMs
Researchers introduce operadic consistency (OC), a label-free inference-time signal that checks whether an LLM's direct answer to a compositional query agrees with the answer produced by composing its own stated decomposition of that query. Evaluated across 12 instruction-tuned LLMs (4B–671B parameters) on four multi-hop QA datasets, OC correlates with accuracy at Pearson r ∈ [0.86, 0.94] on every dataset, and it is more robust across datasets than self-consistency, semantic entropy, and P(True). At the per-question level, OC provides information beyond existing baselines and yields selective-prediction improvements (AUARC lifts of +0.086–0.096, AUROC lifts of +0.092–0.164) at equal sampling cost. The results extend to frontier thinking models that use chain-of-thought decompositions.
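The consistency check can be sketched as follows, under several assumptions: the decomposition is passed in as a list of sub-question templates (in the described setup it would be elicited from the model itself), a hypothetical `{prev}` placeholder carries each sub-answer into the next sub-question, and agreement is decided by normalized exact match. The `ask` callable stands in for a model call; here it is a toy lookup table.

```python
from typing import Callable, Dict, List, Tuple

Ask = Callable[[str], str]


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace before comparing answers."""
    return " ".join(answer.lower().split())


def composed_answer(sub_questions: List[str], ask: Ask) -> str:
    """Answer sub-questions in order, substituting the previous answer
    into the hypothetical '{prev}' placeholder."""
    prev = ""
    for template in sub_questions:
        prev = ask(template.replace("{prev}", prev))
    return prev


def is_consistent(question: str, sub_questions: List[str], ask: Ask) -> bool:
    """True when the direct answer matches the composed answer."""
    direct = ask(question)
    return normalize(direct) == normalize(composed_answer(sub_questions, ask))


def consistency_rate(items: List[Tuple[str, List[str]]], ask: Ask) -> float:
    """Fraction of queries whose direct and composed answers agree."""
    checks = [is_consistent(q, subs, ask) for q, subs in items]
    return sum(checks) / len(checks)


# Toy 'model': a lookup table with one consistent and one inconsistent case.
TOY_ANSWERS: Dict[str, str] = {
    "What is the capital of the country where X was born?": "Paris",
    "Where was X born?": "France",
    "What is the capital of France?": "Paris",
    "What is the capital of the country where Y was born?": "Rome",
    "Where was Y born?": "Spain",
    "What is the capital of Spain?": "Madrid",
}


def toy_ask(question: str) -> str:
    return TOY_ANSWERS[question]


ITEMS = [
    ("What is the capital of the country where X was born?",
     ["Where was X born?", "What is the capital of {prev}?"]),
    ("What is the capital of the country where Y was born?",
     ["Where was Y born?", "What is the capital of {prev}?"]),
]

rate = consistency_rate(ITEMS, toy_ask)
```

No gold labels appear anywhere in the check, which is what makes the signal label-free: it compares the model only against itself.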