Entity · benchmark

MuSiQue

benchmark · active · musique-3e68c393 · 7 events · first seen May 27, 2026

Aliases: MuSiQue

More like this (12)

Quesma · CMU-MOSEI · MuSciClaims · Sierra · MOSAIC · QMSUM · SENSIA · CUAD · QM9 · MuRIL · Multica · Qwen3.5 MoE

Recent events (7)

5 · arXiv · cs.CL · Jul 22, 2026

MaLoRA and MaRA: Selective state-space adapters improve multi-hop reasoning over LoRA

A new arXiv preprint proposes two adapter families — MaLoRA (token-level dynamic scaling via Mamba recurrence) and MaRA (context-level segment retrieval via cross-segment state tracking) — as improvements over standard LoRA for language model reasoning. Evaluated on three frozen backbones (Qwen-2.5-7B, Llama-3.1-8B, Gemma-2-9B) and two multi-hop QA benchmarks (MuSiQue, 2WikiMultihopQA), the methods yield average gains of +6.8 F1 (+10.5% relative) over LoRA, with up to +18.2% relative improvement on the hardest configuration. Token-level gains also transfer to RULER QA-2 under length-stress conditions.

Long Context Evolution · Evaluation and Benchmarking · MaRA · Gemma 2 9B · MaLoRA

5 · arXiv · cs.CL · Jul 8, 2026

DynaKRAG: Learnable state-conditioned control policy for multi-hop RAG evidence acquisition

DynaKRAG is a new framework that formulates multi-hop retrieval-augmented generation as a state-conditioned control problem over atomic evidence operations (iterative retrieval, query reformulation, sufficiency judging, etc.), using a learned controller to select among valid operations at each step. Evaluated with Qwen2.5-7B-Instruct, it achieves F1 scores of 0.5998 on HotpotQA, 0.5340 on 2WikiMultiHopQA, and 0.3061 on MuSiQue, outperforming the strongest baselines on all three benchmarks. Ablations show that replacing the learned controller with a uniform policy costs 3.96–5.78 F1 points, and that additional retrieval is not uniformly beneficial.

Evaluation and Benchmarking · Agent and Tool Ecosystem · MuSiQue · 2WikiMultiHopQA · Qwen2.5-7B-Instruct-1M

5 · arXiv · cs.CL · Jul 2, 2026

LOCOS: Logit-Contribution Scoring identifies non-literal retrieval heads in long-context LLMs

A new arXiv preprint introduces Logit-Contribution Scoring (LOCOS), a method for identifying attention heads responsible for non-literal retrieval in long-context LLMs — cases where models synthesize answers from meaning rather than copying tokens verbatim. Existing detectors fail at this task because they rely on a literal-copy criterion that misses the output-value (OV) circuit mechanism. Evaluated across Qwen3, Gemma-3, and OLMo-3.1, LOCOS outperforms prior attention-based detectors on the NoLiMa benchmark, with ablation of 50 heads on Qwen3-8B collapsing ROUGE-L from 0.401 to 0.000 while the best baseline retains 0.292. The identified heads are retrieval-specific, leaving parametric recall and arithmetic reasoning unaffected.

Long Context Evolution · Evaluation and Benchmarking · MuSiQue · OLMo-3 · Gemma-3-4B-IT

6 · arXiv · cs.CL · Jun 29, 2026

LLMs judge worse than they generate: empirical challenge to self-evaluation pipeline assumptions

A new arXiv preprint tests the implicit assumption that LLM evaluation is easier than generation, using a controlled in-context QA setup across four benchmarks (SQuAD 2.0, DROP, HotpotQA, MuSiQue) and two models. Results show generation accuracy exceeds self-evaluation accuracy on three of four benchmarks, with attention analysis revealing that evaluation attends to context 3–5x less than generation does. LoRA fine-tuning experiments confirm the asymmetry is not a training artifact, with cross-task interference observed in both directions. The findings directly challenge assumptions underlying LLM-as-a-Judge and self-evaluation pipelines widely used in RLHF and agentic systems.

Evaluation and Benchmarking · Alignment and RLHF · MuSiQue · Can LLMs Judge Better Than They Generate? Evaluating Task Asymmetry, Mechanistic Interpretability and Transferability for In-Context QA · LoRA

6 · arXiv · cs.LG · Jun 12, 2026

Operadic consistency: a label-free signal for detecting compositional reasoning failures in LLMs

Researchers introduce operadic consistency (OC), a label-free inference-time signal that checks whether an LLM's direct answer to a compositional query agrees with the answer produced by composing its own stated decomposition of that query. Evaluated across 12 instruction-tuned LLMs (4B–671B parameters) on four multi-hop QA datasets, OC correlates with accuracy at Pearson r ∈ [0.86, 0.94] on every dataset and is more robust across datasets than self-consistency, semantic entropy, and P(True). At the per-question level, OC provides information beyond existing baselines and yields selective-prediction improvements (AUARC lifts of +0.086–0.096, AUROC lifts of +0.092–0.164) at equal sampling cost. The results extend to frontier thinking models that use chain-of-thought decompositions.

Evaluation and Benchmarking · AI Safety Research · operadic consistency · Chain-of-Thought · Self-Consistency · MuSiQue
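
The consistency check summarized in this entry can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration rather than the paper's implementation: `ask` stands for any callable that sends a question to a model and returns its answer, the decomposition is represented as a chain of sub-question templates with a hypothetical `{prev}` placeholder, and agreement is approximated by exact match after light normalization (the preprint may compare answers differently).

```python
from typing import Callable, Sequence


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different answers match."""
    return " ".join(text.lower().split())


def composed_answer(ask: Callable[[str], str], sub_questions: Sequence[str]) -> str:
    """Answer the sub-questions in order, feeding each answer into the next.

    Each template may contain the hypothetical placeholder {prev}, which is
    replaced by the answer to the previous sub-question.
    """
    prev = ""
    for template in sub_questions:
        prev = ask(template.format(prev=prev))
    return prev


def is_consistent(
    ask: Callable[[str], str], query: str, sub_questions: Sequence[str]
) -> bool:
    """True when the direct answer agrees with the composed answer.

    No gold label is needed: both answers come from the model itself.
    """
    direct = ask(query)
    composed = composed_answer(ask, sub_questions)
    return normalize(direct) == normalize(composed)
```

With a stubbed `ask` backed by a lookup table, `is_consistent(ask, "Where was the director of Inception born?", ["Who directed Inception?", "Where was {prev} born?"])` returns `True` when the stub's direct answer and its two-step chain both end at the same string. Averaging this boolean over a set of queries gives a dataset-level rate of the kind that could be compared against accuracy.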

4 · arXiv · cs.CL · Jun 8, 2026

HKVM-RAG: Hypergraph key-value separation improves multi-hop retrieval-augmented generation

A new arXiv preprint introduces HKVM-RAG, an evidence-organization layer for multi-hop RAG that uses weighted hyperedges as retrieval keys while retaining passage text as answer values. Under a fixed-substrate protocol controlling for tuple cache, reader, and evaluation budget, the hypergraph key-value approach improves over KG-PPR by +3.4 F1 on 2WikiMultiHopQA and +3.6 F1 on MuSiQue. A dense-aware controller combining frozen ColBERTv2 with HKVM features reaches 88.8, 65.1, and 85.8 F1 on three benchmarks, outperforming ColBERTv2 alone by 5–11 F1 points. The work positions hypergraph organization as a reusable evidence-control mechanism rather than a dense-retrieval replacement.

Evaluation and Benchmarking · Agent and Tool Ecosystem · ColBERTv2 · MuSiQue · 2WikiMultiHopQA

6 · arXiv · cs.AI · May 27, 2026

BRANE: Natural Language Query-to-Configuration Selection for Retrieval Agents

BRANE is a system that dynamically selects retrieval agent pipeline configurations (LLM, retriever, number of hops, synthesis strategy) at inference time based on per-query characteristics and a cost-quality target. It uses an LLM to extract workload features from each query, then applies lightweight per-configuration predictors to estimate correctness, selecting the configuration that maximizes predicted accuracy penalized by cost. Evaluated on MuSiQue, BrowseComp-Plus, and FinanceBench, BRANE matches best-fixed-configuration accuracy at up to 89% lower cost and outperforms LLM-routing and fine-tuned Qwen3-4B baselines. The work frames per-query pipeline configuration as a practical alternative to static workload-level tuning.

Evaluation and Benchmarking · Inference Economics · BrowseComp-Plus · MuSiQue · Qwen3-4B
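
The selection rule described in this entry, choosing the configuration with the highest predicted accuracy after a cost penalty, can be sketched as follows. This is a toy illustration under stated assumptions, not BRANE's actual code: the `PipelineConfig` type, the `cost_weight` parameter, and the linear form `predicted accuracy - cost_weight * cost` are all hypothetical, and the per-configuration predictions are taken as given inputs rather than produced by learned predictors.

```python
from dataclasses import dataclass
from typing import Mapping, Sequence


@dataclass(frozen=True)
class PipelineConfig:
    """One candidate pipeline (hypothetical fields for illustration)."""
    name: str
    cost: float  # per-query cost in arbitrary units


def select_config(
    configs: Sequence[PipelineConfig],
    predicted_accuracy: Mapping[str, float],
    cost_weight: float,
) -> PipelineConfig:
    """Pick the config maximizing predicted accuracy minus a linear cost penalty.

    predicted_accuracy maps config name -> estimated probability of a correct
    answer for the current query. cost_weight sets the cost-quality target:
    0.0 ignores cost, larger values favor cheaper configurations.
    """
    if not configs:
        raise ValueError("at least one configuration is required")
    return max(
        configs,
        key=lambda c: predicted_accuracy[c.name] - cost_weight * c.cost,
    )
```

For example, with a cheap configuration (cost 1.0, predicted accuracy 0.70) and an expensive one (cost 10.0, predicted accuracy 0.80), `cost_weight=0.02` scores them 0.68 and 0.60 and selects the cheap one, while `cost_weight=0.0` selects the expensive one. Because the predictions are per query, the chosen configuration can differ from one query to the next.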