Entity · benchmark

DROP

benchmark · active · drop-79495e58 · 3 events · first seen May 19, 2026

Aliases: DROP

Co-occurring entities

MuSiQue · HotpotQA · Can LLMs Judge Better Than They Generate? Evaluating Task Asymmetry, Mechanistic Interpretability and Transferability for In-Context QA · LoRA · SQuAD · operadic consistency · Chain-of-Thought · Self-Consistency · StrategyQA · Semantic Entropy · Operadic consistency: a label-free signal for compositional reasoning failures in LLMs · Open LLM Leaderboard · Hugging Face

More like this (12)

PDrop · R-Drop · Block · DDPO · R-Drop consistency regularization · DPO · data exfiltration · Grab · LayerSkip · DiSP · DFlash · DRY principle

Recent events (3)

6 · arXiv · cs.CL · Jun 29, 2026

LLMs judge worse than they generate: empirical challenge to self-evaluation pipeline assumptions

A new arXiv preprint tests the implicit assumption that LLM evaluation is easier than generation, using a controlled in-context QA setup across four benchmarks (SQuAD 2.0, DROP, HotpotQA, MuSiQue) and two models. Results show generation accuracy exceeds self-evaluation accuracy on three of four benchmarks, with attention analysis revealing that evaluation attends to context 3–5x less than generation does. LoRA fine-tuning experiments confirm the asymmetry is not a training artifact, with cross-task interference observed in both directions. The findings directly challenge assumptions underlying LLM-as-a-Judge and self-evaluation pipelines widely used in RLHF and agentic systems.

Evaluation and Benchmarking · Alignment and RLHF · MuSiQue · Can LLMs Judge Better Than They Generate? Evaluating Task Asymmetry, Mechanistic Interpretability and Transferability for In-Context QA · LoRA · +3 more
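The generation-versus-evaluation comparison in the summary above can be sketched in a few lines. This is a minimal illustration, not the preprint's protocol: the `Record` type, its field names, and the definition of evaluation accuracy as "verdict matches true correctness" are all assumptions made here.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """One QA item (hypothetical schema, not taken from the preprint)."""
    answer_correct: bool  # did the generated answer match the gold answer?
    judged_correct: bool  # did the same model, acting as judge, accept it?


def generation_accuracy(records):
    """Fraction of items the model answered correctly."""
    return sum(r.answer_correct for r in records) / len(records)


def evaluation_accuracy(records):
    """Fraction of items where the judge's verdict matches true correctness."""
    return sum(r.judged_correct == r.answer_correct for r in records) / len(records)


def asymmetry_gap(records):
    """Positive when the model generates better than it judges."""
    return generation_accuracy(records) - evaluation_accuracy(records)
```

Under these assumptions, a positive `asymmetry_gap` on a benchmark corresponds to the pattern the summary reports for three of the four benchmarks.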

6 · arXiv · cs.LG · Jun 12, 2026

Operadic consistency: a label-free signal for detecting compositional reasoning failures in LLMs

Researchers introduce operadic consistency (OC), a label-free inference-time signal that checks whether an LLM's direct answer to a compositional query agrees with the answer obtained by composing its own stated decomposition of that query. Evaluated across 12 instruction-tuned LLMs (4B–671B parameters) on four multi-hop QA datasets, OC correlates with accuracy at Pearson r ∈ [0.86, 0.94] on every dataset, and it is more robust across datasets than self-consistency, semantic entropy, and P(True). At the per-question level, OC carries information beyond these baselines and improves selective prediction (AUARC lifts of +0.086 to +0.096, AUROC lifts of +0.092 to +0.164) at equal sampling cost. The results extend to frontier thinking models when chain-of-thought decompositions are used.

Evaluation and Benchmarking · AI Safety Research · operadic consistency · Chain-of-Thought · Self-Consistency · MuSiQue · +6 more
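The consistency check described in the summary above can be illustrated with a toy sketch. Everything here is an assumption for illustration: the `ask` callable stands in for a model, sub-questions are chained through a hypothetical `{prev}` placeholder, and agreement is taken to be normalized exact match, which may differ from the paper's criterion.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, strip punctuation and English articles, collapse spaces."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def composed_answer(ask, sub_questions):
    """Answer sub-questions in order, feeding each answer into the next."""
    prev = ""
    for question in sub_questions:
        prev = ask(question.replace("{prev}", prev))
    return prev


def operadic_consistent(ask, question, sub_questions) -> bool:
    """True when the direct answer agrees with the composed answer."""
    direct = ask(question)
    composed = composed_answer(ask, sub_questions)
    return normalize(direct) == normalize(composed)


# Toy stand-in for a model: a fixed lookup table of question -> answer.
toy_model = {
    "What is the capital of the country where the Eiffel Tower is?": "Paris",
    "Which country is the Eiffel Tower in?": "France",
    "What is the capital of France?": "Paris.",
}
print(operadic_consistent(
    toy_model.__getitem__,
    "What is the capital of the country where the Eiffel Tower is?",
    ["Which country is the Eiffel Tower in?", "What is the capital of {prev}?"],
))
```

No gold label is consulted anywhere in the check, which is what makes the signal label-free: only the model's own direct and composed answers are compared.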

4 · Hugging Face Blog · May 19, 2026

Open LLM Leaderboard: DROP Deep Dive

Hugging Face published a detailed analysis of the DROP benchmark as used in the Open LLM Leaderboard, examining how models are evaluated on this reading comprehension and numerical reasoning task. The post investigates scoring methodology, potential issues with evaluation consistency, and what DROP results actually reveal about model capabilities. This is part of ongoing efforts to improve transparency and reliability of the Open LLM Leaderboard.

Evaluation and Benchmarking · DROP · Open LLM Leaderboard · Hugging Face
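For readers unfamiliar with how DROP-style answers are typically scored, the sketch below shows exact match and a bag-of-tokens F1. It is a simplified illustration, not the official DROP scorer and not the leaderboard's harness; the function names and the normalization steps are assumptions made here.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, strip punctuation and English articles, collapse spaces.

    Caveat: stripping punctuation also turns "3.5" into "35", so numeric
    answers need more careful handling than this sketch provides.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> bool:
    """True when the normalized strings are identical."""
    return normalize(prediction) == normalize(gold)


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Small normalization choices like these can change a score noticeably, so any comparison of DROP numbers across harnesses should confirm that the same scoring rules were applied.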