Entity · benchmark

ParaEval

benchmark · active · paraeval-cb886c16 · 1 event · first seen Jun 10, 2026

Aliases: ParaEval

More like this (12)

ValueEval · L-Eval · DeepEval · UniEval · CharacterEval · ProActEval · SummEval · Every Eval Ever · HypoEval · T-Eval · G-Eval · TweetEval

Recent events (1)

6 · arXiv · cs.CL · Jun 10, 2026

ParaEval framework reduces MCQA benchmark sensitivity to answer phrasing

A new arXiv preprint identifies a systematic flaw in multiple-choice QA (MCQA) benchmarks: log-likelihood scoring conflates familiarity with an answer's surface form with actual capability, producing false performance gaps of more than 2 points between models trained on identical knowledge. The authors propose ParaEval, which queries a model with multiple paraphrases of each answer option and scores each option by its most favorable phrasing, reducing the false gap to below 1 point. The effect is confirmed on frontier 70B and 120B open-source models, which the summary takes as a sign of widespread benchmark inflation in standard MCQA evaluations.
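The scoring idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `loglik` callable, and the toy scores are all assumptions made for illustration, and the preprint may normalize or aggregate scores differently. The sketch only shows the difference between scoring one fixed phrasing per option and taking the maximum over several paraphrases.

```python
from typing import Callable, Dict, List

# Hypothetical scorer type: returns log p(answer_text | question) for a model.
LogLik = Callable[[str, str], float]


def single_form_predict(
    question: str, option_paraphrases: Dict[str, List[str]], loglik: LogLik
) -> str:
    """Baseline: score only the first (canonical) phrasing of each option."""
    scores = {
        label: loglik(question, phrasings[0])
        for label, phrasings in option_paraphrases.items()
    }
    return max(scores, key=scores.get)


def paraphrase_max_predict(
    question: str, option_paraphrases: Dict[str, List[str]], loglik: LogLik
) -> str:
    """Score each option by its most favorable paraphrase, then pick the best option."""
    scores = {
        label: max(loglik(question, p) for p in phrasings)
        for label, phrasings in option_paraphrases.items()
    }
    return max(scores, key=scores.get)


# Toy data: made-up log-likelihoods standing in for a real model.
QUESTION = "What is the capital of Australia?"
OPTIONS = {
    "A": ["Canberra", "The city of Canberra"],
    "B": ["Sydney", "Sydney, New South Wales"],
}
TOY_SCORES = {
    "Canberra": -5.0,
    "The city of Canberra": -2.0,
    "Sydney": -3.0,
    "Sydney, New South Wales": -6.0,
}


def toy_loglik(question: str, answer: str) -> float:
    # Assumption: a lookup table replaces an actual model call.
    return TOY_SCORES[answer]


if __name__ == "__main__":
    print("single form:", single_form_predict(QUESTION, OPTIONS, toy_loglik))
    print("paraphrase max:", paraphrase_max_predict(QUESTION, OPTIONS, toy_loglik))
```

In this toy setup the hypothetical model knows the right answer but assigns low likelihood to its canonical wording, so the single-phrasing baseline picks the wrong option while the max-over-paraphrases rule recovers the correct one. With a real model, `toy_loglik` would be replaced by a call that sums token log-probabilities of the answer text given the question.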

Evaluation and Benchmarking · ParaEval