Almanac
benchmark

ParaEval

benchmark · active · provisional · paraeval-cb886c16 · 1 event · first seen 7d ago

Aliases: ParaEval

Recent events (1)

6 · arXiv · cs.CL · 7d ago · source ↗

ParaEval framework reduces MCQA benchmark sensitivity to answer phrasing

A new arXiv preprint identifies a systematic flaw in multiple-choice QA (MCQA) benchmarks: log-likelihood scoring conflates familiarity with an answer's surface form with actual capability, producing false performance gaps of more than 2 points between models trained on identical knowledge. The authors propose ParaEval, which queries a model with multiple paraphrases of each answer option and scores each option by its most favorable phrasing, reducing the false gap to below 1 point. The effect is confirmed on frontier 70B and 120B open-source models, suggesting widespread benchmark inflation in standard MCQA evaluations.
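
The scoring idea in the summary can be illustrated with a minimal sketch. This is not the preprint's implementation: the function names, the toy log-likelihood table, and the example question are all hypothetical, and details the summary does not cover (how paraphrases are generated, whether scores are length-normalized) are left out. It only contrasts picking an answer from one canonical phrasing per option with picking it from the best-scoring paraphrase per option.

```python
from typing import Callable, Dict, Sequence

# Hypothetical interface: returns log P(answer_text | question) under some model.
LogLikelihood = Callable[[str, str], float]


def argmax(scores: Sequence[float]) -> int:
    """Index of the highest score (first one wins on ties)."""
    return max(range(len(scores)), key=scores.__getitem__)


def standard_choice(question: str, options: Sequence[str],
                    loglik: LogLikelihood) -> int:
    """Baseline: each option is scored on its single canonical phrasing."""
    return argmax([loglik(question, opt) for opt in options])


def paraphrase_max_choice(question: str,
                          paraphrases: Sequence[Sequence[str]],
                          loglik: LogLikelihood) -> int:
    """Paraphrase-based scoring: each option is scored on its most
    favorable phrasing, i.e. the max log-likelihood over its paraphrases."""
    return argmax([max(loglik(question, p) for p in group)
                   for group in paraphrases])


# Toy stand-in for a model (assumed values, not real measurements):
# it knows the answer but finds one phrasing of it unfamiliar.
TOY_SCORES: Dict[str, float] = {
    "The city of Paris": -9.0,
    "Paris": -2.0,
    "Lyon": -4.0,
    "The city of Lyon": -10.0,
}


def toy_loglik(question: str, answer: str) -> float:
    return TOY_SCORES[answer]


QUESTION = "What is the capital of France?"
CANONICAL = ["The city of Paris", "Lyon"]
PARAPHRASES = [["The city of Paris", "Paris"], ["Lyon", "The city of Lyon"]]

baseline_pick = standard_choice(QUESTION, CANONICAL, toy_loglik)
paraphrase_pick = paraphrase_max_choice(QUESTION, PARAPHRASES, toy_loglik)
```

In this toy setup the baseline picks the wrong option only because the correct answer's canonical phrasing scores poorly, while the paraphrase-based score recovers the correct option. That is the kind of surface-form effect the summary describes.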