6 · arXiv cs.CL (Computation and Language) · 11d ago

ParaEval framework reduces MCQA benchmark sensitivity to answer phrasing

A new arXiv preprint identifies a systematic flaw in multiple-choice QA (MCQA) benchmarks: log-likelihood scoring conflates familiarity with an answer's surface form with actual capability, producing false performance gaps of more than 2 points between models trained on identical knowledge. The authors propose ParaEval, which queries models with multiple paraphrases of each answer option and scores each option by its most favorable phrasing, reducing the false gap to below 1 point. The effect is confirmed on frontier 70B and 120B open-source models, suggesting that standard MCQA evaluations may widely misstate differences between models.

Evaluation and Benchmarking ParaEval
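
The scoring change described in the summary above can be illustrated with a small sketch. It assumes, since the summary does not give the exact formulation, that each option is scored by the maximum log-likelihood over its paraphrases and that the highest-scoring option is chosen. The names `pick_single`, `pick_best_paraphrase`, `toy_loglik`, and the score table are hypothetical stand-ins rather than the authors' code, and the numbers are invented for illustration.

```python
from typing import Callable, Dict, List

# Hypothetical scorer type: (question, answer_text) -> log-likelihood.
LogLik = Callable[[str, str], float]


def argmax(values: List[float]) -> int:
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def pick_single(loglik: LogLik, question: str, options: List[List[str]]) -> int:
    """Baseline: score each option by its first (canonical) phrasing only."""
    return argmax([loglik(question, phrasings[0]) for phrasings in options])


def pick_best_paraphrase(loglik: LogLik, question: str, options: List[List[str]]) -> int:
    """Assumed ParaEval-style rule: score each option by its best paraphrase."""
    return argmax([max(loglik(question, p) for p in phrasings) for phrasings in options])


# Invented log-likelihoods standing in for a real model (assumption).
TOY_SCORES: Dict[str, float] = {
    "one hundred degrees Celsius": -9.0,  # correct, but unfamiliar wording
    "100 °C": -4.0,                       # correct, familiar wording
    "ninety degrees Celsius": -6.0,
    "90 °C": -7.0,
}


def toy_loglik(question: str, answer: str) -> float:
    """Toy scorer: ignores the question and looks the answer up in the table."""
    return TOY_SCORES[answer]


question = "At what temperature does water boil at sea level?"
options = [
    ["one hundred degrees Celsius", "100 °C"],  # option 0 (correct)
    ["ninety degrees Celsius", "90 °C"],        # option 1
]

single_choice = pick_single(toy_loglik, question, options)
para_choice = pick_best_paraphrase(toy_loglik, question, options)
print(single_choice, para_choice)
```

In this toy case the single-phrasing baseline picks the wrong option only because the canonical wording of the correct answer is unfamiliar, while the max-over-paraphrases rule recovers it. Length normalization and paraphrase generation are left out; the summary does not say how the paper handles them.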

Related guides (1)

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder


Related events (8)

5 · arXiv · cs.LG · 2d ago

Act2Answer: Benchmarking commonsense and world knowledge retention in Vision-Language-Action models

Researchers introduce Act2Answer, a protocol for evaluating how much commonsense and factual knowledge VLA models retain after fine-tuning on robotics data. The approach converts knowledge benchmark questions into tabletop object-placement episodes, yielding action-grounded success rates that reduce confounds from low-level control failures. A large-scale study of 7 VLA models and 9 VLM baselines finds that VLAs retain solid performance on simple concepts but show larger gaps on richer semantic categories compared to their source VLMs, and that VQA co-training is associated with better knowledge retention.

Evaluation and Benchmarking Multimodal Progress Does VLA Even Know the Basics? Measuring Commonsense and World Knowledge Retention in Vision-Language-Action Models Act2Answer

5 · arXiv · cs.CL · 2d ago

RECOM benchmark reveals validity-discrimination tradeoff in automatic metrics for open-ended QA

Researchers introduce RECOM, a contamination-free evaluation dataset of 15,000 r/AskReddit questions paired with authentic community replies postdating all evaluated models' training cutoffs. Testing five open-source 7–10B LLMs, the paper finds that no standard automatic metric (cosine similarity, BERTScore, LLM judges) simultaneously achieves both validity (distinguishing real from random answers) and discriminative power (ranking models against each other). Cosine similarity is valid but cannot rank models; BERTScore's apparent ranking collapses when response length is controlled. The authors argue this tradeoff is a structural property of metric representation design and recommend reporting metrics on both axes with an explicit random-baseline floor.

Evaluation and Benchmarking BERTScore RECOM r/AskReddit

6 · OpenAI Blog · 1mo ago

TruthfulQA: Measuring how models mimic human falsehoods

OpenAI introduced TruthfulQA, a benchmark designed to measure whether language models generate truthful answers or mimic common human misconceptions and falsehoods. The benchmark tests models on questions where humans frequently give wrong answers due to misconceptions, conspiracy theories, or false beliefs. Results showed that larger models were not necessarily more truthful, and in some cases performed worse, highlighting a key alignment challenge.

Evaluation and Benchmarking AI Safety Research TruthfulQA Stephanie Lin Jacob Hilton +3 more

5 · arXiv · cs.CL · 9d ago

New Polish medical exam benchmark reveals MCQA overestimates LLM clinical competence

Researchers introduce an expanded Polish medical exam benchmark with over 15,000 new questions, two new domains, and four structural modifications designed to reduce multiple-choice artifacts and better test reasoning. Evaluating 21 LLMs under the harder setup, the best-performing model (Qwen3.5-122B) drops 28–31 percentage points compared to standard MCQA scores. The findings suggest standard MCQA benchmarks do not reliably reflect true medical competence, even when data contamination is low. The benchmark is publicly released to support further research.

Evaluation and Benchmarking Qwen3.5-122B Reassessing High-Performing LLMs on Polish Medical Exams: True Competence or Bias-Driven Performance?

6 · arXiv · cs.CL · 29d ago

Instruction Sensitivity Undermines Embedding Model Evaluation: Single-Prompt Benchmarks Are Insufficient

This paper presents an empirical study of prompt sensitivity in instruction-tuned embedding models, covering 6 models, 11 datasets, and 15 task-specific prompts per dataset (990 total evaluations). The authors demonstrate that single-prompt evaluation systematically misrepresents true model performance, with default prompts both understating and overstating capabilities depending on phrasing. A key finding is that leaderboard rankings are not robust: by selecting prompts favorably, any model in the study can be promoted to first place. The authors recommend that benchmarks incorporate prompt robustness metrics, either through multi-prompt evaluation or by reporting sensitivity alongside point estimates.

Evaluation and Benchmarking Agent and Tool Ecosystem MTEB embedding model leaderboard prompt sensitivity +1 more
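
The recommendation above, reporting sensitivity alongside point estimates, can be sketched as follows. This is an illustration under stated assumptions, not the paper's protocol: the model names and scores are invented, `summarize` and `can_reach_first` are hypothetical helpers, and "selecting prompts favorably" is read here as giving one model its best prompt while every other model gets its worst.

```python
from statistics import mean, pstdev
from typing import Dict, List

# Invented per-prompt scores for three hypothetical models (assumption).
scores: Dict[str, List[float]] = {
    "model_a": [0.62, 0.70, 0.66],
    "model_b": [0.64, 0.68, 0.65],
    "model_c": [0.60, 0.69, 0.63],
}


def summarize(per_prompt: Dict[str, List[float]]) -> Dict[str, Dict[str, float]]:
    """Report a point estimate (mean) together with spread across prompts."""
    return {
        model: {
            "mean": mean(vals),
            "std": pstdev(vals),  # population std over the prompts evaluated
            "min": min(vals),
            "max": max(vals),
        }
        for model, vals in per_prompt.items()
    }


def can_reach_first(per_prompt: Dict[str, List[float]], model: str) -> bool:
    """True if the model's best prompt beats every other model's worst prompt."""
    best = max(per_prompt[model])
    return all(
        best > min(vals) for other, vals in per_prompt.items() if other != model
    )


summary = summarize(scores)
promotable = {model: can_reach_first(scores, model) for model in scores}

for model, stats in summary.items():
    print(model, {k: round(v, 3) for k, v in stats.items()}, promotable[model])
```

With these invented numbers every model can be promoted to first place, which is the kind of ranking fragility the paper reports; a benchmark that publishes only the mean column would hide it.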

5 · arXiv · cs.CL · 12d ago

Systematic evaluation of LLM prompt sensitivity in healthcare settings reveals safety risks

Researchers conduct a sensitivity analysis of both general-purpose and medical-specific LLMs using the MedMCQA benchmark, testing robustness to lexical and syntactic prompt perturbations. The study finds that even minor phrasing changes can alter clinical advice, and adversarial prompts can produce dangerous outputs such as incorrect dosages or omitted critical findings. Both general-purpose models (GPT-3.5, Llama 3) and domain-specific models (ClinicalBERT, BioLlama3, BioBERT) exhibit this fragility, with syntactic reordering and misleading contextual cues proving more destabilizing than simple paraphrasing.

Evaluation and Benchmarking AI Safety Research BioLlama3 BioBERT MedMCQA +3 more

5 · arXiv · cs.CL · 1mo ago

ACL-Verbatim: Hallucination-Free Extractive QA System for Research Papers

The paper introduces ACL-Verbatim, an extractive question answering system built on VerbatimRAG that maps user queries directly to verbatim text spans in ACL Anthology papers, eliminating hallucination by design. The authors contribute a new ground-truth benchmark dataset created via human NLP-researcher annotation over synthetic queries generated using a ScIRGen-based pipeline. A 150M-parameter ModernBERT token classifier trained on silver supervision achieves the best word-level F1 of 53.6, outperforming the strongest LLM-based extractor at 48.7. The work demonstrates that smaller extractive models can outperform large generative LLMs on precision-critical retrieval tasks.

Evaluation and Benchmarking AI Safety Research ModernBERT ScIRGen ACL Anthology +3 more

5 · arXiv · cs.CL · 12d ago

Parameterized framework for measuring sycophantic praise in language models

A new arXiv paper argues that sycophantic praise and flattery constitute a distinct alignment problem separate from the more commonly studied excessive agreement. The authors introduce a parameterized framework that measures whether praise is excessive relative to contribution quality and expected user ability, outperforming generic LLM judges on human annotation agreement. Key finding: sycophantic praise occurs far more frequently in social and interpretive domains than in objective reasoning settings, positioning praise calibration as a distinct alignment challenge.

Evaluation and Benchmarking Alignment and RLHF Sycophantic Praise: Evaluating Excessive Praise in Language Models