New Polish medical exam benchmark reveals MCQA overestimates LLM clinical competence
Researchers introduce an expanded Polish medical exam benchmark with over 15,000 new questions, two new domains, and four structural modifications designed to reduce multiple-choice artifacts and better test reasoning. Evaluating 21 LLMs under the harder setup, the best-performing model (Qwen3.5-122B) drops 28-31 percentage points compared to standard MCQA scores. The findings suggest standard MCQA benchmarks do not reliably reflect true medical competence, even when data contamination is low. The benchmark is publicly released to support further research.
Related events (8)
MedMisBench: LLMs show fragile epistemic resilience under misleading medical context
Researchers introduce MedMisBench, a benchmark of 10,932 medical questions paired with 48,889 misleading context injections, to measure whether LLMs maintain correct medical judgment under adversarial pressure. Across 11 model configurations, mean accuracy drops from 71.1% to 38.0% when misleading context is injected, with authority-framed falsehoods achieving 69.5% attack success. A 14-member international clinical panel flagged serious potential harm in 38.2% of reviewed cases. The work argues that existing medical benchmarks measure knowledge but not robustness to manipulation, exposing a structural gap in LLM safety evaluation for healthcare.
Systematic evaluation of LLM prompt sensitivity in healthcare settings reveals safety risks
Researchers conduct a sensitivity analysis of both general-purpose and medical-specific LLMs using the MedMCQA benchmark, testing robustness to lexical and syntactic prompt perturbations. The study finds that even minor phrasing changes can alter clinical advice, and adversarial prompts can produce dangerous outputs such as incorrect dosages or omitted critical findings. Both general-purpose models (GPT-3.5, Llama 3) and domain-specific models (ClinicalBERT, BioLlama3, BioBERT) exhibit this fragility, with syntactic reordering and misleading contextual cues proving more destabilizing than simple paraphrasing.
Empirical study of LLM medical domain adaptation trade-offs in French QA
Researchers present a systematic comparison of continual pretraining (CPT), supervised fine-tuning (SFT), and their combination for adapting LLMs to French medical question answering. The study spans three model families, multiple sizes, and three initialization types, evaluating both multiple-choice and open-ended QA formats. Key findings: CPT+SFT yields the best MCQA scores but gains over SFT alone are often not statistically significant, making SFT a cost-effective default; for open-ended QA, CPT improves overlap metrics while SFT degrades generation quality. Cross-lingual transfer from French adaptation to English benchmarks is also demonstrated.
Benchmarking study finds LLMs fail at counterintuitive probability problems despite strong standard performance
A new arXiv paper evaluates 8 state-of-the-art LLMs on discrete probability problems using two datasets: standard exercises (average accuracy 0.96) and counterintuitive exercises designed to trigger heuristic reasoning (average accuracy 0.59). The authors document token bias that causes performance drops of more than 20% when canonical problem formulations are disguised, and up to 34% degradation when misleading suggestions are embedded in prompts. The findings argue that current LLMs are not genuine probabilistic reasoners despite their success on advanced math benchmarks.
ParaEval framework reduces MCQA benchmark sensitivity to answer phrasing
A new arXiv preprint identifies a systematic flaw in multiple-choice QA benchmarks: log-likelihood scoring conflates surface-form familiarity with actual capability, producing false performance gaps exceeding 2 points between models trained on identical knowledge. The authors propose ParaEval, which queries models with multiple paraphrases per answer option and scores on the most favorable phrasing, reducing the false gap to below 1 point. The effect is confirmed on frontier 70B and 120B open-source models, suggesting widespread benchmark inflation in standard MCQA evaluations.
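The scoring idea summarized above (several paraphrases per answer option, each option scored on its most favorable phrasing) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the `loglik` callable, and the toy scores are hypothetical stand-ins, and details the summary does not state, such as length normalization or how paraphrases are generated, are left out.

```python
from typing import Callable, Dict, List

# Hypothetical scorer type: (question, answer_text) -> log-likelihood.
LogLik = Callable[[str, str], float]


def best_score_per_option(
    question: str, options: Dict[str, List[str]], loglik: LogLik
) -> Dict[str, float]:
    """For each option label, keep the highest score over its paraphrases."""
    return {
        label: max(loglik(question, phrasing) for phrasing in phrasings)
        for label, phrasings in options.items()
    }


def predict_with_paraphrases(
    question: str, options: Dict[str, List[str]], loglik: LogLik
) -> str:
    """Pick the option whose most favorable phrasing scores highest."""
    best = best_score_per_option(question, options, loglik)
    return max(best, key=best.get)


def predict_single_phrasing(
    question: str, options: Dict[str, List[str]], loglik: LogLik
) -> str:
    """Baseline: score only the first (original) phrasing of each option."""
    scores = {label: loglik(question, p[0]) for label, p in options.items()}
    return max(scores, key=scores.get)


# Toy stand-in for a model: fixed, made-up log-likelihoods per answer string.
TOY_SCORES = {
    "myocardial infarction": -9.0,
    "heart attack": -2.0,
    "cerebrovascular accident": -4.0,
    "stroke": -3.0,
}


def toy_loglik(question: str, answer: str) -> float:
    return TOY_SCORES[answer]


QUESTION = "Which condition is caused by a blocked coronary artery?"
OPTIONS = {
    "A": ["myocardial infarction", "heart attack"],
    "B": ["cerebrovascular accident", "stroke"],
}

single = predict_single_phrasing(QUESTION, OPTIONS, toy_loglik)
paraphrased = predict_with_paraphrases(QUESTION, OPTIONS, toy_loglik)
```

In this toy setup the baseline picks option B because the stand-in model finds the original wording of A unfamiliar, while taking the maximum over paraphrases recovers option A. That is the kind of surface-form effect the preprint is described as targeting.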
Paper challenges LLM expert-level claims by measuring variance and error magnitude in code-based data analysis tasks
A new arXiv paper argues that standard LLM benchmarks overstate model capabilities by focusing on average performance on training-data-adjacent tasks while ignoring response variance and error magnitude. The authors introduce a novel benchmark requiring frontier LLMs to write code for data analysis tasks, comparing results against human expert submissions. Human experts outperformed the frontier LLM on average across multiple metrics and showed lower performance variability. The findings challenge the prevailing narrative that LLMs perform at human-expert level on knowledge economy tasks.
BenCzechMark: A Benchmark for Evaluating LLM Czech Language Understanding
BenCzechMark is a new evaluation benchmark designed to assess large language model performance on Czech language tasks. The benchmark addresses the gap in non-English language evaluation, providing a structured way to measure LLM capabilities in Czech across multiple task types. Published on Hugging Face, it contributes to the growing ecosystem of multilingual and language-specific benchmarks.
Clinically grounded privacy evaluation framework reveals high memorization risk in medical LMs
Researchers introduce a tiered adversarial framework for evaluating privacy leakage in medical language models, moving beyond simple training-text recovery to realistic clinical threat models. Applied to an LM pretrained on 378k clinical notes, the framework finds that routine encounter metadata (name, DOB, provider, visit date) elicits high verbatim memorization and sensitive-diagnosis recovery (AUROC 0.91 for abortion, 0.81 for HIV). The study also finds that exact-match memorization overstates disclosure risk because 36% of memorized tokens reflect templated documentation. The work provides a practical contextual privacy evaluation methodology for medical LMs trained on longitudinal patient data.
