
Reassessing High-Performing LLMs on Polish Medical Exams: True Competence or Bias-Driven Performance?




Recent events (1)

arXiv · cs.CL · 6d ago

New Polish medical exam benchmark reveals MCQA overestimates LLM clinical competence

Researchers introduce an expanded Polish medical exam benchmark with over 15,000 new questions, two new domains, and four structural modifications designed to reduce multiple-choice artifacts and better test reasoning. When 21 LLMs are evaluated under the harder setup, the best-performing model (Qwen3.5-122B) drops 28 to 31 percentage points relative to its standard multiple-choice question answering (MCQA) score. The findings suggest that standard MCQA benchmarks do not reliably reflect true medical competence, even when data contamination is low. The benchmark is publicly released to support further research.