Reassessing High-Performing LLMs on Polish Medical Exams: True Competence or Bias-Driven Performance?
More like this (12)
Measuring Epistemic Resilience of LLMs Under Misleading Medical Context
Mind the Gap: Can Frontier LLMs Pass a Standardized Office Proficiency Exam?
LLM-judged explanation score
LLM-judge scoring
LLM evaluation
LLM-as-a-Judge
Open Medical-LLM Leaderboard
Clinically Grounded Privacy Evaluation of Medical LMs
Artificial Analysis LLM Performance Leaderboard
The Masked Advantage: Uncovering Local-Language Access to Cultural Knowledge in LLMs
Open Leaderboard for Japanese LLMs
LLM Debate Competition
Recent events (1)
New Polish medical exam benchmark reveals MCQA overestimates LLM clinical competence
Researchers introduce an expanded Polish medical exam benchmark with over 15,000 new questions, two new domains, and four structural modifications designed to reduce multiple-choice artifacts and better test reasoning. Across the 21 LLMs evaluated under the harder setup, the best-performing model (Qwen3.5-122B) scores 28-31 percentage points lower than it does on standard MCQA. The findings suggest that standard MCQA benchmarks do not reliably reflect true medical competence, even when data contamination is low. The benchmark is publicly released to support further research.