Entity · paper

Reassessing High-Performing LLMs on Polish Medical Exams: True Competence or Bias-Driven Performance?

paper · active · reassessing-high-performing-llms-on-polish-medical-exams-true-competence-or-bias-driven-performance--f8db2972 · 1 event · first seen Jun 11, 2026

Aliases: Reassessing High-Performing LLMs on Polish Medical Exams: True Competence or Bias-Driven Performance?

Co-occurring entities

Qwen3.5-122B

More like this (12)

Measuring Epistemic Resilience of LLMs Under Misleading Medical Context
Mind the Gap: Can Frontier LLMs Pass a Standardized Office Proficiency Exam?
Trade-offs in Medical LLM Adaptation: An Empirical Study in French QA
When the Judge Changes, So Does the Measurement: Auditing LLM-as-Judge Reliability
Can LLMs Judge Better Than They Generate? Evaluating Task Asymmetry, Mechanistic Interpretability and Transferability for In-Context QA
LLM-judged explanation score
Clinician-Level Agreement Without Clinical Caution: LLM Evaluator Limits in Medical AI Benchmarking
LLM-judge scoring
Moral Safety in LLMs: Exposing Performative Compliance with Puzzled Cues
What Do Safety-Aligned LLMs Learn From Mixed Compliance Demonstrations?
Rating the Pitch, Not the Product: User Evaluations of LLMs Reflect Expectations More Than Performance
Can LLMs Reliably Self-Report Adversarial Prefills, and How?

Recent events (1)

5 · arXiv · cs.CL · Jun 11, 2026

New Polish medical exam benchmark reveals MCQA overestimates LLM clinical competence

Researchers introduce an expanded Polish medical exam benchmark with over 15,000 new questions, two new domains, and four structural modifications designed to reduce multiple-choice artifacts and better test reasoning. Evaluating 21 LLMs under the harder setup, the best-performing model (Qwen3.5-122B) drops 28-31 percentage points compared to standard MCQA scores. The findings suggest standard MCQA benchmarks do not reliably reflect true medical competence, even when data contamination is low. The benchmark is publicly released to support further research.

Evaluation and Benchmarking · Qwen3.5-122B · Reassessing High-Performing LLMs on Polish Medical Exams: True Competence or Bias-Driven Performance?