Clinician-Level Agreement Without Clinical Caution: LLM Evaluator Limits in Medical AI Benchmarking

arXiv · cs.CL

MedQADE benchmark reveals LLM evaluators match physician agreement scores but lack clinical caution and show lineage bias

Researchers introduce MedQADE, a standardized open-response clinical benchmark for German comprising 3,800 items annotated by ten physicians and nine LLM evaluators. The top LLM evaluator (Gemini 3 Flash) came close to the physician inter-rater agreement ceiling (κ=0.694 vs. κ=0.709), but automated evaluators showed near-zero clinical metacognition: unlike physicians, they never abstained, regardless of item difficulty. The study also documents a systematic lineage-dependent scoring bias in which models rate their architectural siblings more favorably, independent of language.
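
To illustrate why a high κ and a lack of abstention can coexist, here is a minimal Python sketch that computes an agreement score and an abstention rate as two separate quantities. It rests on assumptions: the summary does not say which kappa variant the study uses, so this uses Cohen's kappa for two raters; the label set, the treatment of "abstain" as an ordinary label, the function names `cohen_kappa` and `abstention_rate`, and the toy ratings are all hypothetical and are not data from MedQADE.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who labelled the same items (assumed variant)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items where both raters gave the same label.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)


def abstention_rate(labels, abstain="abstain"):
    """Share of items on which the rater declined to give a verdict."""
    return sum(label == abstain for label in labels) / len(labels)


# Hypothetical toy ratings, NOT taken from the benchmark.
physician = ["correct", "incorrect", "abstain", "correct", "incorrect", "correct"]
llm_judge = ["correct", "incorrect", "correct", "correct", "correct", "correct"]

kappa = cohen_kappa(physician, llm_judge)
print(f"kappa={kappa:.3f}")
print(f"physician abstention={abstention_rate(physician):.3f}")
print(f"LLM abstention={abstention_rate(llm_judge):.3f}")
```

In this toy setup the agreement score summarizes how often verdicts coincide, while the abstention rate is tracked separately; an evaluator that never abstains can still score well on agreement, which is the gap the summary describes.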