benchmark
MedQADE
benchmark · active · provisional
medqade-3a579e94 · 1 event · first seen 31h ago
Aliases: MedQADE
Recent events (1)
MedQADE benchmark reveals LLM evaluators match physician agreement scores but lack clinical caution and show lineage bias
Researchers introduce MedQADE, a standardized open-response clinical benchmark for German comprising 3,800 items annotated by ten physicians and nine LLM evaluators. The top LLM evaluator (Gemini 3 Flash) reached agreement close to the physician inter-rater ceiling (κ=0.694 vs. κ=0.709), but automated evaluators showed near-zero clinical metacognition: unlike physicians, they never abstained, regardless of item difficulty. The study also documents a systematic lineage-dependent scoring bias: models rate their architectural siblings more favorably, independent of language.
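To make the κ figures above concrete, here is a minimal sketch of how a chance-corrected agreement score can be computed. The summary does not say which kappa variant the study uses; the two-rater Cohen's kappa below is an assumption chosen for illustration, and the function name `cohen_kappa` and the toy labels are hypothetical, not taken from MedQADE.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters labeling the same items (illustrative only)."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(ratings_a)

    # Observed agreement: fraction of items where both raters gave the same label.
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a = Counter(ratings_a)
    freq_b = Counter(ratings_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical toy data: one physician and one LLM evaluator grading 8 answers.
physician = ["correct", "correct", "incorrect", "correct",
             "incorrect", "incorrect", "correct", "incorrect"]
llm_judge = ["correct", "correct", "incorrect", "incorrect",
             "incorrect", "incorrect", "correct", "correct"]

kappa = cohen_kappa(physician, llm_judge)
print(f"kappa = {kappa:.3f}")
```

In this toy case the raters agree on 6 of 8 items (0.75 observed agreement) against 0.5 expected by chance, giving κ = 0.5. Note that a score like this only covers items both raters actually graded, so it cannot by itself capture the abstention behavior the study describes.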