benchmark

MedQADE

benchmark · active · provisional · medqade-3a579e94 · 1 event · first seen 31h ago

Aliases: MedQADE

Co-occurring entities

Clinician-Level Agreement Without Clinical Caution: LLM Evaluator Limits in Medical AI Benchmarking · Gemini 3 Flash

More like this (12)

MedQA MedMCQA QVal QVal QAGS StrategyQA QIMMA PQuAD GQA PubMedQA VQA-RAD QUBRIC

Recent events (1)

6 · arXiv · cs.CL · 31h ago

MedQADE benchmark reveals LLM evaluators match physician agreement scores but lack clinical caution and show lineage bias

Researchers introduce MedQADE, a standardized open-response clinical benchmark for German comprising 3,800 items annotated by ten physicians and nine LLM evaluators. The top LLM evaluator, Gemini 3 Flash, reached agreement close to the physician inter-rater ceiling (κ=0.694 vs. κ=0.709). Automated evaluators nonetheless showed near-zero clinical metacognition: unlike physicians, they never abstained, regardless of item difficulty. The study also documents systematic lineage-dependent scoring bias, in which models rate architectural siblings more favorably, independent of language.
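
For readers unfamiliar with κ, the following is a minimal sketch of how a chance-corrected agreement score can be computed for two raters. It assumes Cohen's kappa; the summary does not say which κ variant the study used, and with ten physicians a multi-rater statistic is plausible. The function name cohen_kappa and the toy labels are hypothetical and are not MedQADE data.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items (illustrative)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(rater_a)
    # Observed agreement: fraction of items where both raters give the same label.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: derived from each rater's marginal label frequencies.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Toy labels invented for illustration only (assumption, not study data).
physician = ["correct", "correct", "incorrect", "correct", "incorrect", "incorrect"]
llm_judge = ["correct", "incorrect", "incorrect", "correct", "incorrect", "correct"]
kappa = cohen_kappa(physician, llm_judge)
print(round(kappa, 3))
```

Note that a statistic of this kind only scores the labels that are given. A rater who never abstains can still reach a high κ, which is consistent with the summary's point that agreement alone does not capture clinical caution.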

Evaluation and Benchmarking · AI Safety Research · Clinician-Level Agreement Without Clinical Caution: LLM Evaluator Limits in Medical AI Benchmarking · MedQADE · Gemini 3 Flash