MedMCQA
medmcqa-c38ab92c·2 events·first seen 28d agoAliases: MedMCQA
Co-occurring entities
More like this (12)
Recent events (2)
Systematic evaluation of LLM prompt sensitivity in healthcare settings reveals safety risks
Researchers conduct a sensitivity analysis of both general-purpose and medical-specific LLMs using the MedMCQA benchmark, testing robustness to lexical and syntactic prompt perturbations. The study finds that even minor phrasing changes can alter clinical advice, and adversarial prompts can produce dangerous outputs such as incorrect dosages or omitted critical findings. Both general-purpose models (GPT-3.5, Llama 3) and domain-specific models (ClinicalBERT, BioLlama3, BioBERT) exhibit this fragility, with syntactic reordering and misleading contextual cues proving more destabilizing than simple paraphrasing.
The Open Medical-LLM Leaderboard: Benchmarking Large Language Models in Healthcare
Hugging Face has launched the Open Medical-LLM Leaderboard, a public benchmark for evaluating large language models on healthcare and medical tasks. The leaderboard aggregates performance across multiple medical question-answering datasets to enable standardized comparison of open-weight models in clinical and biomedical domains. This initiative aims to accelerate progress in medical AI by providing transparent, reproducible evaluation infrastructure.