6 · arXiv cs.CL (Computation and Language) · 11d ago

Clinically grounded privacy evaluation framework reveals high memorization risk in medical LMs

Researchers introduce a tiered adversarial framework for evaluating privacy leakage in medical language models, moving beyond simple training-text recovery to realistic clinical threat models. Applied to an LM pretrained on 378k clinical notes, the framework finds that routine encounter metadata (name, DOB, provider, visit date) is enough to elicit high verbatim memorization and to recover sensitive diagnoses (AUROC 0.91 for abortion, 0.81 for HIV). The study also finds that exact-match memorization overstates disclosure risk, because 36% of memorized tokens reflect templated documentation rather than patient-specific content. The work provides a practical, contextual privacy evaluation methodology for medical LMs trained on longitudinal patient data.
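To make the templated-documentation point concrete, here is a minimal sketch of how a raw exact-match rate and a template-adjusted rate can diverge. It is not the paper's metric: the function name `memorization_rates`, the position-aligned token comparison, and the per-token `is_template` flags are all assumptions made for illustration.

```python
from typing import Sequence


def memorization_rates(
    generated: Sequence[str],
    reference: Sequence[str],
    is_template: Sequence[bool],
) -> tuple[float, float]:
    """Return (raw, template_adjusted) exact-match rates over aligned tokens.

    raw: fraction of reference tokens reproduced verbatim at the same position.
    template_adjusted: the same fraction, but a match only counts when the
    token is NOT flagged as templated documentation (hypothetical flag).
    """
    if not (len(generated) == len(reference) == len(is_template)):
        raise ValueError("all three sequences must have the same length")
    if not reference:
        return 0.0, 0.0
    matches = [g == r for g, r in zip(generated, reference)]
    raw = sum(matches) / len(reference)
    adjusted = sum(m and not t for m, t in zip(matches, is_template)) / len(reference)
    return raw, adjusted


# Toy note: boilerplate tokens are flagged True, patient-specific ones False.
reference = "Patient denies fever . Diagnosis : HIV".split()
is_template = [True, True, False, True, True, True, False]
generated = list(reference)  # the model reproduces the note verbatim

raw, adjusted = memorization_rates(generated, reference, is_template)
print(raw, adjusted)
```

In this toy case every token is reproduced, so the raw rate is 1.0, but only 2 of the 7 tokens are patient-specific, so the adjusted rate is 2/7. The gap between the two numbers is the kind of overstatement the summary describes.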

Evaluation and Benchmarking · AI Safety Research · Enterprise Deployment Patterns · Clinically Grounded Privacy Evaluation of Medical LMs

Related guides (3)

AI Safety Research · Topic guide

AI Safety Research: From Lab Policies to Real-World Flashpoints

Enterprise Deployment Patterns · Topic guide

Enterprise Deployment Patterns: From AI Demo to Production Reality

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

5 · arXiv · cs.CL · 12d ago

Systematic evaluation of LLM prompt sensitivity in healthcare settings reveals safety risks

Researchers conduct a sensitivity analysis of both general-purpose and medical-specific LLMs using the MedMCQA benchmark, testing robustness to lexical and syntactic prompt perturbations. The study finds that even minor phrasing changes can alter clinical advice, and adversarial prompts can produce dangerous outputs such as incorrect dosages or omitted critical findings. Both general-purpose models (GPT-3.5, Llama 3) and domain-specific models (ClinicalBERT, BioLlama3, BioBERT) exhibit this fragility, with syntactic reordering and misleading contextual cues proving more destabilizing than simple paraphrasing.

Evaluation and Benchmarking · AI Safety Research · BioLlama3 · BioBERT · MedMCQA · +3 more

7 · arXiv · cs.CL · 9d ago

MedMisBench: LLMs show fragile epistemic resilience under misleading medical context

Researchers introduce MedMisBench, a benchmark of 10,932 medical questions paired with 48,889 misleading context injections, to measure whether LLMs maintain correct medical judgment under adversarial pressure. Across 11 model configurations, mean accuracy drops from 71.1% to 38.0% when misleading context is injected, with authority-framed falsehoods achieving 69.5% attack success. A 14-member international clinical panel flagged serious potential harm in 38.2% of reviewed cases. The work argues that existing medical benchmarks measure knowledge but not robustness to manipulation, exposing a structural gap in LLM safety evaluation for healthcare.
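The reported drop from 71.1% to 38.0% is 33.1 percentage points. As a rough sketch of how such figures can be computed from per-question results, the snippet below derives clean accuracy, attacked accuracy, and an attack success rate. The definition used here (the share of initially correct answers that become incorrect once misleading context is injected) is an assumption; the benchmark's own definition may differ, and `robustness_summary` is a hypothetical helper name.

```python
from typing import Sequence


def robustness_summary(
    clean: Sequence[bool], attacked: Sequence[bool]
) -> dict[str, float]:
    """Summarize per-question correctness before and after context injection.

    clean[i]    -- model answered question i correctly with no added context
    attacked[i] -- model answered question i correctly with misleading context
    """
    if len(clean) != len(attacked) or not clean:
        raise ValueError("inputs must be non-empty and the same length")
    n = len(clean)
    clean_acc = sum(clean) / n
    attacked_acc = sum(attacked) / n
    # Assumed definition: an attack succeeds when a correct answer flips to wrong.
    flipped = sum(c and not a for c, a in zip(clean, attacked))
    n_clean_correct = sum(clean)
    attack_success = flipped / n_clean_correct if n_clean_correct else 0.0
    return {
        "clean_accuracy": clean_acc,
        "attacked_accuracy": attacked_acc,
        "absolute_drop": clean_acc - attacked_acc,
        "attack_success_rate": attack_success,
    }


# Toy data: 4 questions, 3 correct without context, 1 still correct under attack.
summary = robustness_summary(
    clean=[True, True, True, False],
    attacked=[True, False, False, False],
)
print(summary)
```

Reporting the flip rate alongside the accuracy drop separates questions the model never knew from questions it knew but abandoned under pressure, which is the distinction the summary draws between knowledge and robustness.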

Evaluation and Benchmarking · AI Safety Research · Measuring Epistemic Resilience of LLMs Under Misleading Medical Context · MedMisBench

5 · arXiv · cs.CL · 15d ago

PropMe framework distinguishes memorization capability from propensity in LLMs

A new arXiv preprint introduces PropMe, a framework that separates whether LLMs can be forced to reproduce training data (capability) from whether they do so under ordinary use (propensity). The authors also release SimpleTrace, a lightweight pipeline using infini-gram to attribute model outputs to training corpora. Evaluating two open models on Common Pile and Dynaword, they find a consistent gap: adversarial prefix attacks elicit strong memorization, but propensity scores remain low in non-adversarial settings. The paper argues memorization audits should report both worst-case extractability and ordinary leakage propensity.

Evaluation and Benchmarking · AI Safety Research · PropMe · SimpleTrace · Dynaword · +4 more

5 · arXiv · cs.CL · 18d ago

Systematic Evaluation of LLM Safety Failures on Eating Disorder Queries with Clinician Feedback

This paper investigates how LLMs respond to queries from users with eating disorders, finding that specific linguistic cues in prompts increase the likelihood of unsafe model responses. Working with clinical ED experts, the authors systematically vary risk levels in user prompts to measure the extent to which LLMs uncritically adapt to potentially dangerous inputs. The study highlights a gap between perceived model safety and actual harm facilitation in sensitive health contexts.

Evaluation and Benchmarking · AI Safety Research · clinical ED experts · large language models · eating disorder safety evaluation

5 · arXiv · cs.CL · 2d ago

ClaMPAPP: Hybrid LLM-ML system uses language models as interfaces for pediatric appendicitis diagnosis

Researchers introduce ClaMPAPP, a hybrid clinical decision support system that uses an LLM solely for structured feature extraction from free-text clinical notes, then passes validated features to an XGBoost classifier for final diagnosis. Evaluated on two independent German pediatric appendicitis cohorts, ClaMPAPP outperformed end-to-end LLM baselines on diagnostic performance and showed greater robustness to narrative reordering. The work formalizes an 'LLM-as-interface, ML-as-predictor' design pattern that separates natural-language usability from predictive inference, offering a more auditable pathway for clinical AI.
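The 'LLM-as-interface, ML-as-predictor' pattern can be sketched as three separate stages: extraction, validation, prediction. The code below is only an illustration of that separation, not ClaMPAPP itself. Every name in it is hypothetical, the regex parser stands in for the LLM extraction step so the sketch runs offline, and the toy scoring rule stands in for the trained XGBoost classifier. The feature set and thresholds are invented and have no clinical meaning.

```python
import re
from dataclasses import dataclass


@dataclass
class Features:
    age_years: int
    wbc_k_per_ul: float  # white blood cell count (illustrative feature)
    rlq_pain: bool       # right-lower-quadrant pain mentioned in the note


def extract_features(note: str) -> dict:
    """Stand-in for the LLM step: turn free text into structured fields."""
    age = re.search(r"(\d+)-year-old", note)
    wbc = re.search(r"WBC\s+(\d+(?:\.\d+)?)", note)
    return {
        "age_years": int(age.group(1)) if age else None,
        "wbc_k_per_ul": float(wbc.group(1)) if wbc else None,
        "rlq_pain": "right lower quadrant" in note.lower(),
    }


def validate(raw: dict) -> Features:
    """Reject missing or implausible values before they reach the predictor."""
    age, wbc = raw["age_years"], raw["wbc_k_per_ul"]
    if age is None or not 0 <= age <= 18:
        raise ValueError(f"invalid age: {age!r}")
    if wbc is None or not 0 < wbc < 100:
        raise ValueError(f"invalid WBC: {wbc!r}")
    return Features(age_years=age, wbc_k_per_ul=wbc, rlq_pain=raw["rlq_pain"])


def predict(features: Features) -> float:
    """Stand-in for the trained classifier; a toy score, not a clinical rule."""
    score = 0.1
    if features.rlq_pain:
        score += 0.4
    if features.wbc_k_per_ul > 10:
        score += 0.3
    return score


note_a = "9-year-old with right lower quadrant pain, WBC 14.2"
note_b = "WBC 14.2. Right lower quadrant pain in a 9-year-old."
feats_a = validate(extract_features(note_a))
feats_b = validate(extract_features(note_b))
print(feats_a == feats_b, predict(feats_a))
```

Because the predictor only ever sees validated structured features, two notes that state the same facts in a different order map to the same input, and each stage can be inspected on its own. That is the auditability argument in miniature; how the real system achieves its robustness is not specified in the summary.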

Enterprise Deployment Patterns · Agent and Tool Ecosystem · XGBoost · ClaMPAPP

5 · arXiv · cs.AI · 23d ago

Reverse Probing: Supervised Token-level Uncertainty Quantification for LLMs in Clinical Text

The paper introduces Reverse Probing, a novel uncertainty quantification framework designed specifically for clinical text summarization that estimates token-level uncertainty from pre-existing labeled summaries rather than sampling new outputs. It extracts uncertainty signals from four categories of internal model activations, treating text as a probe into the model's internal state. Evaluated on two expert-annotated clinical datasets, it outperforms eight adapted baselines on all metrics, achieving up to 4× higher AUPRC while reducing inference time and compute. Feature analysis identifies delta energy and neighborhood context as the most consistent predictors of uncertainty across models.

Evaluation and Benchmarking · AI Safety Research · Reverse Probing · delta energy · AUPRC · +3 more

6 · arXiv · cs.CL · 12d ago

LLM-guided MAP-Elites evolution improves medical decision pipelines at inference time

Researchers propose using LLM-guided MAP-Elites evolutionary search as an inference-time alternative to fine-tuning for adapting LLMs to clinical workflows, formulating triage, consultation, and image classification as evolutionary searches over executable artifacts. Across three medical settings, evolved programs substantially outperform manually designed baselines: triage accuracy improves from 77.3% to 87.1% and emergency recall from 0.60 to 0.97, with gains also shown on MIMIC-ESI, iCRAFTMD, and PneumoniaMNIST. The approach works across Llama-3, Qwen-3.5, and Gemma-4 backbones and produces interpretable program-level mechanisms rather than superficial prompt changes.
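For readers unfamiliar with MAP-Elites, the core loop is short: keep an archive with one best candidate ("elite") per behavior cell, repeatedly pick an elite, mutate it, and keep the child if it beats the current occupant of its cell. The sketch below shows that generic loop on a toy numeric problem. It is not the paper's system: in the source the candidates are executable artifacts and the mutation step is LLM-guided, whereas here candidates are pairs of floats and mutation is Gaussian noise. All function names are illustrative.

```python
import random
from typing import Callable

Candidate = list[float]


def map_elites(
    evaluate: Callable[[Candidate], tuple[float, int]],
    mutate: Callable[[Candidate, random.Random], Candidate],
    init: Callable[[random.Random], Candidate],
    iterations: int,
    n_init: int = 20,
    seed: int = 0,
) -> dict[int, tuple[float, Candidate]]:
    """Return an archive mapping behavior cell -> (fitness, elite candidate)."""
    rng = random.Random(seed)
    archive: dict[int, tuple[float, Candidate]] = {}
    for step in range(iterations):
        if step < n_init or not archive:
            candidate = init(rng)  # bootstrap with random candidates
        else:
            # Pick a random elite and vary it (an LLM would do this in the source).
            _, parent = archive[rng.choice(sorted(archive))]
            candidate = mutate(parent, rng)
        fitness, cell = evaluate(candidate)
        # Elitism per cell: replace only if the cell is empty or the child is better.
        if cell not in archive or fitness > archive[cell][0]:
            archive[cell] = (fitness, candidate)
    return archive


def toy_init(rng: random.Random) -> Candidate:
    return [rng.random(), rng.random()]


def toy_mutate(parent: Candidate, rng: random.Random) -> Candidate:
    # Gaussian perturbation, clamped to the unit square.
    return [min(1.0, max(0.0, v + rng.gauss(0.0, 0.1))) for v in parent]


def toy_evaluate(candidate: Candidate) -> tuple[float, int]:
    x, y = candidate
    fitness = -((x - 0.5) ** 2 + (y - 0.5) ** 2)  # best at the center
    cell = int(min(x, 0.999) * 5)                 # behavior descriptor: 5 bins over x
    return fitness, cell


archive = map_elites(toy_evaluate, toy_mutate, toy_init, iterations=500)
for cell in sorted(archive):
    print(cell, round(archive[cell][0], 4))
```

The archive is what distinguishes this from plain hill climbing: it preserves a diverse set of good solutions, one per cell, rather than a single winner. In a clinical pipeline setting the cells would presumably correspond to meaningfully different strategies, though the descriptors used in the paper are not given in the summary.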

Evaluation and Benchmarking · Agent and Tool Ecosystem · Gemma-4 E4B-it · MIMIC-ESI · iCRAFTMD · +6 more

5 · Hugging Face Blog · 1mo ago

The Open Medical-LLM Leaderboard: Benchmarking Large Language Models in Healthcare

Hugging Face has launched the Open Medical-LLM Leaderboard, a public benchmark for evaluating large language models on healthcare and medical tasks. The leaderboard aggregates performance across multiple medical question-answering datasets to enable standardized comparison of open-weight models in clinical and biomedical domains. This initiative aims to accelerate progress in medical AI by providing transparent, reproducible evaluation infrastructure.

Evaluation and Benchmarking · Open Weights Progress · PubMedQA · Open Medical-LLM Leaderboard · MedMCQA · +3 more