5 · arXiv cs.CL (Computation and Language) · 2d ago

ClaMPAPP: Hybrid LLM-ML system uses language models as interfaces for pediatric appendicitis diagnosis

Researchers introduce ClaMPAPP, a hybrid clinical decision support system that uses an LLM solely for structured feature extraction from free-text clinical notes, then passes validated features to an XGBoost classifier for final diagnosis. Evaluated on two independent German pediatric appendicitis cohorts, ClaMPAPP outperformed end-to-end LLM baselines on diagnostic performance and showed greater robustness to narrative reordering. The work formalizes an 'LLM-as-interface, ML-as-predictor' design pattern that separates natural-language usability from predictive inference, offering a more auditable pathway for clinical AI.

Enterprise Deployment Patterns Agent and Tool Ecosystem XGBoost ClaMPAPP
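The 'LLM-as-interface, ML-as-predictor' split described in the summary can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the ClaMPAPP implementation: the feature names, the regex-based `extract_features` stub (standing in for the LLM call), the validation ranges, and the toy logistic score in `predict_probability` (standing in for the trained XGBoost classifier) are all hypothetical.

```python
from dataclasses import dataclass
import math
import re


@dataclass
class Features:
    """Structured features; field names are hypothetical, not the paper's schema."""
    age_years: float
    wbc_count: float        # white blood cell count, 10^9/L
    rlq_tenderness: bool    # right-lower-quadrant tenderness


def extract_features(note: str) -> dict:
    """Stand-in for the LLM extraction step (a real system would call a model here)."""
    age = re.search(r"(\d+(?:\.\d+)?)\s*-?\s*year", note)
    wbc = re.search(r"WBC\s*(\d+(?:\.\d+)?)", note)
    return {
        "age_years": float(age.group(1)) if age else None,
        "wbc_count": float(wbc.group(1)) if wbc else None,
        "rlq_tenderness": "rlq tenderness" in note.lower(),
    }


def validate(raw: dict) -> Features:
    """Reject missing or implausible values before they reach the predictor."""
    if raw["age_years"] is None or not 0 <= raw["age_years"] <= 18:
        raise ValueError("age missing or outside pediatric range")
    if raw["wbc_count"] is None or not 0 < raw["wbc_count"] < 100:
        raise ValueError("WBC missing or implausible")
    return Features(raw["age_years"], raw["wbc_count"], bool(raw["rlq_tenderness"]))


def predict_probability(f: Features) -> float:
    """Toy logistic score standing in for the trained XGBoost classifier.
    The weights are invented for illustration and have no clinical meaning."""
    z = -4.0 + 0.25 * f.wbc_count + 1.5 * f.rlq_tenderness
    return 1.0 / (1.0 + math.exp(-z))


def diagnose(note: str) -> float:
    # The language model only produces features; the tabular model decides.
    return predict_probability(validate(extract_features(note)))


# Two notes with the same findings in a different narrative order.
note_a = "9-year-old with RLQ tenderness. WBC 16.2."
note_b = "WBC 16.2. RLQ tenderness noted in a 9-year-old."
print(diagnose(note_a), diagnose(note_b))
```

Because the predictor only sees validated fields, two notes that yield the same features get the same score regardless of sentence order. That is one intuition for why such a design could be less sensitive to narrative reordering, though the sketch does not reproduce the reported evaluation.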

Related guides (2)

Enterprise Deployment Patterns · Topic guide

Enterprise Deployment Patterns: From AI Demo to Production Reality

Agent and Tool Ecosystem · Topic guide

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer

Related events (8)

6 · arXiv · cs.CL · 12d ago

LLM-guided MAP-Elites evolution improves medical decision pipelines at inference time

Researchers propose using LLM-guided MAP-Elites evolutionary search as an inference-time alternative to fine-tuning for adapting LLMs to clinical workflows, formulating triage, consultation, and image classification as evolutionary searches over executable artifacts. Across three medical settings, evolved programs substantially outperform manually designed baselines: triage accuracy improves from 77.3% to 87.1% and emergency recall from 0.60 to 0.97, with gains also shown on MIMIC-ESI, iCRAFTMD, and PneumoniaMNIST. The approach works across Llama-3, Qwen-3.5, and Gemma-4 backbones and produces interpretable program-level mechanisms rather than superficial prompt changes.

Evaluation and Benchmarking Agent and Tool Ecosystem Gemma-4 E4B-it MIMIC-ESI iCRAFTMD +6 more
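For readers unfamiliar with MAP-Elites, the core loop can be sketched as below. This is a generic, minimal version of the algorithm under stated assumptions, not the paper's system: candidates here are pairs of numbers rather than executable programs, `fitness` is a toy objective, `descriptor` is a hypothetical behavior grid, and `mutate` is a random perturbation standing in for an LLM-proposed edit.

```python
import random


def fitness(x):
    """Toy objective (higher is better); a real pipeline would score a
    candidate program on held-out cases."""
    return -((x[0] - 0.7) ** 2 + (x[1] - 0.3) ** 2)


def descriptor(x, bins=5):
    """Map a candidate to a grid cell; here the cell is just the binned parameters."""
    return tuple(min(int(v * bins), bins - 1) for v in x)


def mutate(x, rng, scale=0.1):
    """Random perturbation standing in for an LLM-proposed edit."""
    return tuple(min(1.0, max(0.0, v + rng.gauss(0.0, scale))) for v in x)


def map_elites(iterations=2000, n_init=20, seed=0):
    rng = random.Random(seed)
    archive = {}  # cell -> (fitness, candidate)

    def try_insert(candidate):
        cell, score = descriptor(candidate), fitness(candidate)
        # Keep a candidate only if its cell is empty or it beats the current elite.
        if cell not in archive or score > archive[cell][0]:
            archive[cell] = (score, candidate)

    # Seed the archive with random candidates, then evolve from stored elites.
    for _ in range(n_init):
        try_insert((rng.random(), rng.random()))
    for _ in range(iterations):
        _, parent = rng.choice(list(archive.values()))
        try_insert(mutate(parent, rng))
    return archive


archive = map_elites()
best_score, best_candidate = max(archive.values())
print(len(archive), best_score, best_candidate)
```

The archive keeps one elite per cell, so the search returns a set of diverse, locally best candidates instead of a single optimum.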

5 · arXiv · cs.CL · 12d ago

Systematic evaluation of LLM prompt sensitivity in healthcare settings reveals safety risks

Researchers conduct a sensitivity analysis of both general-purpose and medical-specific LLMs using the MedMCQA benchmark, testing robustness to lexical and syntactic prompt perturbations. The study finds that even minor phrasing changes can alter clinical advice, and adversarial prompts can produce dangerous outputs such as incorrect dosages or omitted critical findings. Both general-purpose models (GPT-3.5, Llama 3) and domain-specific models (ClinicalBERT, BioLlama3, BioBERT) exhibit this fragility, with syntactic reordering and misleading contextual cues proving more destabilizing than simple paraphrasing.

Evaluation and Benchmarking AI Safety Research BioLlama3 BioBERT MedMCQA +3 more

4 · arXiv · cs.CL · 1mo ago

Automated ICD Classification of Psychiatric Diagnoses Using NLP and LLMs

This study evaluates NLP and ML approaches for automating the mapping of free-text psychiatric descriptions to ICD diagnostic codes, using a dataset of 145,513 Spanish clinical records. Methods range from classical BoW/TF-IDF representations to transformer-based embeddings including e5_large, BioLORD, and Llama-3-8B. Fine-tuned e5_large achieved the best performance with a micro-F1 of 0.866, outperforming classical methods by capturing semantic nuance and medical terminology. The work highlights challenges of long-tail label distributions and ambiguity specific to psychiatric clinical language.

Enterprise Deployment Patterns Agent and Tool Ecosystem International Classification of Diseases (ICD) e5_large Bag of Words (BoW) +3 more

6 · arXiv · cs.CL · 11d ago

Clinically grounded privacy evaluation framework reveals high memorization risk in medical LMs

Researchers introduce a tiered adversarial framework for evaluating privacy leakage in medical language models, moving beyond simple training-text recovery to realistic clinical threat models. Applied to an LM pretrained on 378k clinical notes, the framework finds that routine encounter metadata (name, DOB, provider, visit date) elicits high verbatim memorization and sensitive-diagnosis recovery (AUROC 0.91 for abortion, 0.81 for HIV). The study also finds that exact-match memorization overstates disclosure risk because 36% of memorized tokens reflect templated documentation. The work provides a practical contextual privacy evaluation methodology for medical LMs trained on longitudinal patient data.

Evaluation and Benchmarking AI Safety Research Clinically Grounded Privacy Evaluation of Medical LMs +1 more

4 · arXiv · cs.CL · 47h ago

MedRLM: Recursive multimodal agent framework for long-context clinical decision support

MedRLM is a proposed framework for clinical decision support that uses recursive multi-agent reasoning over heterogeneous patient data including EHRs, medical images, physiological sensor streams, and clinical guidelines. Rather than single-step prompting, it decomposes patient cases into an inspectable external environment coordinated by specialized agents, with a Clinical Evidence Graph Memory and sensor-triggered deeper reasoning. The paper outlines an evaluation design using public and credentialed clinical datasets spanning radiology, ECG, ICU time series, and referral outcomes. The work targets a gap between static medical QA benchmarks and real-world longitudinal clinical workflows.

Agent and Tool Ecosystem Multimodal Progress MedRLM Clinical Evidence Graph Memory

5 · Hugging Face Blog · 1mo ago

The Open Medical-LLM Leaderboard: Benchmarking Large Language Models in Healthcare

Hugging Face has launched the Open Medical-LLM Leaderboard, a public benchmark for evaluating large language models on healthcare and medical tasks. The leaderboard aggregates performance across multiple medical question-answering datasets to enable standardized comparison of open-weight models in clinical and biomedical domains. This initiative aims to accelerate progress in medical AI by providing transparent, reproducible evaluation infrastructure.

Evaluation and Benchmarking Open Weights Progress PubMedQA Open Medical-LLM Leaderboard MedMCQA +3 more

4 · arXiv · cs.CL · 11d ago

Dep-LLM: Training-free depression diagnosis framework using structured multi-factor LLM reasoning

Dep-LLM is a training-free framework for automatic depression detection from clinical interviews that uses frozen foundation LLMs without fine-tuning. The system decomposes long clinical dialogues into five thematic factors via Chain-of-Thought analysis, applies token-level entropy-based confidence modulation, and integrates multi-factor signals for final diagnosis. Evaluated on DAIC-WOZ and E-DAIC datasets, it outperforms zero-shot baselines across 21 foundation LLMs and surpasses supervised domain-specific and commercial LLMs on multiple metrics.

Evaluation and Benchmarking Agent and Tool Ecosystem Chain-of-Thought Reasoning Dep-LLM DAIC-WOZ +1 more
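One plausible reading of "token-level entropy-based confidence modulation" is sketched below. The summary does not specify Dep-LLM's actual formula, so everything here is an assumption: the normalization by log vocabulary size, the `confidence` mapping, and the confidence-weighted `fuse` step are hypothetical illustrations of the general idea.

```python
import math


def entropy(probs):
    """Shannon entropy (nats) of one token's probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def confidence(token_dists):
    """Mean normalized entropy over tokens, flipped so 1.0 means fully certain.
    Assumes each distribution has more than one entry."""
    norm = [entropy(d) / math.log(len(d)) for d in token_dists]
    return 1.0 - sum(norm) / len(norm)


def fuse(factor_scores, factor_confidences):
    """Confidence-weighted average of per-factor scores (hypothetical fusion rule)."""
    total = sum(factor_confidences)
    if total == 0:
        return sum(factor_scores) / len(factor_scores)
    return sum(s * c for s, c in zip(factor_scores, factor_confidences)) / total


# A factor judged with peaked token distributions vs. one with uniform ones.
sure = [[0.97, 0.01, 0.01, 0.01]] * 3
unsure = [[0.25, 0.25, 0.25, 0.25]] * 3
scores = [0.9, 0.1]
weights = [confidence(sure), confidence(unsure)]
fused = fuse(scores, weights)
print(weights, fused)
```

In this toy setup the uncertain factor receives a weight near zero, so the fused score is dominated by the factor the model answered with low entropy.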

4 · arXiv · cs.CL · 18d ago

Sentence-Level Clinical Provenance Categorization for Multidisciplinary Hospital Summarization Using Fine-Tuned Llama-3

This pilot study presents a pipeline for categorizing sentence-level clinical provenance across multi-source hospital notes, targeting structured summarization in high-complexity settings like the NICU. The authors fine-tune Llama-3 8B and 70B models on MedSecId (MIMIC-III annotations), achieving Macro F1 above 92% in-domain. Cross-domain evaluation reveals a scale-dependent transfer effect: SFT substantially improves the 70B model (+7% Macro F1) but yields only marginal gains for the 8B model. A quantized fine-tuned 70B model outperforms its full-precision baseline while reducing compute, suggesting quantized adaptation is viable for structured clinical NLP tasks.

Inference Economics Enterprise Deployment Patterns MIMIC-III Llama 3.1 70B quantization +4 more