4 · arXiv cs.CL (Computation and Language) · 19d ago

Sentence-Level Clinical Provenance Categorization for Multidisciplinary Hospital Summarization Using Fine-Tuned Llama-3

This pilot study presents a pipeline for categorizing sentence-level clinical provenance across multi-source hospital notes, targeting structured summarization in high-complexity settings such as the NICU. The authors fine-tune Llama-3 8B and 70B models on MedSecId (MIMIC-III annotations), achieving Macro F1 above 92% in-domain. Cross-domain evaluation reveals a scale-dependent transfer effect: supervised fine-tuning (SFT) substantially improves the 70B model (+7% Macro F1) but yields only marginal gains for the 8B model. A quantized fine-tuned 70B model outperforms its full-precision baseline while reducing compute, suggesting quantized adaptation is viable for structured clinical NLP tasks.

Inference Economics · Enterprise Deployment Patterns · MIMIC-III · Llama 3.1 70B · quantization · Macro-F1 · MedSecId · Llama-3.1-8B · supervised fine-tuning
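The headline metric in the summary above, Macro F1, averages per-class F1 with equal weight, so a rare note category counts as much as a common one. A minimal sketch in plain Python, assuming made-up provenance labels (`nursing`, `physician`, `pharmacy`) that are illustrative only and not taken from MedSecId or the paper:

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every class seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    scores = []
    for c in labels:
        denom = 2 * tp[c] + fp[c] + fn[c]
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical labels: four common-class sentences, two rarer ones.
y_true = ["nursing", "nursing", "nursing", "nursing", "physician", "pharmacy"]
y_pred = ["nursing", "nursing", "nursing", "nursing", "physician", "physician"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy={accuracy:.3f}  macro_f1={macro_f1(y_true, y_pred):.3f}")
```

In this toy case one missed rare-class sentence leaves accuracy at 5/6 but pulls Macro F1 down to 5/9, which is why the metric suits imbalanced sentence categories.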

Related guides (3)

Enterprise Deployment Patterns · Topic guide

Enterprise Deployment Patterns: From AI Demo to Production Reality

Inference Economics · Topic guide

Inference Economics: The Cost of Running AI in Production

supervised fine-tuning · Concept

Supervised Fine-Tuning: Teaching an AI to Do Your Job

Related events (8)

4 · arXiv · cs.CL · 1mo ago

Automated ICD Classification of Psychiatric Diagnoses Using NLP and LLMs

This study evaluates NLP and ML approaches for automating the mapping of free-text psychiatric descriptions to ICD diagnostic codes, using a dataset of 145,513 Spanish clinical records. Methods range from classical BoW/TF-IDF representations to transformer-based embeddings including e5_large, BioLORD, and Llama-3-8B. Fine-tuned e5_large achieved the best performance with a micro-F1 of 0.866, outperforming classical methods by capturing semantic nuance and medical terminology. The work highlights challenges of long-tail label distributions and ambiguity specific to psychiatric clinical language.

Enterprise Deployment Patterns · Agent and Tool Ecosystem · International Classification of Diseases (ICD) · e5_large · Bag of Words (BoW)

5 · arXiv · cs.AI · 24d ago

Reverse Probing: Supervised Token-level Uncertainty Quantification for LLMs in Clinical Text

The paper introduces Reverse Probing, a novel uncertainty quantification framework designed specifically for clinical text summarization that estimates token-level uncertainty from pre-existing labeled summaries rather than sampling new outputs. It extracts uncertainty signals from four categories of internal model activations, treating text as a probe into the model's internal state. Evaluated on two expert-annotated clinical datasets, it outperforms eight adapted baselines on all metrics, achieving up to 4× higher AUPRC while reducing inference time and compute. Feature analysis identifies delta energy and neighborhood context as the most consistent predictors of uncertainty across models.

Evaluation and Benchmarking · AI Safety Research · Reverse Probing · delta energy · AUPRC

4 · arXiv · cs.CL · 1mo ago

Risk-Aware Hybrid Selective Classification for HIV Suspicion Identification in Spanish Clinical Notes

This paper proposes a hybrid selective classification framework for clinical NLP that explicitly handles both aleatoric and epistemic uncertainty to avoid overconfident predictions in medical triage settings. The system combines Mondrian conformal prediction with a Multi-Centroid Mahalanobis Distance veto, evaluated on HIV suspicion identification in Spanish clinical notes. The authors demonstrate that standard uncertainty metrics and baseline classifiers suffer coverage collapse under strict reliability constraints, while their dual-verification approach isolates a trustworthy operational domain. The work critiques inflated benchmark metrics that arise from forcing deterministic classification on inherently ambiguous clinical instances.

Evaluation and Benchmarking · AI Safety Research · HIV Suspicion Identification · Mondrian Conformal Prediction · Selective Classification
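To make the Mondrian conformal prediction component of the summary above concrete, here is a minimal sketch of the standard label-conditional construction: nonconformity scores are calibrated separately per class, and a label enters the prediction set when its class-conditional p-value exceeds a significance level. The function names, the `1 - probability` score, the toy calibration data, and `epsilon = 0.25` are all assumptions for illustration; the paper's Mahalanobis veto is not reproduced here.

```python
def mondrian_calibrate(cal_probs, cal_labels):
    """Group nonconformity scores (1 - probability of the true class) by true class."""
    scores = {}
    for probs, y in zip(cal_probs, cal_labels):
        scores.setdefault(y, []).append(1.0 - probs[y])
    return scores


def prediction_set(probs, scores_by_class, epsilon):
    """Keep every label whose class-conditional p-value exceeds epsilon."""
    kept = []
    for y, cal in scores_by_class.items():
        alpha = 1.0 - probs[y]  # nonconformity of the test point for label y
        p_value = (sum(s >= alpha for s in cal) + 1) / (len(cal) + 1)
        if p_value > epsilon:
            kept.append(y)
    return sorted(kept)


# Toy calibration set (hypothetical classifier outputs), four examples per class.
cal_probs = [
    {"neg": 0.90, "pos": 0.10}, {"neg": 0.60, "pos": 0.40},
    {"neg": 0.55, "pos": 0.45}, {"neg": 0.80, "pos": 0.20},
    {"neg": 0.40, "pos": 0.60}, {"neg": 0.10, "pos": 0.90},
    {"neg": 0.60, "pos": 0.40}, {"neg": 0.20, "pos": 0.80},
]
cal_labels = ["neg"] * 4 + ["pos"] * 4
scores_by_class = mondrian_calibrate(cal_probs, cal_labels)

confident = prediction_set({"neg": 0.97, "pos": 0.03}, scores_by_class, epsilon=0.25)
ambiguous = prediction_set({"neg": 0.58, "pos": 0.42}, scores_by_class, epsilon=0.25)
print(confident, ambiguous)
```

A single-label set can be acted on directly, while a set with both labels (or none) signals that the instance should be deferred rather than forced into a deterministic class.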

5 · arXiv · cs.CL · 23d ago

MedCase-Structured: A Text-to-FHIR Dataset for Benchmarking Diagnostic Reasoning in Clinically Realistic EHR Settings

The paper introduces a pipeline for converting unstructured clinical text into HL7 FHIR R4 bundles, enabling evaluation of LLMs in realistic electronic health record settings. Applied to the MedCaseReasoning dataset, it produces MedCase-Structured, a synthetic benchmark achieving valid FHIR generation for 82.5% of cases. Key finding: LLMs show consistently lower diagnostic accuracy on structured FHIR inputs compared to plain text, underscoring the gap between standard benchmarks and real-world clinical deployment conditions.

Evaluation and Benchmarking · Enterprise Deployment Patterns · HL7 FHIR R4 · large language models · MedCase-Structured

6 · arXiv · cs.CL · 12d ago

Clinically grounded privacy evaluation framework reveals high memorization risk in medical LMs

Researchers introduce a tiered adversarial framework for evaluating privacy leakage in medical language models, moving beyond simple training-text recovery to realistic clinical threat models. Applied to an LM pretrained on 378k clinical notes, the framework finds that routine encounter metadata (name, DOB, provider, visit date) elicits high verbatim memorization and sensitive-diagnosis recovery (AUROC 0.91 for abortion, 0.81 for HIV). The study also finds that exact-match memorization overstates disclosure risk because 36% of memorized tokens reflect templated documentation. The work provides a practical contextual privacy evaluation methodology for medical LMs trained on longitudinal patient data.

Evaluation and Benchmarking · AI Safety Research · Clinically Grounded Privacy Evaluation of Medical LMs

4 · arXiv · cs.CL · 20d ago

IndicBERT-HPA: Reliability-Oriented Multilingual Orthopedic Decision Support with Selective Verification Deferral

This paper presents a framework for classifying free-text orthopedic clinical notes in English, Hindi, and Punjabi, introducing IndicBERT-HPA, a domain-adaptive encoder augmented with language-aware orthopedic adapter heads. The system is evaluated against multilingual transformers, a DistilBERT baseline, and zero-shot LLMs, with zero-shot LLMs found substantially less effective than task-adapted encoders for closed-set clinical classification. IndicBERT-HPA achieves Macro-F1 of 0.8792 and AUPRC of 0.902 under natural clinical prevalence. A deterministic selective-verification layer combining confidence gating, evidence-consistency checking, and language-risk screening improves accuracy from 71.5% to 84.4% at 72.3% coverage on a 5,000-record held-out set.

Evaluation and Benchmarking · Enterprise Deployment Patterns · confidence gating · language-aware adapter heads · IndicBERT-HPA
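The accuracy-versus-coverage trade-off reported in the summary above can be illustrated with the simplest of the three checks, confidence gating. This is a sketch under stated assumptions: the function name, the threshold of 0.8, and the toy confidences are hypothetical, and the paper's evidence-consistency and language-risk checks are not modeled.

```python
def selective_report(confidences, correct, threshold):
    """Accept predictions with confidence >= threshold and defer the rest.

    Returns (coverage, accuracy on accepted predictions); accuracy is None
    when every prediction is deferred.
    """
    accepted = [ok for conf, ok in zip(confidences, correct) if conf >= threshold]
    coverage = len(accepted) / len(confidences)
    accuracy = sum(accepted) / len(accepted) if accepted else None
    return coverage, accuracy


# Toy data: model confidence per record and whether the prediction was right.
confidences = [0.95, 0.90, 0.85, 0.60, 0.55, 0.50, 0.92, 0.40]
correct = [True, True, False, False, True, False, True, False]

print(selective_report(confidences, correct, threshold=0.0))  # no gating
print(selective_report(confidences, correct, threshold=0.8))  # gate at 0.8
```

Raising the threshold shrinks coverage and, when confidence is informative, raises accuracy on the accepted subset; deferred records would go to human verification.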

5 · arXiv · cs.CL · 23d ago

LLUMI: Fine-Tuning Open-Source LLMs for Mental Health Writing Assistance Using Reddit Community Feedback

LLUMI is a two-component system (a generation model and an improvement model) designed to provide mental health writing assistance using smaller open-source LLMs hosted in privacy-preserving, on-premise environments. The system leverages Reddit community endorsement signals (upvotes/downvotes) to construct preference pairs for SFT and DPO training, then further aligns outputs via human evaluation across readability, empathy, connection, actionability, and safety dimensions. Results show LLUMI achieves performance comparable to proprietary GPT-based models on linguistic and human evaluations, suggesting community-derived preference signals can substitute for expensive expert labeling in sensitive domains.

Open Weights Progress · AI Safety Research · Reddit · LLUMI · Direct Preference Optimization (DPO)

5 · arXiv · cs.CL · 3d ago

ClaMPAPP: Hybrid LLM-ML system uses language models as interfaces for pediatric appendicitis diagnosis

Researchers introduce ClaMPAPP, a hybrid clinical decision support system that uses an LLM solely for structured feature extraction from free-text clinical notes, then passes validated features to an XGBoost classifier for final diagnosis. Evaluated on two independent German pediatric appendicitis cohorts, ClaMPAPP outperformed end-to-end LLM baselines on diagnostic performance and showed greater robustness to narrative reordering. The work formalizes an 'LLM-as-interface, ML-as-predictor' design pattern that separates natural-language usability from predictive inference, offering a more auditable pathway for clinical AI.

Enterprise Deployment Patterns · Agent and Tool Ecosystem · XGBoost · ClaMPAPP