4 · arXiv cs.CL (Computation and Language) · 19d ago

Evidence-Augmented ML for Self-Harm Surveillance in Emergency Department Triage Notes

Researchers developed a three-stage pipeline combining traditional machine learning with LLM-based screening and evidence extraction to detect self-harm in Australian emergency department triage notes. The system achieved an area under the precision-recall curve (AUPRC) of around 0.88 in both internal and external validation, and transferred to two external hospital sites without site-specific retraining. It also identifies the primary self-harm method with 95% accuracy, which enables more granular public health surveillance than binary classification alone.

Enterprise Deployment Patterns AUPRC LLM-augmented clinical NLP pipeline evidence extraction Australian Emergency Department triage notes
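
For readers unfamiliar with the headline metric, the sketch below computes AUPRC with the step-wise average-precision estimator, which is one common way to summarize a precision-recall curve. The summary does not say which estimator the authors used, so this is an assumption, and the labels and scores are made-up toy values rather than data from the paper.

```python
def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve (average precision).

    labels: 0/1 ground truth; scores: higher means "more likely positive".
    Tied scores get no special handling in this sketch.
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive label")
    # Rank notes from most to least suspicious.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_pos = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_pos += 1
            total += true_pos / rank  # precision at each recalled positive
    return total / n_pos


# Toy example (hypothetical values): 3 positives among 8 notes.
toy_labels = [1, 0, 1, 0, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
print(round(average_precision(toy_labels, toy_scores), 4))  # 0.7222
```

AUPRC is often preferred over accuracy when positives are rare, because it ignores the large pool of true negatives and focuses on how well the positives are ranked.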

Related guides (1)

Enterprise Deployment Patterns: From AI Demo to Production Reality

Related events (8)

5 · arXiv · cs.CL · 4d ago

Fine-tuning LLMs to passively estimate depression severity from AI mental health conversations

Researchers fine-tune a Qwen3.5-27B model with a regression head to predict PHQ-9 depression severity scores directly from AI mental health app conversation transcripts, removing the need for users to complete the self-report questionnaire. The training set of 6,283 users combines 3,111 ground-truth labels with pseudolabels generated by Claude Opus and by iteratively trained intermediate models. On a held-out test set of 842 users, the best model achieves MAE=2.6, Pearson r=0.80, and AUC=0.91 at the clinical PHQ-9≥10 threshold, with AUC>0.87 across all severity thresholds. The work demonstrates a passive, continuous symptom-monitoring approach that could reduce response bias on mental health platforms.

Enterprise Deployment Patterns Claude Opus 4.6 Patient Health Questionnaire-9 Qwen3.6-27B +1 more
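
The three reported metrics can be made concrete with a small sketch. It assumes the AUC is computed by binarizing the true PHQ-9 score at the threshold and ranking users by the predicted score. The summary does not spell out that protocol, and the scores below are invented toy values.

```python
from math import sqrt


def mae(y_true, y_pred):
    """Mean absolute error between true and predicted PHQ-9 scores."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def pearson_r(x, y):
    """Pearson correlation coefficient."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def auc_at_threshold(y_true, y_pred, threshold=10):
    """ROC AUC for 'true score >= threshold', using predictions as ranking scores.

    Computed as the fraction of (positive, negative) pairs ranked correctly,
    with ties counted as half.
    """
    pos = [p for t, p in zip(y_true, y_pred) if t >= threshold]
    neg = [p for t, p in zip(y_true, y_pred) if t < threshold]
    if not pos or not neg:
        raise ValueError("need users on both sides of the threshold")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: six users' true and predicted PHQ-9 totals.
true_scores = [2, 5, 9, 12, 15, 20]
pred_scores = [3.0, 4.0, 11.0, 10.0, 16.0, 18.0]
print(mae(true_scores, pred_scores))                          # 1.5
print(round(auc_at_threshold(true_scores, pred_scores), 4))   # 0.8889
print(pearson_r(true_scores, pred_scores))
```

Sweeping `threshold` over other PHQ-9 cut-offs gives the kind of per-threshold AUC the summary refers to.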

5 · arXiv · cs.CL · 19d ago

Systematic Evaluation of LLM Safety Failures on Eating Disorder Queries with Clinician Feedback

This paper investigates how LLMs respond to queries from users with eating disorders (EDs), finding that specific linguistic cues in prompts increase the likelihood of unsafe model responses. Working with clinicians who specialize in eating disorders, the authors systematically vary the risk level of user prompts to measure how far LLMs uncritically adapt to potentially dangerous inputs. The study highlights a gap between perceived model safety and actual harm facilitation in sensitive health contexts.

Evaluation and Benchmarking AI Safety Research clinical ED experts large language models eating disorder safety evaluation

5 · arXiv · cs.CL · 3d ago

ClaMPAPP: Hybrid LLM-ML system uses language models as interfaces for pediatric appendicitis diagnosis

Researchers introduce ClaMPAPP, a hybrid clinical decision support system that uses an LLM solely for structured feature extraction from free-text clinical notes, then passes validated features to an XGBoost classifier for final diagnosis. Evaluated on two independent German pediatric appendicitis cohorts, ClaMPAPP outperformed end-to-end LLM baselines on diagnostic performance and showed greater robustness to narrative reordering. The work formalizes an 'LLM-as-interface, ML-as-predictor' design pattern that separates natural-language usability from predictive inference, offering a more auditable pathway for clinical AI.

Enterprise Deployment Patterns Agent and Tool Ecosystem XGBoost ClaMPAPP
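
The 'LLM-as-interface, ML-as-predictor' split described above can be sketched as three separable steps: extract, validate, predict. Everything named here is hypothetical: the feature schema, the stub extractor standing in for an LLM call, and the toy voting rule standing in for a trained XGBoost model. The thresholds are arbitrary placeholders with no clinical meaning and are not taken from the paper.

```python
from typing import Callable, Dict, Tuple

# Hypothetical schema: feature name -> (type, min, max). Not from the paper.
SCHEMA: Dict[str, Tuple[type, float, float]] = {
    "temperature_c": (float, 30.0, 45.0),
    "wbc_count": (float, 0.0, 100.0),
    "rlq_tenderness": (bool, 0, 1),
}


def validate(features: dict) -> dict:
    """Reject missing, mistyped, or out-of-range features before prediction."""
    clean = {}
    for name, (kind, low, high) in SCHEMA.items():
        if name not in features:
            raise ValueError(f"missing feature: {name}")
        value = features[name]
        # Accept plain ints for float features, but never bools.
        if kind is float and isinstance(value, int) and not isinstance(value, bool):
            value = float(value)
        if not isinstance(value, kind):
            raise ValueError(f"{name}: expected {kind.__name__}")
        if not (low <= value <= high):
            raise ValueError(f"{name}: {value} outside [{low}, {high}]")
        clean[name] = value
    return clean


def diagnose(note: str,
             extract: Callable[[str], dict],
             predict: Callable[[dict], float]) -> float:
    """The LLM only extracts; the prediction comes from a separate model."""
    return predict(validate(extract(note)))


def stub_extract(note: str) -> dict:
    """Stand-in for an LLM extraction call (assumption, no real API used)."""
    return {
        "temperature_c": 38.4,
        "wbc_count": 14.2,
        "rlq_tenderness": "tender" in note.lower(),
    }


def toy_predict(f: dict) -> float:
    """Placeholder for a trained classifier; arbitrary rule, not clinical."""
    votes = [f["temperature_c"] >= 38.0, f["wbc_count"] >= 12.0, f["rlq_tenderness"]]
    return sum(votes) / len(votes)


score = diagnose("Right lower quadrant tender on palpation.", stub_extract, toy_predict)
print(score)  # 1.0
```

Because the validated feature dictionary sits between the two stages, it can be logged and reviewed on its own, which is one plausible reading of why the summary calls this pathway more auditable.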

4 · arXiv · cs.CL · 19d ago

Sentence-Level Clinical Provenance Categorization for Multidisciplinary Hospital Summarization Using Fine-Tuned Llama-3

This pilot study presents a pipeline for categorizing sentence-level clinical provenance across multi-source hospital notes, targeting structured summarization in high-complexity settings like the NICU. The authors fine-tune Llama-3 8B and 70B models on MedSecId (MIMIC-III annotations), achieving Macro F1 above 92% in-domain. Cross-domain evaluation reveals a scale-dependent transfer effect: SFT substantially improves the 70B model (+7% Macro F1) but yields only marginal gains for the 8B model. A quantized fine-tuned 70B model outperforms its full-precision baseline while reducing compute, suggesting quantized adaptation is viable for structured clinical NLP tasks.

Inference Economics Enterprise Deployment Patterns MIMIC-III Llama 3.1 70B quantization +4 more

6 · arXiv · cs.CL · 12d ago

Clinically grounded privacy evaluation framework reveals high memorization risk in medical LMs

Researchers introduce a tiered adversarial framework for evaluating privacy leakage in medical language models, moving beyond simple training-text recovery to realistic clinical threat models. Applied to an LM pretrained on 378k clinical notes, the framework finds that routine encounter metadata (name, DOB, provider, visit date) elicits high verbatim memorization and sensitive-diagnosis recovery (AUROC 0.91 for abortion, 0.81 for HIV). The study also finds that exact-match memorization overstates disclosure risk because 36% of memorized tokens reflect templated documentation. The work provides a practical contextual privacy evaluation methodology for medical LMs trained on longitudinal patient data.

Evaluation and Benchmarking AI Safety Research Clinically Grounded Privacy Evaluation of Medical LMs +1 more

4 · arXiv · cs.CL · 2d ago

MedRLM: Recursive multimodal agent framework for long-context clinical decision support

MedRLM is a proposed framework for clinical decision support that uses recursive multi-agent reasoning over heterogeneous patient data including EHRs, medical images, physiological sensor streams, and clinical guidelines. Rather than single-step prompting, it decomposes patient cases into an inspectable external environment coordinated by specialized agents, with a Clinical Evidence Graph Memory and sensor-triggered deeper reasoning. The paper outlines an evaluation design using public and credentialed clinical datasets spanning radiology, ECG, ICU time series, and referral outcomes. The work targets a gap between static medical QA benchmarks and real-world longitudinal clinical workflows.

Agent and Tool Ecosystem Multimodal Progress MedRLM Clinical Evidence Graph Memory

4 · arXiv · cs.CL · 19d ago

LLM-Assisted Discovery of ADHD Signals in Turkish Teacher Narratives Beyond Rating Scales

This study analyzes de-identified Turkish teacher evaluation forms from clinical ADHD assessments, comparing predictive signals from structured rating scales (CTRS-R:S) and open-ended teacher narratives. The authors find that structured and narrative information encode complementary signals, with minimal overlap between cases missed by each modality. An LLM-assisted theme discovery pipeline reveals distinct attention, behavioral, and family-related patterns in narratives that structured scales miss, demonstrating NLP's potential to augment traditional ADHD screening.

Evaluation and Benchmarking Enterprise Deployment Patterns LLM-assisted theme discovery pipeline Natural Language Processing Conners' Teacher Rating Scale-Revised Short Form +1 more

5 · arXiv · cs.CL · 23d ago

LLUMI: Fine-Tuning Open-Source LLMs for Mental Health Writing Assistance Using Reddit Community Feedback

LLUMI is a two-component system (a generation model and an improvement model) designed to provide mental health writing assistance using smaller open-source LLMs hosted in privacy-preserving, on-premise environments. The system leverages Reddit community endorsement signals (upvotes/downvotes) to construct preference pairs for SFT and DPO training, then further aligns outputs via human evaluation across readability, empathy, connection, actionability, and safety dimensions. Results show LLUMI achieves performance comparable to proprietary GPT-based models on linguistic and human evaluations, suggesting community-derived preference signals can substitute for expensive expert labeling in sensitive domains.

Open Weights Progress AI Safety Research Reddit LLUMI Direct Preference Optimization (DPO)+3 more
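
One way community votes could be turned into preference pairs is sketched below. The summary does not describe LLUMI's actual pairing rule, so the data layout, the `min_margin` filter, and the example thread are all assumptions made for illustration. The `prompt`/`chosen`/`rejected` keys follow a naming convention commonly used for DPO datasets.

```python
from itertools import combinations


def build_preference_pairs(threads, min_margin=10):
    """Pair replies to the same post; the higher-voted reply becomes 'chosen'.

    threads: list of {"post": str, "replies": [{"text": str, "score": int}]}
    min_margin: hypothetical filter that skips pairs whose net vote scores
    are too close to carry a clear preference signal.
    """
    pairs = []
    for thread in threads:
        for a, b in combinations(thread["replies"], 2):
            if abs(a["score"] - b["score"]) < min_margin:
                continue
            chosen, rejected = (a, b) if a["score"] > b["score"] else (b, a)
            pairs.append({
                "prompt": thread["post"],
                "chosen": chosen["text"],
                "rejected": rejected["text"],
            })
    return pairs


# Hypothetical thread with three replies and their net vote scores.
toy_threads = [{
    "post": "I have been feeling overwhelmed lately.",
    "replies": [
        {"text": "reply A", "score": 42},
        {"text": "reply B", "score": 3},
        {"text": "reply C", "score": -2},
    ],
}]
pairs = build_preference_pairs(toy_threads)
print(len(pairs))  # 2
```

With `min_margin=10`, replies B and C are too close in score to form a pair, so only the two pairs involving reply A are kept.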