4 · arXiv cs.CL (Computation and Language) · 1mo ago

Automated ICD Classification of Psychiatric Diagnoses Using NLP and LLMs

This study evaluates NLP and ML approaches for automating the mapping of free-text psychiatric descriptions to ICD diagnostic codes, using a dataset of 145,513 Spanish clinical records. Methods range from classical BoW/TF-IDF representations to transformer-based embeddings including e5_large, BioLORD, and Llama-3-8B. Fine-tuned e5_large achieved the best performance with a micro-F1 of 0.866, outperforming classical methods by capturing semantic nuance and medical terminology. The work highlights challenges of long-tail label distributions and ambiguity specific to psychiatric clinical language.
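
The micro-F1 figure above is easier to read alongside macro-F1 when labels are long-tailed. The sketch below is a minimal illustration, assuming single-label predictions and three ICD-10-style codes chosen only as examples; the data is a toy and is not taken from the study.

```python
from collections import Counter

def f1_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label multiclass predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gets a false positive
            fn[t] += 1  # true class gets a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Micro-F1 pools counts over all classes; macro-F1 averages per-class F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    return micro, macro

# Toy long-tail data (illustrative only): one frequent code, two rare ones.
y_true = ["F32"] * 8 + ["F41"] + ["F20"]
y_pred = ["F32"] * 10  # a classifier that always predicts the frequent code

micro, macro = f1_scores(y_true, y_pred)
print(f"micro-F1={micro:.3f}  macro-F1={macro:.3f}")
```

In this toy case micro-F1 stays high while macro-F1 drops sharply, because the two rare codes are never predicted. This is one reason long-tail label distributions are hard to judge from a single pooled score.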

Enterprise Deployment Patterns · Agent and Tool Ecosystem · International Classification of Diseases (ICD) · e5_large · Bag of Words (BoW) · TF-IDF · BioLORD · Llama-3.1-8B

Related guides (2)

Enterprise Deployment Patterns · Topic guide

Enterprise Deployment Patterns: From AI Demo to Production Reality

Agent and Tool Ecosystem · Topic guide

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer

Related events (8)

4 · arXiv · cs.CL · 18d ago

Sentence-Level Clinical Provenance Categorization for Multidisciplinary Hospital Summarization Using Fine-Tuned Llama-3

This pilot study presents a pipeline for categorizing sentence-level clinical provenance across multi-source hospital notes, targeting structured summarization in high-complexity settings like the NICU. The authors fine-tune Llama-3 8B and 70B models on MedSecId (MIMIC-III annotations), achieving Macro F1 above 92% in-domain. Cross-domain evaluation reveals a scale-dependent transfer effect: SFT substantially improves the 70B model (+7% Macro F1) but yields only marginal gains for the 8B model. A quantized fine-tuned 70B model outperforms its full-precision baseline while reducing compute, suggesting quantized adaptation is viable for structured clinical NLP tasks.

Inference Economics · Enterprise Deployment Patterns · MIMIC-III · Llama 3.1 70B · quantization · +4 more

4 · arXiv · cs.CL · 3d ago

LLMs predict dementia and depression severity from clinical interview transcripts in zero-shot and feature-extraction settings

Researchers evaluate three open-weights LLMs (Mistral 3.1, DeepHermes, Qwen3) for predicting dementia and depression severity from speech transcripts of 154 German-speaking patients in standardized clinical interviews. The study introduces a new observer-based Global Depression Scale (GDS-D) and tests both zero-shot prediction and LLM-based feature extraction for Support Vector Regression. Zero-shot performs well for depression (MAE 0.60), while structured feature extraction reduces dementia assessment error by up to 35%; pause-enriched automatic transcripts match human transcription quality, suggesting viable fully-automated screening pipelines.

Evaluation and Benchmarking · Open Weights Progress · DeepHermes · Qwen3 · Global Deterioration Scale · +2 more

5 · arXiv · cs.CL · 2d ago

ClaMPAPP: Hybrid LLM-ML system uses language models as interfaces for pediatric appendicitis diagnosis

Researchers introduce ClaMPAPP, a hybrid clinical decision support system that uses an LLM solely for structured feature extraction from free-text clinical notes, then passes validated features to an XGBoost classifier for final diagnosis. Evaluated on two independent German pediatric appendicitis cohorts, ClaMPAPP outperformed end-to-end LLM baselines on diagnostic performance and showed greater robustness to narrative reordering. The work formalizes an 'LLM-as-interface, ML-as-predictor' design pattern that separates natural-language usability from predictive inference, offering a more auditable pathway for clinical AI.
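
The 'LLM-as-interface, ML-as-predictor' split can be sketched as three stages: extract, validate, predict. The code below is only an illustration of that separation. The feature schema, the stub extractor, and the toy decision rule are all hypothetical stand-ins; a real system would call an LLM for extraction and a trained classifier such as XGBoost for prediction.

```python
from typing import Callable, Dict, Optional

# Hypothetical feature schema (name -> plausible numeric range); not from the paper.
SCHEMA = {
    "temperature_c": (34.0, 43.0),
    "wbc_count": (0.0, 60.0),
    "rebound_tenderness": (0.0, 1.0),
}

def validate(raw: Dict[str, object]) -> Optional[Dict[str, float]]:
    """Accept extracted features only if every field is present, numeric, and in range."""
    clean = {}
    for name, (lo, hi) in SCHEMA.items():
        value = raw.get(name)
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            return None
        if not lo <= value <= hi:
            return None
        clean[name] = float(value)
    return clean

def diagnose(note: str,
             extract: Callable[[str], Dict[str, object]],
             predict: Callable[[Dict[str, float]], int]) -> Optional[int]:
    """Run extract -> validate -> predict; return None when validation fails."""
    features = validate(extract(note))
    if features is None:
        return None  # abstain rather than predict on unvalidated input
    return predict(features)

# Stand-ins so the sketch runs without an LLM or a trained model.
def stub_extract(note: str) -> Dict[str, object]:
    # A real system would prompt an LLM to fill the schema from the note.
    return {"temperature_c": 38.4, "wbc_count": 15.2, "rebound_tenderness": 1}

def stub_predict(features: Dict[str, float]) -> int:
    # Arbitrary toy rule in place of a trained classifier; not a clinical criterion.
    return int(features["wbc_count"] > 12.0 and features["rebound_tenderness"] >= 0.5)

result = diagnose("free-text note", stub_extract, stub_predict)
abstained = diagnose("free-text note", lambda note: {"temperature_c": "high"}, stub_predict)
print(result, abstained)
```

Because the predictor only ever sees the validated feature dictionary, the order of sentences in the note can affect the outcome only through the extraction step, and every prediction can be traced back to explicit feature values.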

Enterprise Deployment Patterns · Agent and Tool Ecosystem · XGBoost · ClaMPAPP

4 · arXiv · cs.CL · 11d ago

Dep-LLM: Training-free depression diagnosis framework using structured multi-factor LLM reasoning

Dep-LLM is a training-free framework for automatic depression detection from clinical interviews that uses frozen foundation LLMs without fine-tuning. The system decomposes long clinical dialogues into five thematic factors via Chain-of-Thought analysis, applies token-level entropy-based confidence modulation, and integrates multi-factor signals for final diagnosis. Evaluated on DAIC-WOZ and E-DAIC datasets, it outperforms zero-shot baselines across 21 foundation LLMs and surpasses supervised domain-specific and commercial LLMs on multiple metrics.
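
The summary does not spell out how the entropy-based confidence modulation is computed. One plausible reading, shown here purely as an assumption, is to turn the mean normalized entropy of the model's token distributions into a confidence weight and use it when combining per-factor scores. The function names and the weighting rule are illustrative, not Dep-LLM's actual formula.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def confidence(token_dists):
    """Map mean normalized token entropy to a confidence score in [0, 1]."""
    normalized = []
    for probs in token_dists:
        max_h = math.log(len(probs))  # entropy of the uniform distribution
        normalized.append(token_entropy(probs) / max_h if max_h > 0 else 0.0)
    return 1.0 - sum(normalized) / len(normalized)

def combine(factor_scores, factor_confidences):
    """Confidence-weighted average of per-factor scores (assumed combination rule)."""
    total = sum(factor_confidences)
    if total == 0:
        return sum(factor_scores) / len(factor_scores)
    return sum(s * c for s, c in zip(factor_scores, factor_confidences)) / total

# Toy distributions over a 4-token vocabulary (illustrative only).
confident = confidence([[1.0, 0.0, 0.0, 0.0]])     # peaked distribution
uncertain = confidence([[0.25, 0.25, 0.25, 0.25]])  # uniform distribution
fused = combine([1.0, 0.0], [confident, uncertain])
print(confident, uncertain, fused)
```

With these toy inputs the peaked distribution gets a confidence near 1 and the uniform one a confidence near 0, so the fused score is dominated by the factor the model was sure about.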

Evaluation and Benchmarking · Agent and Tool Ecosystem · Chain-of-Thought Reasoning · Dep-LLM · DAIC-WOZ · +1 more

4 · arXiv · cs.CL · 1mo ago

Risk-Aware Hybrid Selective Classification for HIV Suspicion Identification in Spanish Clinical Notes

This paper proposes a hybrid selective classification framework for clinical NLP that explicitly handles both aleatoric and epistemic uncertainty to avoid overconfident predictions in medical triage settings. The system combines Mondrian conformal prediction with a Multi-Centroid Mahalanobis Distance veto, evaluated on HIV suspicion identification in Spanish clinical notes. The authors demonstrate that standard uncertainty metrics and baseline classifiers suffer coverage collapse under strict reliability constraints, while their dual-verification approach isolates a trustworthy operational domain. The work critiques inflated benchmark metrics that arise from forcing deterministic classification on inherently ambiguous clinical instances.
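
For readers unfamiliar with Mondrian (class-conditional) conformal prediction, the sketch below shows the core idea under simple assumptions: nonconformity is taken as 1 minus the probability assigned to a class, thresholds are calibrated separately per class, and the classifier abstains unless exactly one label survives. The calibration data is a toy, and the Mahalanobis-distance veto described above is not included.

```python
import math

def class_thresholds(cal_probs, cal_labels, alpha):
    """Per-class conformal quantile of nonconformity scores 1 - p(true class)."""
    by_class = {}
    for probs, y in zip(cal_probs, cal_labels):
        by_class.setdefault(y, []).append(1.0 - probs[y])
    thresholds = {}
    for y, scores in by_class.items():
        scores.sort()
        n = len(scores)
        k = math.ceil((n + 1) * (1 - alpha))  # 1-indexed rank of the quantile
        # With too few calibration points the threshold is infinite (always include).
        thresholds[y] = scores[k - 1] if k <= n else float("inf")
    return thresholds

def prediction_set(probs, thresholds):
    """All labels whose nonconformity score is within that label's threshold."""
    return {y for y, q in thresholds.items() if 1.0 - probs[y] <= q}

def decide(probs, thresholds):
    """Selective classification: return a label only for singleton sets, else None."""
    labels = prediction_set(probs, thresholds)
    return next(iter(labels)) if len(labels) == 1 else None

# Toy calibration set (illustrative only): three examples per class.
cal_probs = [
    {"pos": 0.90, "neg": 0.10}, {"pos": 0.80, "neg": 0.20}, {"pos": 0.40, "neg": 0.60},
    {"pos": 0.05, "neg": 0.95}, {"pos": 0.10, "neg": 0.90}, {"pos": 0.45, "neg": 0.55},
]
cal_labels = ["pos", "pos", "pos", "neg", "neg", "neg"]
thresholds = class_thresholds(cal_probs, cal_labels, alpha=0.25)

clear = decide({"pos": 0.90, "neg": 0.10}, thresholds)      # singleton set
ambiguous = decide({"pos": 0.42, "neg": 0.58}, thresholds)  # both labels plausible
print(clear, ambiguous)
```

The second call abstains because both labels fall inside their class thresholds. This is the behavior the summary contrasts with forcing a deterministic label on inherently ambiguous instances.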

Evaluation and Benchmarking · AI Safety Research · HIV Suspicion Identification · Mondrian Conformal Prediction · Selective Classification · +3 more

5 · arXiv · cs.CL · 22d ago

LLUMI: Fine-Tuning Open-Source LLMs for Mental Health Writing Assistance Using Reddit Community Feedback

LLUMI is a two-component system (a generation model and an improvement model) designed to provide mental health writing assistance using smaller open-source LLMs hosted in privacy-preserving, on-premise environments. The system leverages Reddit community endorsement signals (upvotes/downvotes) to construct preference pairs for SFT and DPO training, then further aligns outputs via human evaluation across readability, empathy, connection, actionability, and safety dimensions. Results show LLUMI achieves performance comparable to proprietary GPT-based models on linguistic and human evaluations, suggesting community-derived preference signals can substitute for expensive expert labeling in sensitive domains.
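
How the endorsement signals are turned into preference pairs is not detailed in the summary. A minimal sketch, assuming each reply carries a net score (upvotes minus downvotes) and that pairs are kept only when the scores differ by a margin, might look like this. The `min_gap` parameter and the `prompt`/`chosen`/`rejected` record layout are assumptions; the layout follows a convention commonly used for DPO training data.

```python
from itertools import combinations

def build_preference_pairs(prompt, replies, min_gap=5):
    """Build preference records from scored replies.

    replies: list of (text, net_score) tuples for the same post.
    min_gap: hypothetical minimum score difference for a pair to count as a preference.
    """
    pairs = []
    for (text_a, score_a), (text_b, score_b) in combinations(replies, 2):
        if abs(score_a - score_b) < min_gap:
            continue  # scores too close to treat as a clear community preference
        if score_a > score_b:
            chosen, rejected = text_a, text_b
        else:
            chosen, rejected = text_b, text_a
        pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs

# Toy thread (illustrative only).
replies = [("reply A", 40), ("reply B", 3), ("reply C", 1)]
pairs = build_preference_pairs("original post text", replies)
print(len(pairs), pairs[0])
```

Here the B/C pair is dropped because a two-point difference falls below the margin, leaving only pairs in which the highly endorsed reply is preferred.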

Open Weights Progress · AI Safety Research · Reddit · LLUMI · Direct Preference Optimization (DPO) · +3 more

4 · arXiv · cs.CL · 18d ago

LLM-Assisted Discovery of ADHD Signals in Turkish Teacher Narratives Beyond Rating Scales

This study analyzes de-identified Turkish teacher evaluation forms from clinical ADHD assessments, comparing predictive signals from structured rating scales (CTRS-R:S) and open-ended teacher narratives. The authors find that structured and narrative information encode complementary signals, with minimal overlap between cases missed by each modality. An LLM-assisted theme discovery pipeline reveals distinct attention, behavioral, and family-related patterns in narratives that structured scales miss, demonstrating NLP's potential to augment traditional ADHD screening.

Evaluation and Benchmarking · Enterprise Deployment Patterns · LLM-assisted theme discovery pipeline · Natural Language Processing · Conners' Teacher Rating Scale-Revised Short Form · +1 more

4 · arXiv · cs.CL · 25d ago

Forgotten Words: Benchmarking NeoBERT for Dementia Detection in Low-Resource Conversational Filipino and English Speech

This paper presents the first NLP-based dementia detection study for Filipino speech, constructing a parallel bilingual dataset of 4,000 DementiaBank-derived transcripts with manual Filipino translations. Five model families are evaluated across monolingual, zero-shot cross-lingual, and bilingual fine-tuning settings. English-trained BERT degrades sharply on Filipino (Macro-F1 = 0.455), but bilingual fine-tuning recovers performance to Macro-F1 = 0.969–0.973 across all transformer models. The key finding is that multilingual clinical NLP performance is driven by linguistic coverage during training rather than model scale or architecture.

Evaluation and Benchmarking · TF-IDF + Logistic Regression · NeoBERT · DementiaBank · +4 more