4 · arXiv cs.CL (Computation and Language) · 3d ago

LLMs predict dementia and depression severity from clinical interview transcripts in zero-shot and feature-extraction settings

Researchers evaluate three open-weights LLMs (Mistral 3.1, DeepHermes, Qwen3) for predicting dementia and depression severity from speech transcripts of 154 German-speaking patients in standardized clinical interviews. The study introduces a new observer-based Global Depression Scale (GDS-D) and tests both zero-shot prediction and LLM-based feature extraction as input to Support Vector Regression. Zero-shot prediction performs well for depression (MAE 0.60), while structured feature extraction reduces dementia assessment error by up to 35%. Pause-enriched automatic transcripts match human transcription quality, suggesting that fully automated screening pipelines are viable.

Evaluation and Benchmarking · Open Weights Progress · DeepHermes · Qwen3 · Global Deterioration Scale · Mistral 3.1 · Global Depression Scale (GDS-D)
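
The summary above reports results as a mean absolute error (MAE) and as a relative error reduction. As a minimal sketch of how those two quantities are computed, using invented ratings and hypothetical function names (nothing here is taken from the paper):

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between reference and predicted scores."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def relative_reduction(baseline_error, new_error):
    """Fraction by which new_error is lower than baseline_error."""
    return (baseline_error - new_error) / baseline_error


# Invented clinician ratings and two invented sets of model predictions.
ratings = [1, 2, 3, 4, 5]
zero_shot = [2, 2, 5, 3, 3]       # hypothetical zero-shot predictions
feature_based = [1, 2, 4, 3, 4]   # hypothetical feature-based predictions

mae_zero_shot = mean_absolute_error(ratings, zero_shot)
mae_features = mean_absolute_error(ratings, feature_based)
reduction = relative_reduction(mae_zero_shot, mae_features)
```

A reduction of 0.35 on this scale would correspond to the "up to 35%" figure quoted in the summary.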

Related guides (2)

Open Weights Progress · Topic guide

Open Weights Progress: How Freely Available AI Models Caught Up to the Frontier

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

4 · arXiv · cs.CL · 11d ago

Dep-LLM: Training-free depression diagnosis framework using structured multi-factor LLM reasoning

Dep-LLM is a training-free framework for automatic depression detection from clinical interviews that uses frozen foundation LLMs without fine-tuning. The system decomposes long clinical dialogues into five thematic factors via Chain-of-Thought analysis, applies token-level entropy-based confidence modulation, and integrates multi-factor signals for final diagnosis. Evaluated on DAIC-WOZ and E-DAIC datasets, it outperforms zero-shot baselines across 21 foundation LLMs and surpasses supervised domain-specific and commercial LLMs on multiple metrics.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Chain-of-Thought Reasoning · Dep-LLM · DAIC-WOZ · +1 more
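
The Dep-LLM summary mentions token-level entropy-based confidence modulation and multi-factor integration without giving formulas. The sketch below shows one plausible reading only: confidence is taken as one minus the mean normalized Shannon entropy of the token distributions, and factor scores are averaged with those confidences as weights. Both choices and all function names are assumptions for illustration, not the paper's method.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one token's probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def confidence(token_distributions):
    """Map mean normalized entropy to a confidence in [0, 1].

    1.0 means every distribution was one-hot; 0.0 means every one was uniform.
    """
    normalized = []
    for probs in token_distributions:
        if len(probs) < 2:
            raise ValueError("each distribution needs at least two outcomes")
        max_entropy = math.log(len(probs))  # entropy of the uniform distribution
        normalized.append(token_entropy(probs) / max_entropy)
    return 1.0 - sum(normalized) / len(normalized)


def combine_factors(factor_scores, factor_confidences):
    """Confidence-weighted average of per-factor scores (assumed scheme)."""
    total = sum(factor_confidences)
    if total == 0:
        raise ValueError("all factor confidences are zero")
    return sum(s * c for s, c in zip(factor_scores, factor_confidences)) / total
```

Under this scheme a factor whose answer tokens were generated with near-uniform probabilities contributes little to the final score, while a confidently generated factor dominates.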

5 · arXiv · cs.CL · 3d ago

Fine-tuning LLMs to passively estimate depression severity from AI mental health conversations

Researchers fine-tune a Qwen3.5-27B model with a regression head to predict PHQ-9 depression severity scores directly from AI mental health app conversation transcripts, removing the need for users to complete a self-report questionnaire. The training set of 6,283 users combines 3,111 ground-truth labels with pseudolabels generated by Claude Opus and iterative intermediate models. On a held-out test set of 842 users, the best model achieves MAE=2.6, Pearson r=0.80, and AUC=0.91 at the clinical PHQ-9≥10 threshold, with AUC>0.87 across all severity thresholds. The work demonstrates a passive, continuous symptom-monitoring approach that could reduce response bias in mental health platforms.

Enterprise Deployment Patterns · Claude Opus 4.6 · Patient Health Questionnaire-9 · Qwen3.6-27B · +1 more
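
The PHQ-9 summary reports Pearson r and an AUC at the PHQ-9≥10 threshold. The following is a minimal pure-Python sketch of both metrics, assuming the AUC treats true scores at or above the threshold as positives and ranks cases by the predicted score; the data and function names are invented for illustration.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def auc_at_threshold(true_scores, predicted_scores, threshold=10):
    """AUC for detecting true_score >= threshold from predicted scores.

    Computed as the probability that a random positive case receives a
    higher prediction than a random negative case (ties count as half).
    """
    pairs = list(zip(true_scores, predicted_scores))
    positives = [p for t, p in pairs if t >= threshold]
    negatives = [p for t, p in pairs if t < threshold]
    if not positives or not negatives:
        raise ValueError("need at least one case on each side of the threshold")
    wins = sum(
        1.0 if pos > neg else 0.5 if pos == neg else 0.0
        for pos in positives
        for neg in negatives
    )
    return wins / (len(positives) * len(negatives))


# Invented PHQ-9 totals (0-27) and invented model predictions.
true_phq9 = [2, 5, 9, 12, 15, 20]
predicted = [3, 6, 11, 10, 14, 18]

r = pearson_r(true_phq9, predicted)
auc = auc_at_threshold(true_phq9, predicted, threshold=10)
```

Sweeping `threshold` over other cut-offs gives the kind of per-threshold AUC the summary refers to.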

5 · arXiv · cs.CL · 22d ago

LLUMI: Fine-Tuning Open-Source LLMs for Mental Health Writing Assistance Using Reddit Community Feedback

LLUMI is a two-component system (a generation model and an improvement model) designed to provide mental health writing assistance using smaller open-source LLMs hosted in privacy-preserving, on-premise environments. The system leverages Reddit community endorsement signals (upvotes/downvotes) to construct preference pairs for SFT and DPO training, then further aligns outputs via human evaluation across readability, empathy, connection, actionability, and safety dimensions. Results show LLUMI achieves performance comparable to proprietary GPT-based models on linguistic and human evaluations, suggesting community-derived preference signals can substitute for expensive expert labeling in sensitive domains.

Open Weights Progress · AI Safety Research · Reddit · LLUMI · Direct Preference Optimization (DPO) · +3 more
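
The LLUMI summary describes turning upvote/downvote signals into preference pairs for DPO. The sketch below assumes replies to the same post are compared by net score and that pairs with too small a gap are dropped. The field names, the `min_gap` parameter, and the prompt/chosen/rejected record layout are illustrative assumptions, not details from the paper.

```python
from itertools import combinations


def build_preference_pairs(replies, min_gap=1):
    """Turn scored replies into prompt/chosen/rejected records.

    `replies` is a list of dicts with keys "post", "text", and "score",
    where "score" is upvotes minus downvotes.
    """
    by_post = {}
    for reply in replies:
        by_post.setdefault(reply["post"], []).append(reply)

    pairs = []
    for post, group in by_post.items():
        for a, b in combinations(group, 2):
            if abs(a["score"] - b["score"]) < min_gap:
                continue  # no clear community preference between these two
            chosen, rejected = (a, b) if a["score"] > b["score"] else (b, a)
            pairs.append(
                {"prompt": post, "chosen": chosen["text"], "rejected": rejected["text"]}
            )
    return pairs


# Invented example data.
example_replies = [
    {"post": "p1", "text": "supportive reply", "score": 10},
    {"post": "p1", "text": "dismissive reply", "score": 2},
    {"post": "p1", "text": "off-topic reply", "score": 2},
    {"post": "p2", "text": "only reply", "score": 5},
]
example_pairs = build_preference_pairs(example_replies)
```

Raising `min_gap` trades dataset size for cleaner preference labels, since small score differences on a forum are often noise.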

3 · arXiv · cs.CL · 2d ago

Speech-based dementia screening using Whisper embeddings to compensate for nonverbal subtest omissions

Researchers present a speech-based evaluation system for the German Syndrom-Kurz-Test dementia screening battery, combining transcript-derived scores with Whisper embeddings to reduce transcription scoring errors. The system also approximates expert overall ratings even when motor (nonverbal) subtests are omitted, addressing a key accessibility limitation of speech-only assessment. Models show strong correlation with expert ratings and effective discrimination between cognitive status groups.

Syndrom-Kurz-Test · Whisper

4 · arXiv · cs.CL · 1mo ago

Automated ICD Classification of Psychiatric Diagnoses Using NLP and LLMs

This study evaluates NLP and ML approaches for automating the mapping of free-text psychiatric descriptions to ICD diagnostic codes, using a dataset of 145,513 Spanish clinical records. Methods range from classical BoW/TF-IDF representations to transformer-based embeddings including e5_large, BioLORD, and Llama-3-8B. Fine-tuned e5_large achieved the best performance with a micro-F1 of 0.866, outperforming classical methods by capturing semantic nuance and medical terminology. The work highlights challenges of long-tail label distributions and ambiguity specific to psychiatric clinical language.

Enterprise Deployment Patterns · Agent and Tool Ecosystem · International Classification of Diseases (ICD) · e5_large · Bag of Words (BoW) · +3 more
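
The ICD summary quotes a micro-F1 and mentions long-tail label distributions. As a minimal sketch of why the averaging choice matters, assuming single-label classification, the code below computes micro-F1 (pooled counts) alongside macro-F1 (unweighted mean over labels). The label codes and predictions are invented for illustration.

```python
from collections import Counter


def micro_macro_f1(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label predictions."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted label gets a false positive
            fn[t] += 1  # true label gets a false negative

    total_tp, total_fp, total_fn = sum(tp.values()), sum(fp.values()), sum(fn.values())
    micro_denom = 2 * total_tp + total_fp + total_fn
    micro = 2 * total_tp / micro_denom if micro_denom else 0.0

    per_label = []
    for label in set(y_true) | set(y_pred):
        denom = 2 * tp[label] + fp[label] + fn[label]
        per_label.append(2 * tp[label] / denom if denom else 0.0)
    macro = sum(per_label) / len(per_label)
    return micro, macro


# Invented long-tailed example: one frequent code, two rare ones,
# and a classifier that always predicts the frequent code.
y_true = ["F32"] * 8 + ["F20", "F41"]
y_pred = ["F32"] * 10
micro_f1, macro_f1 = micro_macro_f1(y_true, y_pred)
```

In this example the micro score stays high while the macro score collapses, which is the usual symptom of rare labels being ignored.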

5 · arXiv · cs.CL · 1mo ago

Text Analytics Evaluation Framework: Benchmarking LLMs on Social Media NLP Tasks

Researchers introduce a 470-question evaluation framework to assess LLM performance on aggregated social media text, applied to Twitter datasets across sentiment analysis, hate speech detection, and emotion recognition. Results show that performance degrades substantially once the input exceeds 500 instances, particularly for open-weights models on numerical tasks. Multi-label and target-dependent scenarios also show notable performance drops, and accuracy falls progressively as tasks move from basic semantic identification to comparison and counting operations. The findings point to architectural bottlenecks in current LLMs for rigorous quantitative analysis over large text collections.

Long Context Evolution · Evaluation and Benchmarking · Emotion Recognition · Text Analytics Evaluation Framework · X (Twitter) · +3 more

5 · arXiv · cs.CL · 23d ago

Towards Reliable Multilingual LLMs-as-a-Judge: An Empirical Study

This paper systematically investigates strategies for extending LLM-based automatic evaluation (LLMs-as-a-Judge) to multilingual settings, covering high-, mid-, and low-resource languages (English, Spanish, Basque). The authors compare instruction translation, monolingual vs. multilingual supervision, and model size, finding that fine-tuned smaller models can match proprietary models when in-domain data is available, while zero-shot larger models are preferable out-of-domain. Two meta-evaluation datasets are extended to Spanish and Basque, and all data and code are publicly released.

Evaluation and Benchmarking · Basque language · LLM-as-a-Judge · mJudge · +2 more

5 · arXiv · cs.CL · 15d ago

LLMs fail to consistently simulate demographic perspective-taking in hate speech annotation

A new arXiv paper evaluates whether persona-conditioned LLMs can replicate how different demographic groups perceive hate speech, testing three dimensions: inter-group disagreement, in-group sensitivity, and vicarious prediction. No model consistently captures all three dimensions, and performance is highly model-dependent rather than emerging reliably from identity prompts alone. Vicarious prompting with Llama 3.1 provides the closest approximation to human disagreement patterns across demographic axes. The findings have implications for using LLMs as proxies for diverse human annotators in content moderation tasks.

Evaluation and Benchmarking · AI Safety Research · From Self to Other: Evaluating Demographic Perspective-Taking in LLM Hate Speech Annotation · Meta Llama-3.1-8B