3 · arXiv cs.CL (Computation and Language) · 2d ago

Speech-based dementia screening using Whisper embeddings to compensate for nonverbal subtest omissions

Researchers present a speech-based evaluation system for the German Syndrom-Kurz-Test dementia screening battery. It combines transcript-derived scores with Whisper embeddings to reduce scoring errors caused by transcription mistakes. The system also approximates expert overall ratings even when the motor (nonverbal) subtests are omitted, addressing a key accessibility limitation of speech-only assessment. The models correlate strongly with expert ratings and discriminate effectively between cognitive status groups.

Syndrom-Kurz-Test · Whisper

Related events (8)

4 · arXiv · cs.CL · 3d ago

LLMs predict dementia and depression severity from clinical interview transcripts in zero-shot and feature-extraction settings

Researchers evaluate three open-weights LLMs (Mistral 3.1, DeepHermes, Qwen3) for predicting dementia and depression severity from speech transcripts of 154 German-speaking patients in standardized clinical interviews. The study introduces a new observer-based Global Depression Scale (GDS-D) and tests both zero-shot prediction and LLM-based feature extraction for Support Vector Regression. Zero-shot prediction performs well for depression (MAE 0.60), while structured feature extraction reduces dementia assessment error by up to 35%. Pause-enriched automatic transcripts match human transcription quality, which suggests that fully automated screening pipelines are viable.

Evaluation and Benchmarking · Open Weights Progress · DeepHermes · Qwen3 · Global Deterioration Scale

4 · Hugging Face Blog · 1mo ago

Blazingly Fast Whisper Transcriptions with Inference Endpoints

Hugging Face published a blog post detailing optimized Whisper speech-to-text transcription deployments via their Inference Endpoints service. The post covers performance improvements using faster-whisper or similar optimized backends to achieve significantly reduced transcription latency. This is positioned as a practical deployment guide for production speech recognition workloads.

Inference Economics · Enterprise Deployment Patterns · Hugging Face Inference Endpoints · Hugging Face · faster-whisper

8 · OpenAI Blog · 1mo ago

Introducing Whisper

OpenAI introduced Whisper, an open-source automatic speech recognition (ASR) system trained on 680,000 hours of multilingual and multitask supervised data collected from the web. The model demonstrates strong robustness to accents, background noise, and technical language, approaching human-level accuracy in English transcription. Whisper supports transcription in multiple languages as well as translation to English, and the weights and inference code were released publicly.

Open Weights Progress · Agent and Tool Ecosystem · OpenAI · Whisper

4 · Hugging Face Blog · 1mo ago

Fine-Tune Whisper For Multilingual ASR with 🤗 Transformers

This Hugging Face blog post provides a practical guide for fine-tuning OpenAI's Whisper model for multilingual automatic speech recognition using the Transformers library. It covers dataset preparation, training configuration, and evaluation using the Word Error Rate metric. The post targets practitioners seeking to adapt Whisper to low-resource or domain-specific languages.

Open Weights Progress · Agent and Tool Ecosystem · Hugging Face Transformers · Hugging Face · Word Error Rate
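
The Word Error Rate metric named in the summary above can be illustrated with a minimal sketch. It assumes whitespace tokenization and no text normalization, and the function name `word_error_rate` is hypothetical; it is not taken from the blog post, whose own evaluation code may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count.

    Assumption: words are split on whitespace and compared exactly,
    with no casing or punctuation normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions are counted against the reference length, WER can exceed 1.0 when the hypothesis is much longer than the reference.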

4 · arXiv · cs.CL · 25d ago

Forgotten Words: Benchmarking NeoBERT for Dementia Detection in Low-Resource Conversational Filipino and English Speech

This paper presents the first NLP-based dementia detection study for Filipino speech, constructing a parallel bilingual dataset of 4,000 DementiaBank-derived transcripts with manual Filipino translations. Five model families are evaluated across monolingual, zero-shot cross-lingual, and bilingual fine-tuning settings. English-trained BERT degrades sharply on Filipino (Macro-F1 = 0.455), but bilingual fine-tuning recovers performance to Macro-F1 = 0.969–0.973 across all transformer models. The key finding is that multilingual clinical NLP performance is driven by linguistic coverage during training rather than model scale or architecture.

Evaluation and Benchmarking · TF-IDF + Logistic Regression · NeoBERT · DementiaBank

3 · arXiv · cs.CL · 25d ago

Thaka Wins KSAA-2026 Arabic Speech Diacritization Task with Regularized Fine-Tuning of CATT-Whisper

The Thaka team describes their winning system for Task 2 of the KSAA-2026 Shared Task on Arabic Speech Dictation with Automatic Diacritization, which requires producing fully diacritized Arabic text from speech audio and undiacritized transcripts. Their approach fine-tunes CATT-Whisper, a multimodal model combining a CATT text encoder with a frozen Whisper speech encoder, under severe data constraints (2,327 training samples, no external data). Key techniques include R-Drop consistency regularization, Optuna-optimized hyperparameters with high weight decay, Focal Loss, and Monte Carlo Dropout inference averaging over 200 stochastic forward passes across four checkpoints. The system achieves 23.26% WER on the primary metric, placing first among all participants.

Multimodal Progress · Optuna · Focal Loss · CATT
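
The Monte Carlo Dropout averaging mentioned in the summary above can be sketched on a toy classifier. Everything here is an illustrative assumption rather than the Thaka system: the linear model, the function names, the dropout rate, and the even split of passes across checkpoints (the summary does not say how the 200 passes are distributed over the four checkpoints).

```python
import math
import random
from typing import List, Sequence, Tuple

Matrix = Sequence[Sequence[float]]  # one weight row per class (toy model)


def softmax(logits: Sequence[float]) -> List[float]:
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def stochastic_forward(weights: Matrix, features: Sequence[float],
                       drop_p: float, rng: random.Random) -> List[float]:
    """One forward pass with dropout left ON at inference time."""
    keep = 1.0 - drop_p
    # Inverted dropout: zero each feature with probability drop_p,
    # rescale the survivors by 1 / keep.
    dropped = [f / keep if rng.random() < keep else 0.0 for f in features]
    logits = [sum(w * x for w, x in zip(row, dropped)) for row in weights]
    return softmax(logits)


def mc_dropout_predict(checkpoints: Sequence[Matrix], features: Sequence[float],
                       passes_per_checkpoint: int, drop_p: float = 0.1,
                       seed: int = 0) -> Tuple[List[float], int]:
    """Average class probabilities over all passes and all checkpoints."""
    rng = random.Random(seed)
    n_classes = len(checkpoints[0])
    totals = [0.0] * n_classes
    n_passes = 0
    for weights in checkpoints:
        for _ in range(passes_per_checkpoint):
            probs = stochastic_forward(weights, features, drop_p, rng)
            totals = [t + p for t, p in zip(totals, probs)]
            n_passes += 1
    mean_probs = [t / n_passes for t in totals]
    best_class = max(range(n_classes), key=mean_probs.__getitem__)
    return mean_probs, best_class
```

Calling `mc_dropout_predict(checkpoints, features, passes_per_checkpoint=50)` with four checkpoints gives 200 stochastic passes in total under the even-split assumption. Averaging probabilities before taking the argmax is one common choice; averaging logits or voting are alternatives.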

5 · Hugging Face Blog · 1mo ago

Speculative Decoding for 2x Faster Whisper Inference

Hugging Face demonstrates applying speculative decoding to OpenAI's Whisper speech recognition model, achieving approximately 2x inference speedup. The technique uses a smaller draft model to propose token sequences that the larger target model then verifies, reducing the number of full forward passes required. This post covers implementation details using the Hugging Face Transformers library and benchmarks the approach across different hardware configurations.

Inference Economics · Agent and Tool Ecosystem · speculative decoding · Hugging Face Transformers · Hugging Face
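
The draft-then-verify loop described in the summary above can be sketched with toy models. This is a minimal illustration of greedy speculative decoding and not the Transformers implementation: the "models" are plain functions over integer tokens, and all names (`speculative_decode`, `toy_target`, `toy_draft`, `draft_len`) are hypothetical.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]  # greedy next-token function


def greedy_decode(next_token: NextToken, prompt: List[int], max_new_tokens: int) -> List[int]:
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(next_token(tokens))
    return tokens


def speculative_decode(target_next: NextToken, draft_next: NextToken,
                       prompt: List[int], max_new_tokens: int,
                       draft_len: int = 4) -> Tuple[List[int], int]:
    """Return (tokens, verification_rounds) for greedy speculative decoding."""
    tokens = list(prompt)
    rounds = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. The small draft model proposes draft_len tokens autoregressively.
        proposal: List[int] = []
        for _ in range(draft_len):
            proposal.append(draft_next(tokens + proposal))

        # 2. The target model verifies the proposal. A real model scores all
        #    positions in one batched forward pass; this toy version calls
        #    the target per position but counts the round once.
        rounds += 1
        accepted: List[int] = []
        for token in proposal:
            expected = target_next(tokens + accepted)
            if token == expected:
                accepted.append(token)
            else:
                accepted.append(expected)  # keep the target's correction
                break
        else:
            # Every draft token matched, so the target's own next token
            # comes for free from the same verification pass.
            accepted.append(target_next(tokens + accepted))
        tokens.extend(accepted)
    return tokens[: len(prompt) + max_new_tokens], rounds


def toy_target(context: List[int]) -> int:
    return (context[-1] + 1) % 10


def toy_draft(context: List[int]) -> int:
    # Agrees with the target except after token 2, where it guesses wrong.
    return 9 if context[-1] == 2 else (context[-1] + 1) % 10
```

With greedy decoding the output matches what the target model alone would produce, since every mismatched draft token is replaced by the target's choice. The speedup depends on how often the draft agrees with the target.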

6 · arXiv · cs.AI · 12d ago

Sparse AutoEncoder steering reduces Whisper hallucination rate by up to ~5x without fine-tuning

Researchers investigate hallucination detection and mitigation in OpenAI's Whisper ASR model by probing internal encoder representations. They find that both raw activations and Sparse AutoEncoder (SAE) latents encode linearly separable hallucination signals concentrated in deeper layers. SAE-based activation steering reduces hallucination rates from 72.6% to 14.1% (Whisper small) and 86.9% to 27.3% (Whisper large-v3) on non-speech audio, with minimal WER degradation, approaching fine-tuning-level performance without weight updates.

Evaluation and Benchmarking · AI Safety Research · Sparse Autoencoder · OpenAI · Whisper
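
The reduction factor in the headline can be checked against the rates reported in the summary above. The short calculation below uses only those four percentages; the helper names are hypothetical. It gives roughly 5.1x for Whisper small and roughly 3.2x for Whisper large-v3, so the headline figure corresponds to the smaller model.

```python
def reduction_factor(before_pct: float, after_pct: float) -> float:
    """How many times lower the rate is after steering."""
    return before_pct / after_pct


def relative_reduction(before_pct: float, after_pct: float) -> float:
    """Fraction of the original rate that was removed."""
    return (before_pct - after_pct) / before_pct


# Hallucination rates on non-speech audio, as reported in the summary (percent).
reported = {
    "Whisper small": (72.6, 14.1),
    "Whisper large-v3": (86.9, 27.3),
}

for model, (before, after) in reported.items():
    print(
        f"{model}: {reduction_factor(before, after):.2f}x lower, "
        f"{relative_reduction(before, after):.1%} relative reduction"
    )
```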