3 · arXiv cs.CL (Computation and Language) · 3d ago

Case study compares human and ASR system performance on Dutch dysarthric speech recognition

A new arXiv preprint compares human listeners against three off-the-shelf ASR systems (Whisper large-v3, Google Chirp 3, and Omnilingual) on recognizing continuous Dutch speech from a single speaker with severe dysarthria. Both the human listeners and the ASR systems averaged above 70% word error rate (WER), confirming how difficult dysarthric speech recognition is. Fine-tuning on dysarthric speech substantially reduced WER, and personalized models outperformed the human listeners, though WER remained above 23%. The study highlights the need for personalized ASR approaches for dysarthric speakers.

Omnilingual Google Chirp 3 Whisper large-v3
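The WER figures quoted in this feed are word-level edit distances (substitutions, deletions, insertions) divided by the number of reference words, which is why WER can exceed 100%. The following is a minimal sketch of that standard definition, not the evaluation script of any paper listed here; the function name `wer` and the example strings are illustrative assumptions, and real evaluations usually normalize casing and punctuation first.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Illustrative strings only, not data from the study.
    print(wer("de kat zit op de mat", "de kat zit mat"))
    # Three inserted words against a two-word reference: WER above 1.0.
    print(wer("a b", "a b c d e"))  # 1.5
```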

Related events (8)

4 · arXiv · cs.CL · 2d ago

Personalized fine-tuning of Whisper achieves 9.7% WER on dysarthric speech

Researchers adapted Whisper to a single dysarthric speaker using up to 100.8 hours of read speech and user corrections collected via a mobile app, reducing word error rate from a high baseline to 9.7%. Fine-tuning outperformed LoRA adaptation and the Qwen3-ASR foundation model in this personalized setting. The study demonstrates that speaker-specific fine-tuning of foundation ASR models can reach practical deployment quality for dysarthric users.

LoRA Qwen3-ASR Whisper

5 · arXiv · cs.CL · 25d ago

VSR models outperform humans on lipreading benchmarks but rely on language cues, not visual perception

A new arXiv paper compares three visual speech recognition (VSR) systems against human lipreaders on the MaFI dataset using word, character, phoneme, and viseme-level metrics. Despite higher overall accuracy, VSR models succeed and fail on different words than humans, and their errors are better explained by training word frequency than visual informativeness. A text-only n-gram baseline given minimal phoneme input rivals human performance, suggesting VSR systems primarily exploit language priors rather than genuine visual speech perception. The findings raise questions about whether benchmark-beating performance reflects the capability it purports to measure.

Evaluation and Benchmarking Multimodal Progress MaFI The Lipreading Gap: Do VSR Models Perceive Visual Speech Like Human Lipreaders?

8 · OpenAI Blog · 1mo ago

Introducing Whisper

OpenAI introduced Whisper, an open-source automatic speech recognition (ASR) system trained on 680,000 hours of multilingual and multitask supervised data collected from the web. The model demonstrates strong robustness to accents, background noise, and technical language, approaching human-level accuracy in English transcription. Whisper supports transcription in multiple languages as well as translation to English, and the weights and inference code were released publicly.

Open Weights Progress Agent and Tool Ecosystem OpenAI Whisper

3 · arXiv · cs.CL · 15d ago

Speech-based dementia screening using Whisper embeddings to compensate for nonverbal subtest omissions

Researchers present a speech-based evaluation system for the German Syndrom-Kurz-Test dementia screening battery, combining transcript-derived scores with Whisper embeddings to reduce transcription scoring errors. The system also approximates expert overall ratings even when motor (nonverbal) subtests are omitted, addressing a key accessibility limitation of speech-only assessment. Models show strong correlation with expert ratings and effective discrimination between cognitive status groups.

Syndrom-Kurz-Test Whisper

4 · GitHub Trending · 1mo ago

FunASR: Industrial-Grade Speech Recognition Toolkit with 170x Realtime Performance

FunASR is an open-source speech recognition toolkit from ModelScope supporting 50+ languages, speaker diarization, emotion detection, and streaming inference at 170x realtime speed. It exposes an OpenAI-compatible API, positioning it as a drop-in alternative for production ASR workloads. The repository has accumulated 16,317 stars with modest daily momentum (+42 today).

Open Weights Progress Agent and Tool Ecosystem FunASR ModelScope OpenAI-compatible API

4 · Hugging Face Blog · 1mo ago

Fine-Tune Whisper For Multilingual ASR with 🤗 Transformers

This Hugging Face blog post provides a practical guide for fine-tuning OpenAI's Whisper model for multilingual automatic speech recognition using the Transformers library. It covers dataset preparation, training configuration, and evaluation using the Word Error Rate metric. The post targets practitioners seeking to adapt Whisper to low-resource or domain-specific languages.

Open Weights Progress Agent and Tool Ecosystem Hugging Face Transformers Hugging Face Word Error Rate

4 · arXiv · cs.CL · 10d ago

CTC oracle gap anatomy: acoustic scoring saturates, linguistic MBR decoding recovers WER

A new arXiv paper systematically diagnoses why CTC-internal N-best rescoring fails to improve over greedy decoding on LibriSpeech, showing that blank-path proliferation causes a 53% degradation in rank correlation between CTC scores and WER as beam size grows. The authors demonstrate that the bottleneck is linguistic rather than acoustic: MBR decoding with RoBERTa pseudo-log-likelihood achieves 9% relative WER reduction on LibriSpeech test-other and generalizes across two architectures and three domains. The paper also analyzes MWER sequence-level fine-tuning failure at near-converged checkpoints, attributing collapse to a vanishingly small training oracle gap.

Evaluation and Benchmarking RoBERTa LibriSpeech The Anatomy of the CTC Oracle Gap: Acoustic Exhaustion and Linguistic Recovery
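For readers new to minimum Bayes risk (MBR) decoding, the idea is to pick the N-best hypothesis with the lowest expected word-level edit distance to the other hypotheses, weighted by a probability for each hypothesis, rather than simply taking the top-scored one. The sketch below illustrates that selection rule only. It is not the paper's implementation: the summary says the weights come from RoBERTa pseudo-log-likelihood, whereas here `log_scores` are made-up numbers, and the names `mbr_select` and `word_edit_distance` are hypothetical.

```python
import math


def word_edit_distance(a: list[str], b: list[str]) -> int:
    """Levenshtein distance between two word sequences."""
    prev = list(range(len(b) + 1))
    for i, word_a in enumerate(a, start=1):
        curr = [i]
        for j, word_b in enumerate(b, start=1):
            cost = 0 if word_a == word_b else 1
            curr.append(min(prev[j - 1] + cost, prev[j] + 1, curr[j - 1] + 1))
        prev = curr
    return prev[-1]


def mbr_select(hypotheses: list[str], log_scores: list[float]) -> str:
    """Return the hypothesis with minimum expected word edit distance."""
    # Softmax-normalize the log scores into a distribution over hypotheses.
    top = max(log_scores)
    weights = [math.exp(s - top) for s in log_scores]
    total = sum(weights)
    probs = [w / total for w in weights]

    best, best_risk = hypotheses[0], float("inf")
    for candidate in hypotheses:
        # Expected loss of choosing `candidate` if each other hypothesis
        # were the true transcript with probability `p`.
        risk = sum(
            p * word_edit_distance(candidate.split(), other.split())
            for other, p in zip(hypotheses, probs)
        )
        if risk < best_risk:
            best, best_risk = candidate, risk
    return best


if __name__ == "__main__":
    # Hypothetical 3-best list; the second entry has the highest score.
    n_best = ["the cat sat", "the cat sat down", "a cat sat"]
    scores = [-1.0, -0.9, -1.1]
    print(mbr_select(n_best, scores))  # the cat sat
```

In this toy list the top-scored hypothesis is not selected, because the first hypothesis sits closer to both alternatives and so has lower expected error.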

3 · arXiv · cs.CL · 1mo ago

Thaka Wins KSAA-2026 Arabic Speech Diacritization Task with Regularized Fine-Tuning of CATT-Whisper

The Thaka team describes their winning system for Task 2 of the KSAA-2026 Shared Task on Arabic Speech Dictation with Automatic Diacritization, which requires producing fully diacritized Arabic text from speech audio and undiacritized transcripts. Their approach fine-tunes CATT-Whisper, a multimodal model combining a CATT text encoder with a frozen Whisper speech encoder, under severe data constraints (2,327 training samples, no external data). Key techniques include R-Drop consistency regularization, Optuna-optimized hyperparameters with high weight decay, Focal Loss, and Monte Carlo Dropout inference averaging over 200 stochastic forward passes across four checkpoints. The system achieves 23.26% WER on the primary metric, placing first among all participants.

Multimodal Progress Optuna Focal Loss CATT
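Monte Carlo Dropout inference, as mentioned in the summary above, keeps dropout active at prediction time and averages the output distributions of many stochastic forward passes, optionally across several checkpoints. The following is a toy sketch of that averaging step under loud assumptions: the "model" is a plain linear classifier rather than CATT-Whisper, `mc_dropout_predict` and its parameters are hypothetical names, and how the 200 passes are divided among the four checkpoints is not stated in the summary, so `passes` here simply means passes per checkpoint.

```python
import math
import random


def softmax(logits: list[float]) -> list[float]:
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def stochastic_forward(weights, features, drop_prob, rng):
    """One forward pass with inverted dropout applied to the input features."""
    keep = 1.0 - drop_prob
    dropped = [x / keep if rng.random() < keep else 0.0 for x in features]
    logits = [sum(w * x for w, x in zip(row, dropped)) for row in weights]
    return softmax(logits)


def mc_dropout_predict(checkpoints, features, passes=200, drop_prob=0.1, seed=0):
    """Average class probabilities over stochastic passes and checkpoints.

    `checkpoints` is a list of weight matrices (one row per class); each
    stands in for a saved model checkpoint in this toy setting.
    """
    rng = random.Random(seed)
    totals = [0.0] * len(checkpoints[0])
    count = 0
    for weights in checkpoints:
        for _ in range(passes):
            probs = stochastic_forward(weights, features, drop_prob, rng)
            totals = [t + p for t, p in zip(totals, probs)]
            count += 1
    return [t / count for t in totals]


if __name__ == "__main__":
    # Two hypothetical checkpoints of a two-class, two-feature classifier.
    toy_checkpoints = [
        [[2.0, 1.0], [0.0, 0.0]],
        [[1.0, 1.0], [0.0, 0.0]],
    ]
    print(mc_dropout_predict(toy_checkpoints, [1.0, 1.0], passes=50))
```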