3 · arXiv cs.CL (Computation and Language) · Jun 15, 2026

Continual learning approach for disfluency-aware ASR with explicit disfluency tokens

A new arXiv preprint addresses the transcription of disfluent speech (hesitations, repetitions, fillers). ASR systems typically omit such markers, which discards information. The authors add explicit disfluency tokens to a pretrained ASR model and apply continual learning to adapt it across datasets with differing disfluency distributions while mitigating catastrophic forgetting. The work identifies a trade-off between learning disfluency markers and general ASR performance, and finds that the continual learning methods share a consistent cross-attention head mechanism.

Learning to Hear Hesitation: Continual Learning for Disfluency-Aware ASR
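To make the idea of explicit disfluency tokens concrete, here is a minimal sketch of how a verbatim transcript could be turned into a marker-annotated training target. The token names `<fil>` and `<rep>`, the filler list, and both function names are assumptions for illustration only; the preprint's actual token inventory and labeling rules are not given in this summary.

```python
from typing import List, Optional

# Hypothetical marker inventory; the paper's real tokens may differ.
FILLERS = {"uh", "um", "er"}
FILLER_TOKEN = "<fil>"
REPEAT_TOKEN = "<rep>"


def mark_disfluencies(words: List[str]) -> List[str]:
    """Replace fillers and immediate word repetitions with marker tokens."""
    out: List[str] = []
    prev: Optional[str] = None  # last non-filler word, lowercased
    for word in words:
        lowered = word.lower()
        if lowered in FILLERS:
            out.append(FILLER_TOKEN)
            continue  # a filler does not interrupt repetition detection
        if prev is not None and lowered == prev:
            out.append(REPEAT_TOKEN)
        else:
            out.append(word)
        prev = lowered
    return out


def strip_markers(tokens: List[str]) -> List[str]:
    """Drop the markers, giving the fluent transcript a typical ASR system emits."""
    return [t for t in tokens if t not in (FILLER_TOKEN, REPEAT_TOKEN)]


marked = mark_disfluencies("I uh I want want to go".split())
fluent = strip_markers(marked)
```

Stripping the markers is a one-way operation: the fluent form no longer records where the speaker hesitated or repeated, which is the information loss the summary refers to.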

Related events (8)

4 · arXiv · cs.CL · Jul 7, 2026

REDDIT framework corrects timestamp drift in autoregressive ASR without catastrophic forgetting

A new arXiv paper introduces REDDIT (Replay-based Distribution EDITing), a two-stage post-training framework that corrects timestamp drift in autoregressive ASR systems like Whisper across long non-speech spans. The method updates only 1.6% of model parameters and constructs correction supervision without human annotations, using VAD-trimmed speech with inserted non-speech gaps. On Whisper-tiny, long-gap mIoU improves from 38.7% to 95.0% and out-of-domain alignment error drops from 2752 ms to 223 ms, while preserving transcription quality that ordinary SFT decoder tuning catastrophically degrades.

Evaluation and Benchmarking · Reddit · REDDIT: Correcting Model-Generated Timestamp Drift in ASR without Forgetting via Replay-Based Distribution Editing · Whisper
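The REDDIT summary reports a long-gap mIoU and describes supervision built by inserting non-speech gaps into VAD-trimmed speech. The sketch below shows one plausible reading of both ideas. It assumes mIoU is the mean temporal intersection-over-union between predicted and reference segments, and that inserting a gap shifts every later reference timestamp by the gap duration; the paper's exact definitions may differ, and all function names are hypothetical.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_seconds, end_seconds)


def interval_iou(pred: Segment, ref: Segment) -> float:
    """Temporal IoU of two segments; 0.0 when they do not overlap."""
    inter = max(0.0, min(pred[1], ref[1]) - max(pred[0], ref[0]))
    union = (pred[1] - pred[0]) + (ref[1] - ref[0]) - inter
    return inter / union if union > 0 else 0.0


def mean_iou(preds: List[Segment], refs: List[Segment]) -> float:
    """Mean IoU over segments paired by position."""
    if len(preds) != len(refs):
        raise ValueError("preds and refs must have the same length")
    if not refs:
        return 0.0
    return sum(interval_iou(p, r) for p, r in zip(preds, refs)) / len(refs)


def insert_gap(segments: List[Segment], at_s: float, gap_s: float) -> List[Segment]:
    """Shift every segment that starts at or after at_s by gap_s seconds."""
    return [(s + gap_s, e + gap_s) if s >= at_s else (s, e) for s, e in segments]


# Two back-to-back words, then a 5 s silence inserted before the second one.
trimmed = [(0.0, 1.0), (1.0, 2.0)]
reference = insert_gap(trimmed, at_s=1.0, gap_s=5.0)

# A model that ignores the silence keeps predicting the trimmed timestamps.
drifted_score = mean_iou(trimmed, reference)
```

In this toy case the prediction that ignores the inserted silence matches the first segment and misses the second entirely, so the score halves. That is the kind of drift across long non-speech spans the summary describes.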

4 · arXiv · cs.CL · Jun 15, 2026

MoDiCoL: A modular continual learning dataset for diagnosing ASR robustness under distribution shift

Researchers introduce MoDiCoL, a benchmark dataset designed to evaluate automatic speech recognition robustness under co-occurring real-world distribution shifts including accents, recording conditions, speech impairments, and noise. Unlike existing benchmarks that isolate these factors, MoDiCoL enables controlled analysis across linguistic, speaker, and acoustic dimensions simultaneously. The paper also proposes a continual learning curriculum simulating incremental updates and evaluates three continual learning strategies for robustness acquisition and forgetting.

Evaluation and Benchmarking · MoDiCoL

5 · arXiv · cs.CL · Jun 16, 2026

ASRD: Training-free anchor-guided revocable decoding for diffusion LLMs improves accuracy and throughput

A new arXiv preprint introduces ASRD (Anchor Supervised Revocable Decoding), a training-free framework for improving decoding quality in diffusion large language models. The method addresses error propagation and local error reinforcement in revocable decoding by separating trusted 'anchor tokens' (identified via temporal consistency) from uncertain candidates, then applying anchor-guided generation and anchor-perturbed verification. Experiments on math and coding benchmarks show up to 6.4% accuracy improvement and 7.2× inference throughput gains over remasking baselines.

Inference Economics · ASRD · Follow the Latent Roadmap: Navigating Revocable Decoding for Diffusion LLMs with Anchor Tokens
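The ASRD summary says anchor tokens are identified via temporal consistency. One simple way to read that is sketched below: a position counts as an anchor when its predicted token has stayed the same over the last few denoising steps. The function name, the `window` parameter, and the use of `None` for still-masked positions are assumptions made for illustration, not the paper's criterion.

```python
from typing import List, Optional, Set


def find_anchors(history: List[List[Optional[str]]], window: int = 3) -> Set[int]:
    """Return positions whose prediction was identical over the last `window` steps.

    history[t][i] is the token predicted at position i on denoising step t,
    or None if that position was still masked (hypothetical convention).
    """
    if window < 1 or len(history) < window:
        return set()
    recent = history[-window:]
    first = recent[0]
    return {
        i
        for i in range(len(first))
        if first[i] is not None and all(step[i] == first[i] for step in recent)
    }


steps = [
    ["a", "b", "c"],
    ["a", "x", "c"],
    ["a", "x", "d"],
]
stable_short = find_anchors(steps, window=2)
stable_long = find_anchors(steps, window=3)
```

A longer window is stricter: a position that only recently settled passes the short window but not the long one. The remaining positions would be treated as uncertain candidates that can still be revised.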

3 · arXiv · cs.CL · Jul 7, 2026

Iterative pseudo-labeling approach improves Mandarin-English code-switching ASR

A new arXiv preprint introduces a three-phase iterative pseudo-labeling framework for code-switching automatic speech recognition (ASR), applied here to Mandarin-English mixing. The method generates pseudo-labels from unlabeled corpora, trains a bilingual model in two stages, and iteratively refines it, achieving Mix Error Rate reductions of 6.35% and 8.29% on the SEAME benchmark's devman and devsge subsets. This is the first application of iterative pseudo-labeling to code-switching ASR, addressing the chronic data scarcity problem in this domain.

Evaluation and Benchmarking · SEAME · Progressive Refinement: An Iterative Pseudo-Labeling Approach for Mandarin-English Code-Switching ASR

5 · arXiv · cs.LG · Jun 5, 2026

SARDI: Self-Augmenting Retrieval for Diffusion Language Models using lookahead tokens

Researchers introduce SARDI, a training-free RAG framework for discrete diffusion language models that repurposes discarded low-confidence tokens during denoising as lookahead signals to guide retrieval before output is finalized. The method is retriever-agnostic and applicable to any reasoning-capable discrete diffusion LM. Evaluated across five multi-hop QA benchmarks, SARDI outperforms training-free diffusion and autoregressive retrieval baselines at up to 8x higher throughput.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Self-Augmenting Retrieval for Diffusion Language Models · SARDI

3 · arXiv · cs.CL · Jul 20, 2026

Novel training criterion reduces shortcut reliance in L2 spoken English automated scoring systems

A new arXiv paper introduces a training criterion designed to reduce shortcut reliance in automated language proficiency assessment systems, targeting both audio-based and ASR-text-based scoring pipelines. The authors demonstrate that transformer-based auto-markers over-rely on exploitable features relative to human raters, creating malpractice opportunities for test-takers. The proposed criterion reduces this over-reliance, bringing model behavior closer to human reference correlations.

Evaluation and Benchmarking · Controlling Implicit Shortcut Reliance in L2 Spoken English Auto-markers

4 · arXiv · cs.CL · Jun 8, 2026

Acoustic cue alignment tokens improve speech emotion recognition in audio language models

Researchers study whether instruction-following audio language models (ALMs) use explicit acoustic cues in a grounded way when raw audio is already available. They derive six interpretable acoustic concept tokens from the eGeMAPS feature set and append them to text prompts, testing on FAU-Aibo and IEMOCAP benchmarks. Aligned tokens improve unweighted average recall while shuffled or corrupted tokens degrade performance, but models don't fully collapse under perturbation, indicating partial anchoring to the audio signal. The work offers a practical probing method for interpretability and robustness in affective computing with ALMs.

Evaluation and Benchmarking · Multimodal Progress · FAU-Aibo · Acoustic Cue Alignment in Audio Language Models for Speech Emotion Recognition · IEMOCAP
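The emotion-recognition results above are reported as unweighted average recall (UAR), the mean of per-class recall with every class weighted equally. A minimal sketch of the metric follows; the function name and the toy labels are illustrative and not taken from the paper.

```python
from collections import defaultdict
from typing import Dict, Sequence


def unweighted_average_recall(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Mean of per-class recall over the classes that appear in y_true."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one labelled example is required")
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        totals[truth] += 1
        if truth == pred:
            hits[truth] += 1
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


# Imbalanced toy set: a classifier that always answers "neutral".
labels = ["neutral"] * 8 + ["angry"] * 2
always_neutral = ["neutral"] * 10
uar = unweighted_average_recall(labels, always_neutral)
accuracy = sum(t == p for t, p in zip(labels, always_neutral)) / len(labels)
```

On this imbalanced toy set the majority-class predictor reaches 80% accuracy but only 50% UAR, which is why UAR is the usual choice for emotion corpora with skewed class distributions.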

3 · arXiv · cs.CL · Jun 30, 2026

Case study compares human and ASR system performance on Dutch dysarthric speech recognition

A new arXiv preprint compares human listeners against three off-the-shelf ASR systems (Whisper large-v3, Google Chirp 3, and Omnilingual) on recognizing continuous Dutch speech from a single speaker with severe dysarthria. Both humans and ASR systems exceeded 70% WER on average, confirming the extreme difficulty of dysarthric speech recognition. Fine-tuning on dysarthric speech substantially reduced WER, with personalized models outperforming human listeners, though WER remained above 23%. The study highlights the need for personalized ASR approaches for dysarthric speakers.

Omnilingual · Google Chirp 3 · Whisper large-v3
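The WER figures in the dysarthric-speech summary follow the standard definition: word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. A minimal sketch, assuming whitespace tokenization and no text normalization; the study's own scoring setup may differ, and the example sentences are made up.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


one_deletion = wer("de kat zit op de mat", "de kat zit op mat")
all_wrong = wer("a b", "x y z w")
```

Because insertions count as errors, WER is not capped at 100%: the second call above yields 2.0. Averages above 70%, as reported for both humans and off-the-shelf systems, therefore mean that most reference words were missed or replaced.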
