arXiv · cs.CL (Computation and Language) · 2d ago

Continual learning approach for disfluency-aware ASR with explicit disfluency tokens

A new arXiv preprint addresses the challenge of transcribing disfluent speech (hesitations, repetitions, fillers) in ASR systems, which typically omit such markers and thereby lose information. The authors introduce explicit disfluency tokens into a pretrained ASR model and apply continual learning to adapt across datasets with varying disfluency distributions while mitigating catastrophic forgetting. The work identifies a trade-off between learning disfluency markers and general ASR performance, and finds a consistent cross-attention head mechanism shared across continual learning methods.

Related events (8)

arXiv · cs.CL · 2d ago

MoDiCoL: A modular continual learning dataset for diagnosing ASR robustness under distribution shift

Researchers introduce MoDiCoL, a benchmark dataset designed to evaluate automatic speech recognition robustness under co-occurring real-world distribution shifts including accents, recording conditions, speech impairments, and noise. Unlike existing benchmarks that isolate these factors, MoDiCoL enables controlled analysis across linguistic, speaker, and acoustic dimensions simultaneously. The paper also proposes a continual learning curriculum simulating incremental updates and evaluates three continual learning strategies for robustness acquisition and forgetting.

arXiv · cs.CL · 29h ago

ASRD: Training-free anchor-guided revocable decoding for diffusion LLMs improves accuracy and throughput

A new arXiv preprint introduces ASRD (Anchor Supervised Revocable Decoding), a training-free framework for improving decoding quality in diffusion large language models. The method addresses error propagation and local error reinforcement in revocable decoding by separating trusted 'anchor tokens' (identified via temporal consistency) from uncertain candidates, then applying anchor-guided generation and anchor-perturbed verification. Experiments on math and coding benchmarks show up to 6.4% accuracy improvement and 7.2× inference throughput gains over remasking baselines.

arXiv · cs.LG · 12d ago

SARDI: Self-Augmenting Retrieval for Diffusion Language Models using lookahead tokens

Researchers introduce SARDI, a training-free RAG framework for discrete diffusion language models that repurposes discarded low-confidence tokens during denoising as lookahead signals to guide retrieval before output is finalized. The method is retriever-agnostic and applicable to any reasoning-capable discrete diffusion LM. Evaluated across five multi-hop QA benchmarks, SARDI outperforms training-free diffusion and autoregressive retrieval baselines at up to 8× higher throughput.

arXiv · cs.CL · 9d ago

Acoustic cue alignment tokens improve speech emotion recognition in audio language models

Researchers study whether instruction-following audio language models (ALMs) use explicit acoustic cues in a grounded way when raw audio is already available. They derive six interpretable acoustic concept tokens from the eGeMAPS feature set and append them to text prompts, testing on the FAU-Aibo and IEMOCAP benchmarks. Aligned tokens improve unweighted average recall, while shuffled or corrupted tokens degrade performance. The models do not fully collapse under perturbation, however, indicating partial anchoring to the audio signal. The work offers a practical probing method for interpretability and robustness in affective computing with ALMs.

arXiv · cs.AI · 2d ago

LEAF-X: Entropy-guided explainability framework for transformer-based ASR models

Researchers introduce LEAF-X (Listening with Entropy-guided Attention for Faithful explainability), a model-intrinsic XAI framework for transformer-based automatic speech recognition systems such as Whisper. The method combines entropy-guided attention weighting, multi-layer attention rollout, and optional causal ablations to produce sparse token-to-frame attributions. Evaluations show 32% improved faithfulness and 35–39% stronger locality/sparsity compared to perturbation-based explainers and raw attention maps, enabling more auditable ASR.

arXiv · cs.CL · 6d ago

Systematic study reveals effectiveness-fluency trade-offs in LLM conditioning methods

A new arXiv paper systematically evaluates a range of LLM conditioning methods across both concept injection and removal scenarios, finding that efficient steering methods often degrade fluency significantly. A key finding is that activation steering is substantially less effective on instruction-tuned models than on base models, a previously overlooked interaction. Simple prompting and supervised fine-tuning work for concept injection but not removal, and cheap textual metrics are found to correlate well with expensive LLM-as-judge evaluations.

arXiv · cs.LG · 22d ago

Self-Generated Replay Nearly Eliminates Catastrophic Forgetting in Language Models

This paper investigates catastrophic forgetting in language models during continual learning, finding that models can use self-generated samples from their own training distribution as effective replay data, nearly eliminating forgetting without requiring stored exemplars. The authors identify two key conditions under which forgetting persists: when models are pretrained near capacity saturation (leaving no room for new knowledge), and when low learning rates are used to reduce forgetting at the cost of requiring far more training steps. Self-generated replay breaks this trade-off between learning rate and forgetting, enabling fast, high-learning-rate finetuning without degradation on prior tasks.

arXiv · cs.CL · 7d ago

RL-based alignment improves interactivity in full-duplex spoken dialogue models

Researchers propose a post-training alignment method using reinforcement learning to improve interactivity in full-duplex spoken dialogue models, which can listen and speak simultaneously. The method addresses four canonical axes of interactivity (pause handling, turn-taking, backchanneling, and user interruption), each with an axis-specific reward function, plus an LLM-based reward to prevent semantic degradation. The approach is applied to two open-source models, Moshi and PersonaPlex, showing consistent improvements in both offline and real-time multi-turn evaluation.