4 · arXiv cs.CL (Computation and Language) · 43h ago

CTC oracle gap anatomy: acoustic scoring saturates, linguistic MBR decoding recovers WER

A new arXiv paper systematically diagnoses why CTC-internal N-best rescoring fails to improve over greedy decoding on LibriSpeech, showing that blank-path proliferation causes a 53% degradation in rank correlation between CTC scores and WER as beam size grows. The authors demonstrate that the bottleneck is linguistic rather than acoustic: minimum Bayes risk (MBR) decoding with RoBERTa pseudo-log-likelihood achieves a 9% relative WER reduction on LibriSpeech test-other and generalizes across two architectures and three domains. The paper also analyzes why minimum word error rate (MWER) sequence-level fine-tuning fails at near-converged checkpoints, attributing the collapse to a vanishingly small training oracle gap.
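As a rough illustration of MBR decoding over an N-best list, the sketch below picks the hypothesis with the lowest expected word-level edit distance to the other hypotheses. It is not the paper's implementation: the `log_scores` values are hypothetical stand-ins for whatever linguistic score (such as a RoBERTa pseudo-log-likelihood) weights each hypothesis, and raw edit distance is used in place of a normalized WER.

```python
import math
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Word-level Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def softmax(scores: List[float]) -> List[float]:
    """Turn log-domain scores into normalized weights."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mbr_select(hyps: List[str], log_scores: List[float]) -> str:
    """Return the hypothesis minimizing expected edit distance.

    Each hypothesis in turn is treated as the candidate output, and every
    hypothesis in the list acts as a possible reference, weighted by its
    (assumed) posterior probability.
    """
    weights = softmax(log_scores)
    tokens = [h.split() for h in hyps]
    best_idx, best_risk = 0, float("inf")
    for i, cand in enumerate(tokens):
        risk = sum(w * edit_distance(other, cand)
                   for w, other in zip(weights, tokens))
        if risk < best_risk:
            best_idx, best_risk = i, risk
    return hyps[best_idx]


if __name__ == "__main__":
    nbest = ["the cat sat", "the cat sad", "a cat sat"]
    scores = [-1.0, -0.9, -1.2]  # hypothetical scores; index 1 scores highest
    print(mbr_select(nbest, scores))
```

In this toy list the second hypothesis has the highest score, but the first sits closest to the other two, so MBR selects it over the top-scoring candidate.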

Evaluation and Benchmarking · RoBERTa · LibriSpeech · The Anatomy of the CTC Oracle Gap: Acoustic Exhaustion and Linguistic Recovery · TED-LIUM 3 · Minimum Bayes Risk Decoding · VoxPopuli

Related guides (1)

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder


Related events (8)

4 · arXiv · cs.CL · 22d ago

SN-WER: Script-Normalized Word Error Rate for Multi-Script Indic ASR Evaluation

Researchers propose Script-Normalized WER (SN-WER), a training-free evaluation metric that transliterates ASR reference and hypothesis text into a canonical script before computing WER, addressing overestimation of errors caused by script mismatches in multilingual settings. Evaluated across 5 Indic languages, 2 datasets, and 3 ASR models, SN-WER reduces inflated model performance gaps by up to 12% on curated FLEURS data and attenuates romanization-induced WER inflation by 67% in controlled tests. The metric maintains near-identical sensitivity to genuine semantic errors (ΔSN-WER/ΔWER ≈ 1.09) and shows robustness to transliterator choice with token-collision rates below 0.1%. The authors recommend SN-WER as a companion metric to WER and CER, particularly for pipelines feeding downstream search, indexing, or multilingual LLM applications.
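The normalize-then-score idea can be sketched as follows. This is a minimal illustration, not the authors' code: the lookup table `TOY_TRANSLIT` and the function names are hypothetical, and a real pipeline would use a proper transliterator rather than a word-level dictionary.

```python
from typing import Dict, Sequence

# Hypothetical toy table mapping romanized tokens to Devanagari.
TOY_TRANSLIT: Dict[str, str] = {
    "namaste": "नमस्ते",
    "duniya": "दुनिया",
}


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Word-level Levenshtein distance."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def wer(ref: str, hyp: str) -> float:
    """Standard WER: edit distance divided by reference length."""
    ref_tokens = ref.split()
    if not ref_tokens:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_tokens, hyp.split()) / len(ref_tokens)


def normalize_script(text: str, table: Dict[str, str]) -> str:
    """Map each token to the canonical script when the table knows it."""
    return " ".join(table.get(tok, tok) for tok in text.split())


def sn_wer(ref: str, hyp: str, table: Dict[str, str]) -> float:
    """WER computed after both sides are mapped to one canonical script."""
    return wer(normalize_script(ref, table), normalize_script(hyp, table))


if __name__ == "__main__":
    ref = f"{TOY_TRANSLIT['namaste']} {TOY_TRANSLIT['duniya']}"
    hyp = f"namaste {TOY_TRANSLIT['duniya']}"  # first word is romanized
    print(wer(ref, hyp), sn_wer(ref, hyp, TOY_TRANSLIT))
```

Here plain WER counts the romanized word as a substitution, while the script-normalized score treats it as a match, which is the kind of script-mismatch inflation the summary describes.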

Evaluation and Benchmarking · Multimodal Progress · FLEURS · Common Voice · Character Error Rate

5 · arXiv · cs.AI · 8d ago

Controlled ablation reveals training artifact behind low frame rate degradation in neural audio codecs

A new arXiv preprint investigates why neural audio codecs degrade sharply at low frame rates (≤6.25 Hz), a property relevant to autoregressive speech synthesis where generation cost scales with sequence length. The authors reproduce a previously reported quality cliff at 6.25 Hz and show it stems from a suboptimal training configuration—fixed clip duration starves the decoder of inter-token context at low frame rates—rather than fundamental phonemic or codebook limits. After correcting the training setup, word error rate degrades smoothly down to 1.6 Hz, suggesting low frame rate codecs are more practically accessible than prior work implied.

Inference Economics · Multimodal Progress · Probing Low Frame Rate Degradation in Neural Audio Codecs

5 · arXiv · cs.CL · 16d ago

VSR models outperform humans on lipreading benchmarks but rely on language cues, not visual perception

A new arXiv paper compares three visual speech recognition (VSR) systems against human lipreaders on the MaFI dataset using word, character, phoneme, and viseme-level metrics. Despite higher overall accuracy, VSR models succeed and fail on different words than humans, and their errors are better explained by training word frequency than visual informativeness. A text-only n-gram baseline given minimal phoneme input rivals human performance, suggesting VSR systems primarily exploit language priors rather than genuine visual speech perception. The findings raise questions about whether benchmark-beating performance reflects the capability it purports to measure.

Evaluation and Benchmarking · Multimodal Progress · MaFI · The Lipreading Gap: Do VSR Models Perceive Visual Speech Like Human Lipreaders?

3 · arXiv · cs.CL · 29d ago

Thaka Wins KSAA-2026 Arabic Speech Diacritization Task with Regularized Fine-Tuning of CATT-Whisper

The Thaka team describes their winning system for Task 2 of the KSAA-2026 Shared Task on Arabic Speech Dictation with Automatic Diacritization, which requires producing fully diacritized Arabic text from speech audio and undiacritized transcripts. Their approach fine-tunes CATT-Whisper, a multimodal model combining a CATT text encoder with a frozen Whisper speech encoder, under severe data constraints (2,327 training samples, no external data). Key techniques include R-Drop consistency regularization, Optuna-optimized hyperparameters with high weight decay, Focal Loss, and Monte Carlo Dropout inference averaging over 200 stochastic forward passes across four checkpoints. The system achieves 23.26% WER on the primary metric, placing first among all participants.
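Two of the listed techniques are easy to show in isolation. The sketch below gives the standard Focal Loss formula for a single example and the averaging step used in Monte Carlo Dropout inference. It is an illustration under assumptions, not the team's code: the function names and the hard-coded probability vectors are hypothetical, and the stochastic forward passes themselves are not modeled.

```python
import math
from typing import List


def focal_loss(probs: List[float], target: int, gamma: float = 2.0) -> float:
    """Focal loss for one example: -(1 - p_t)^gamma * log(p_t).

    With gamma = 0 this reduces to ordinary cross-entropy; larger gamma
    down-weights examples the model already classifies confidently.
    """
    p_t = probs[target]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


def average_predictions(passes: List[List[float]]) -> List[float]:
    """Average class probabilities over several stochastic forward passes."""
    n = len(passes)
    return [sum(col) / n for col in zip(*passes)]


if __name__ == "__main__":
    easy = [0.9, 0.1]   # confident, correct prediction for class 0
    hard = [0.3, 0.7]   # poor prediction for class 0
    print(focal_loss(easy, 0), focal_loss(hard, 0))

    # Stand-ins for the outputs of three dropout-enabled forward passes.
    mc_passes = [[0.6, 0.4], [0.8, 0.2], [0.7, 0.3]]
    print(average_predictions(mc_passes))
```

The easy example contributes far less loss than the hard one, which is the intended effect when training data is scarce and dominated by frequent, easily predicted classes.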

Multimodal Progress · Optuna · Focal Loss · CATT

6 · arXiv · cs.CL · 29d ago

Semantic vs. Surface Noise in LLM Agents: 68-Cell Measurement Study with Held-Out Validation

This paper documents an empirical phenomenon across 10 LLMs from 7 architecture families: meaning-bearing perturbations (paraphrase, synonym substitution) cause final-answer inconsistency about 19.69 percentage points more often than presentation-level perturbations (formatting, reordering) of comparable severity, across the GSM8K, MATH, and HotpotQA benchmarks. The effect is validated on a held-out 11th model (Qwen2.5-14B-Instruct) with 1,800 trajectories. Trace-level analysis supports a 'stealth-divergence' picture in which semantic perturbations preserve the first action but induce divergence in intermediate reasoning steps, while two prior mechanism claims are explicitly retracted. The study is notable for its honest reporting of stress-test failures and its pre-registered replication.

Evaluation and Benchmarking · AI Safety Research · Qwen2.5-7B-Instruct-1M · ReAct · stealth-divergence

6 · arXiv · cs.CL · 9d ago

Computational audit finds ClinicalBERT amplifies demographic bias beyond training data distributions

Researchers present a systematic audit of representational bias in ClinicalBERT, a BERT-based model pretrained on MIMIC-III clinical discharge summaries, using two probing methodologies: Log Probability Bias Analysis and Masked Language Model probing across 98 clinical sentence templates and eight intersectional race-gender combinations. Of 32 statistically significant findings, 65.6% contradict observed corpus distributions, rising to 80% for Black patients and 87.5% for agency attribution under MLM probing. The key finding is that bias in ClinicalBERT operates predominantly through model-internal amplification rather than simple inheritance from training data, which has direct implications for clinical AI safety and deployment. This challenges the assumption that auditing training corpora is sufficient to characterize model bias.

Evaluation and Benchmarking · AI Safety Research · A Computational Audit of Demographic Association Encoding in ClinicalBERT Language Predictions · MIMIC-III · ClinicalBERT

4 · arXiv · cs.CL · 12d ago

Audio-LLM-based data filtering for speech-to-speech translation via Rank-to-Distill

A new arXiv paper proposes using audio large language models to filter noisy training data for end-to-end speech-to-speech translation (S2ST). The authors introduce a two-stage Rank-to-Distill strategy: a lightweight ranker generates pseudo-labels from noisy speech pairs, which then supervise an audio-LLM to make keep/drop decisions directly from raw audio. Experiments on CVSS-C and SpeechMatrix benchmarks show up to +1.4 ASR-BLEU improvement over unfiltered baselines.

Evaluation and Benchmarking · Multimodal Progress · Leveraging Audio-LLMs to Filter Speech-to-Speech Training Data · SpeechMatrix · CVSS-C

4 · arXiv · cs.CL · 2d ago

Variance-Calibrated Modulation (VCM): training-free decoding intervention to address LLM likelihood trap

Researchers propose Variance-Calibrated Modulation (VCM), a training-free pre-decoding method that reshapes LLM probability distributions before truncation to combat repetitive degeneration and vocabulary dullness. VCM combines two mechanisms: Contextual Searchlight via PMI (suppressing stopwords, elevating context-relevant tokens) and Adaptive Self-Debiasing (scale-invariant penalization using real-time logit standard deviation). Evaluated across open-ended generation, factual QA, and mathematical reasoning, VCM improves diversity, coherence, and reasoning accuracy at higher temperatures with negligible overhead. The method is compatible with existing decoding strategies like Top-p and Min-p.
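For context on the truncation step that VCM is said to run ahead of, the sketch below shows the commonly used definitions of Top-p and Min-p filtering on a toy distribution. It does not implement VCM itself, whose exact formulas are not given in this summary; the function names are illustrative.

```python
from typing import List


def _renormalize(probs: List[float], kept: List[int]) -> List[float]:
    """Zero out dropped tokens and rescale the kept ones to sum to 1."""
    total = sum(probs[i] for i in kept)
    keep = set(kept)
    return [p / total if i in keep else 0.0 for i, p in enumerate(probs)]


def top_p_filter(probs: List[float], p: float = 0.9) -> List[float]:
    """Keep the smallest set of top tokens whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return _renormalize(probs, kept)


def min_p_filter(probs: List[float], min_p: float = 0.1) -> List[float]:
    """Keep tokens whose probability is at least min_p times the maximum."""
    threshold = min_p * max(probs)
    kept = [i for i, q in enumerate(probs) if q >= threshold]
    return _renormalize(probs, kept)


if __name__ == "__main__":
    dist = [0.5, 0.3, 0.15, 0.05]
    print(top_p_filter(dist, p=0.9))      # drops only the last token
    print(min_p_filter(dist, min_p=0.4))  # keeps the first two tokens
```

A pre-decoding method in the sense described above would reshape `dist` before either filter is applied, leaving the filters themselves unchanged.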

Evaluation and Benchmarking · Inference Economics · Adaptive Self-Debiasing · Variance-Calibrated Modulation · Contextual Searchlight via PMI