4 · arXiv cs.CL (Computation and Language) · 6d ago

Study finds lower bitrate discrete speech representations sufficient for generative spoken language modeling

Researchers investigate how segmentation width and cluster size affect speech resynthesis and continuation quality in Generative Spoken Language Models (GSLM), which train language models on discrete speech units without text. They find that intelligible, natural speech can be synthesized at lower bitrates than the standard baseline, and that continuation quality remains stable at reduced bitrates, suggesting conventional GSLM settings may be over-specified. The paper also notes that LLM-based evaluation metrics correlate better with human judgments than conventional metrics, but correlation remains low, pointing to a gap in automatic evaluation for speech generation.
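As a rough illustration of how the two knobs in the summary set bitrate, the sketch below computes an upper bound on the bitrate of a discrete unit stream: one unit per segment, each unit costing at most log2(cluster size) bits. The function name and the example settings are hypothetical and are not taken from the paper; the bound assumes all clusters are used equally often.

```python
import math

def unit_bitrate(segment_width_ms: float, num_clusters: int) -> float:
    """Upper-bound bitrate in bits per second for a discrete unit stream.

    Assumes one unit is emitted per segment and every cluster is equally
    likely, so each unit carries log2(num_clusters) bits.
    """
    units_per_second = 1000.0 / segment_width_ms
    return units_per_second * math.log2(num_clusters)

# Hypothetical settings, chosen only to show the trade-off.
fine = unit_bitrate(segment_width_ms=20, num_clusters=1024)
coarse = unit_bitrate(segment_width_ms=40, num_clusters=256)
print(f"fine:   {fine:.0f} bits/s")
print(f"coarse: {coarse:.0f} bits/s")
```

Widening the segments lowers the number of units per second, and shrinking the cluster count lowers the bits per unit, so both changes reduce the bitrate.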

Evaluation and Benchmarking · Multimodal Progress · generative language modeling · On the Effect of Segmentation Width and Cluster Size on Speech Resynthesis and Continuation in Generative Spoken Language Models · K-means

Related guides (2)

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder


Multimodal Progress · Topic guide

Multimodal Progress: How AI Learned to See, Hear, and Act


Related events (8)

5 · arXiv · cs.CL · 26d ago

Synthetic LLM-generated conversations improve ASR training for low-resource languages

Researchers propose a pipeline that uses LLMs to generate scenario-level dialogues and TTS to synthesize multi-speaker audio, creating simulated conversational training data for ASR systems. Evaluated on the Hungarian BEA-Dialogue benchmark, a model trained on 67 hours of real plus 636 hours of synthetic data outperforms a zero-shot model trained on 2,700 hours of real Hungarian speech. The study tests five LLM families under multiple budget and mixing configurations using a FastConformer-Large backbone, finding that generator choice and data composition significantly affect gains.

Evaluation and Benchmarking · FastConformer-Large · Efficient ASR Training with Conversations that Never Happened · BEA-Dialogue

5 · arXiv · cs.CL · 18d ago

Study finds optimal speech token frame rate for aligning speech with text-native LLM reasoning

Researchers identify a temporal-granularity mismatch as a key cause of reasoning degradation in spoken dialogue models: for the same semantic content, speech token sequences are far longer than the corresponding text, which dilutes per-token semantic density. The paper introduces factorized FSQ and a non-autoregressive audio LM head to enable low frame rates, then sweeps frame rates from 50 Hz down to 2.08 Hz under a frozen LLM backbone. Results show a consistent optimal regime at 4.17 Hz with intermediate-layer representation alignment for speech QA tasks.
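To make the granularity mismatch concrete, the sketch below compares how many speech tokens are emitted per text token at several frame rates. The text rate of 3.5 tokens per second of speech is a hypothetical placeholder, not a figure from the paper, and the function name is illustrative.

```python
def speech_tokens_per_text_token(frame_rate_hz: float,
                                 text_tokens_per_second: float) -> float:
    """Ratio of speech tokens to text tokens covering the same audio span."""
    return frame_rate_hz / text_tokens_per_second

# Hypothetical: text tokens produced per second of speech (assumption).
TEXT_TOKENS_PER_SECOND = 3.5

# Frame rates mentioned in the summary, plus one intermediate value.
for rate in (50.0, 12.5, 4.17, 2.08):
    ratio = speech_tokens_per_text_token(rate, TEXT_TOKENS_PER_SECOND)
    print(f"{rate:>5.2f} Hz -> {ratio:.2f} speech tokens per text token")
```

Under this assumption, the ratio approaches 1 only at frame rates of a few hertz, which is the range the sweep covers at its low end.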

Evaluation and Benchmarking · Multimodal Progress · Which Speech Representation Better Matches Text-Native Reasoning? A Study of Speech-Text Alignment on Frame Rate and Representation · factorized FSQ

6 · arXiv · cs.CL · 1mo ago

Statistical Re-evaluation of GSM-Symbolic Finds Benchmark Confounds and Overstated Reasoning Conclusions

A re-evaluation of the GSM-Symbolic benchmark (Mirzadeh et al., 2025) challenges its conclusion that LLMs lack genuine reasoning capabilities. Using Generalised Linear Mixed Models on 20 open-weight models, the authors find only half show statistically significant performance drops, and identify a previously unacknowledged distributional shift toward larger integers in GSM-Symbolic relative to GSM8K that accounts for significance in roughly half the remaining cases. After controlling for this confound, model-specific failure profiles emerge—including variable binding fragility, arithmetic limitations, and dual-task interference—suggesting the original blanket claims about LLM reasoning were both statistically premature and mechanistically misleading.

Frontier Model Releases · Evaluation and Benchmarking · Mirzadeh et al. 2025 · Generalised Linear Mixed Models · GSM-Symbolic

5 · arXiv · cs.AI · 13d ago

Controlled ablation reveals training artifact behind low frame rate degradation in neural audio codecs

A new arXiv preprint investigates why neural audio codecs degrade sharply at low frame rates (≤6.25 Hz), a property relevant to autoregressive speech synthesis where generation cost scales with sequence length. The authors reproduce a previously reported quality cliff at 6.25 Hz and show it stems from a suboptimal training configuration—fixed clip duration starves the decoder of inter-token context at low frame rates—rather than fundamental phonemic or codebook limits. After correcting the training setup, word error rate degrades smoothly down to 1.6 Hz, suggesting low frame rate codecs are more practically accessible than prior work implied.
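The "fixed clip duration" effect is simple arithmetic: a training clip of fixed length contains fewer tokens as the frame rate drops, so the decoder sees less inter-token context. The sketch below shows this with a hypothetical 2-second clip; the clip length, the token target, and both helper names are assumptions for illustration and do not describe the preprint's actual training configuration or its correction.

```python
def tokens_per_clip(clip_seconds: float, frame_rate_hz: float) -> float:
    """Number of codec tokens covering one training clip."""
    return clip_seconds * frame_rate_hz

def clip_seconds_for(target_tokens: float, frame_rate_hz: float) -> float:
    """Clip length needed to keep a fixed number of tokens per clip."""
    return target_tokens / frame_rate_hz

CLIP_SECONDS = 2.0     # hypothetical fixed clip duration
TARGET_TOKENS = 100.0  # hypothetical token budget per clip

for rate in (50.0, 12.5, 6.25, 1.6):
    n = tokens_per_clip(CLIP_SECONDS, rate)
    needed = clip_seconds_for(TARGET_TOKENS, rate)
    print(f"{rate:>5.2f} Hz: {n:6.1f} tokens in a {CLIP_SECONDS:.0f} s clip, "
          f"{needed:6.1f} s needed for {TARGET_TOKENS:.0f} tokens")
```

Holding the token count constant instead of the duration is one way to read the summary's point; the preprint may correct the setup differently.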

Inference Economics · Multimodal Progress · Probing Low Frame Rate Degradation in Neural Audio Codecs

5 · arXiv · cs.CL · 7d ago

Interleaved speech-text LMs implicitly transcribe speech in intermediate layers before predicting in text space

A new arXiv paper analyzes the internal mechanisms of interleaved speech-text language models using the logit lens, revealing that these models undergo an implicit transcription phase in intermediate layers where the text token of a spoken word becomes decodable despite no explicit speech recognition training. This transcription appears as a top candidate word for up to 77% of the data, after which the model predicts the next word in text space before converting back to speech. The findings illuminate how speech and text modalities interact in the latent space of SLMs and have implications for optimizing speech language model training.
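For readers unfamiliar with the logit lens, the sketch below shows the core idea on toy data: take the hidden state at each layer, project it through the model's unembedding matrix, and see which vocabulary item ranks highest. Everything here is hypothetical (the three-item vocabulary, the identity unembedding, the hand-picked hidden states), and the final normalization layer that real implementations usually apply before projection is omitted.

```python
def logit_lens(hidden_by_layer, unembedding, top_k=1):
    """Rank vocabulary ids per layer by projecting hidden states to logits.

    hidden_by_layer: list of d_model-length vectors, one per layer,
                     all for the same token position.
    unembedding:     d_model rows, each a vocab-length list of weights.
    Returns, for each layer, the top_k vocabulary ids (best first).
    """
    vocab_size = len(unembedding[0])
    ranked = []
    for h in hidden_by_layer:
        logits = [
            sum(h[d] * unembedding[d][v] for d in range(len(h)))
            for v in range(vocab_size)
        ]
        order = sorted(range(vocab_size), key=lambda v: logits[v], reverse=True)
        ranked.append(order[:top_k])
    return ranked

# Toy setup (all values are made up for illustration).
VOCAB = ["cat", "dog", "<speech_unit>"]
UNEMBEDDING = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
HIDDEN = [
    [0.1, 0.2, 0.9],  # early layer: still looks like a speech unit
    [0.8, 0.1, 0.1],  # middle layer: the spoken word becomes decodable
    [0.2, 0.7, 0.1],  # late layer: a different (next) word is predicted
]

top_ids = logit_lens(HIDDEN, UNEMBEDDING)
top_words = [[VOCAB[i] for i in layer] for layer in top_ids]
print(top_words)
```

The toy states are arranged to mimic the pattern the summary describes: a text token for the spoken word surfacing in a middle layer before the model moves on to predicting the next word.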

Evaluation and Benchmarking · Multimodal Progress · Interleaved Speech Language Models Latently Work In Text · logit lens

5 · arXiv · cs.LG · 1mo ago

Language Generation in the Limit with Bounded Memory: Characterization via Sperner's Theorem

This paper studies language generation in the limit under bounded memory constraints, extending classical learning theory to the generation setting. The authors characterize when memoryless generation is possible, derive minimax density bounds using Sperner's theorem and symmetric chain decompositions, and show that adaptively chosen memory outperforms sliding-window memory. They also revisit incremental identification in the limit, finding that exact identification fails for collections of three or more languages but an approximate relaxation is achievable for all finite collections.

Evaluation and Benchmarking · AI Safety Research · Sperner's Theorem · Language Generation in the Limit · Identification in the Limit

4 · arXiv · cs.CL · 17d ago

Audio-LLM-based data filtering for speech-to-speech translation via Rank-to-Distill

A new arXiv paper proposes using audio large language models to filter noisy training data for end-to-end speech-to-speech translation (S2ST). The authors introduce a two-stage Rank-to-Distill strategy: a lightweight ranker generates pseudo-labels from noisy speech pairs, which then supervise an audio-LLM to make keep/drop decisions directly from raw audio. Experiments on CVSS-C and SpeechMatrix benchmarks show up to +1.4 ASR-BLEU improvement over unfiltered baselines.

Evaluation and Benchmarking · Multimodal Progress · Leveraging Audio-LLMs to Filter Speech-to-Speech Training Data · SpeechMatrix · CVSS-C

6 · arXiv · cs.CL · 28d ago

Trajectory Analysis of Masked Diffusion LMs for Graph-to-Text Generation with Lambda-Scaled Structural Decoding

This paper presents the first systematic study of masked diffusion language models (MDLMs) for graph-to-text generation, analyzing the order in which tokens are unmasked during iterative decoding. The authors find MDLMs naturally unmask entities first, then relational/function words, then structural tokens—a pattern disrupted by supervised fine-tuning, which prematurely anchors structural tokens and causes hallucination or omission. They propose lambda-scaled structural decoding, a training-free inference-time fix that recovers +9.4 BLEU-4, and introduce Graph-LLaDA, which integrates a Graph Transformer encoder into LLaDA's decoding process. Cross-dataset evaluation on the LAGRANGE benchmark shows prior baselines overfit to dataset-specific patterns while MDLM-based approaches generalize better.

Frontier Model Releases · Evaluation and Benchmarking · BLEU-4 · Graph Transformer · Diffusion Language Models