5arXiv cs.CL (Computation and Language)·9d ago

Study finds optimal speech token frame rate for aligning speech with text-native LLM reasoning

Researchers identify a temporal-granularity mismatch as a key cause of reasoning degradation in spoken dialogue models: speech tokens are far longer than text under matched semantics, diluting per-token semantic density. The paper introduces factorized FSQ and a non-autoregressive audio LM head to enable low frame rates, then sweeps frame rates from 50Hz down to 2.08Hz under a frozen LLM backbone. Results show a consistent optimal regime at 4.17Hz with intermediate-layer representation alignment for speech QA tasks.

Evaluation and Benchmarking Multimodal Progress Which Speech Representation Better Matches Text-Native Reasoning? A Study of Speech-Text Alignment on Frame Rate and Representation factorized FSQ

Related guides (2)

Multimodal ProgressTopic guide

Multimodal Progress: How AI Learned to See, Hear, and Act

Read asBeginner In-depth

Evaluation and BenchmarkingTopic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Read asBeginner In-depth

Related events (8)

5arXiv · cs.AI·4d ago·source ↗

Controlled ablation reveals training artifact behind low frame rate degradation in neural audio codecs

A new arXiv preprint investigates why neural audio codecs degrade sharply at low frame rates (≤6.25 Hz), a property relevant to autoregressive speech synthesis where generation cost scales with sequence length. The authors reproduce a previously reported quality cliff at 6.25 Hz and show it stems from a suboptimal training configuration—fixed clip duration starves the decoder of inter-token context at low frame rates—rather than fundamental phonemic or codebook limits. After correcting the training setup, word error rate degrades smoothly down to 1.6 Hz, suggesting low frame rate codecs are more practically accessible than prior work implied.

Inference Economics Multimodal Progress Probing Low Frame Rate Degradation in Neural Audio Codecs

5arXiv · cs.CL·10d ago·source ↗

RL-based alignment improves interactivity in full-duplex spoken dialogue models

Researchers propose a post-training alignment method using reinforcement learning to improve interactivity in full-duplex spoken dialogue models, which can listen and speak simultaneously. The method addresses four canonical axes of interactivity—pause handling, turn-taking, backchanneling, and user interruption—each with axis-specific reward functions, plus an LLM-based reward to prevent semantic degradation. The approach is applied to two open-source models, Moshi and PersonaPlex, showing consistent improvements in both offline and real-time multi-turn evaluation.

Alignment and RLHF Multimodal Progress Multi-Faceted Interactivity Alignment in Full-Duplex Speech Models PersonaPlex Moshi

4arXiv · cs.CL·8d ago·source ↗

Audio-LLM-based data filtering for speech-to-speech translation via Rank-to-Distill

A new arXiv paper proposes using audio large language models to filter noisy training data for end-to-end speech-to-speech translation (S2ST). The authors introduce a two-stage Rank-to-Distill strategy: a lightweight ranker generates pseudo-labels from noisy speech pairs, which then supervise an audio-LLM to make keep/drop decisions directly from raw audio. Experiments on CVSS-C and SpeechMatrix benchmarks show up to +1.4 ASR-BLEU improvement over unfiltered baselines.

Evaluation and Benchmarking Multimodal Progress Leveraging Audio-LLMs to Filter Speech-to-Speech Training Data SpeechMatrix CVSS-C +1 more

4arXiv · cs.CL·12d ago·source ↗

Acoustic cue alignment tokens improve speech emotion recognition in audio language models

Researchers study whether instruction-following audio language models (ALMs) use explicit acoustic cues in a grounded way when raw audio is already available. They derive six interpretable acoustic concept tokens from the eGeMAPS feature set and append them to text prompts, testing on FAU-Aibo and IEMOCAP benchmarks. Aligned tokens improve unweighted average recall while shuffled or corrupted tokens degrade performance, but models don't fully collapse under perturbation, indicating partial anchoring to the audio signal. The work offers a practical probing method for interpretability and robustness in affective computing with ALMs.

Evaluation and Benchmarking Multimodal Progress FAU-Aibo Acoustic Cue Alignment in Audio Language Models for Speech Emotion Recognition IEMOCAP +1 more

5arXiv · cs.CL·17d ago·source ↗

Synthetic LLM-generated conversations improve ASR training for low-resource languages

Researchers propose a pipeline that uses LLMs to generate scenario-level dialogues and TTS to synthesize multi-speaker audio, creating simulated conversational training data for ASR systems. Evaluated on the Hungarian BEA-Dialogue benchmark, a model trained on 67 hours of real plus 636 hours of synthetic data outperforms a zero-shot model trained on 2,700 hours of real Hungarian speech. The study tests five LLM families under multiple budget and mixing configurations using a FastConformer-Large backbone, finding that generator choice and data composition significantly affect gains.

Evaluation and Benchmarking FastConformer-Large Efficient ASR Training with Conversations that Never Happened BEA-Dialogue

4arXiv · cs.CL·17d ago·source ↗

AlignAtt4LLM adapts simultaneous speech translation policy to decoder-only LLMs for IWSLT 2026

Researchers present AlignAtt4LLM, a simultaneous speech translation system for IWSLT 2026 covering English to German, Italian, and Chinese. The system cascades Qwen3-ASR for incremental transcription with Gemma-4 E4B-it for translation, applying a novel AlignAtt policy adapted for decoder-only LLMs that lack encoder-decoder cross-attention. Key contributions include explicit source span prompting, offline alignment head selection, and query/key capture to recover a usable attention-based read/write policy. The system outperforms IWSLT 2026 baselines for European language pairs in both low- and high-latency regimes.

Evaluation and Benchmarking Multimodal Progress Gemma-4 E4B-it IWSLT 2026 AlignAtt +2 more

6arXiv · cs.CL·5d ago·source ↗

BayLing-Duplex: Native full-duplex speech dialogue using a single autoregressive LLM

Researchers introduce BayLing-Duplex, a speech language model that achieves native full-duplex interaction — simultaneous listening and speaking — using a single autoregressive LLM with no auxiliary VAD or turn-taking module. Built by fine-tuning GLM-4-Voice on 400K samples plus a lightweight DPO stage, it reaches 92% turn-taking success and 100% interruption success on InstructS2S-Eval, and improves speech-response quality substantially over Moshi. The approach adds only special tokens to the standard vocabulary, making it portable across LLM architectures without architectural changes.

Frontier Model Releases Multimodal Progress BayLing-Duplex InstructS2S-Eval Direct Preference Optimization (DPO)+3 more

6arXiv · cs.CL·19d ago·source ↗

UniAudio-Token: Semantic Speech Tokenizer with General Audio Perception for Audio-LLMs

UniAudio-Token is a framework from Tencent that extends semantic speech tokenizers—commonly used as interfaces for Audio-LLMs—to support general audio perception without sacrificing speech quality. It introduces two mechanisms: Semantic-Acoustic Primitives (SAP) for structured supervision decomposing audio into linguistic, vocal, and auditory-scene components, and Semantic-Acoustic Equilibrium (SAE), a content-aware gating mechanism that restores fine-grained acoustic details from shallow layers. Evaluations show it outperforms all single-codebook baseline tokenizers on both understanding and generation tasks when integrated with downstream LLMs. Code, training/inference scripts, and model checkpoints are publicly released.

Agent and Tool Ecosystem Multimodal Progress Audio-LLM UniAudio-Token Tencent +2 more