Which Speech Representation Better Matches Text-Native Reasoning? A Study of Speech-Text Alignment on Frame Rate and Representation

Co-occurring entities

factorized FSQ

Recent events (1)

arXiv · cs.CL · 6d ago

Study finds optimal speech token frame rate for aligning speech with text-native LLM reasoning

Researchers identify a temporal-granularity mismatch as a key cause of reasoning degradation in spoken dialogue models: for the same semantic content, speech token sequences are far longer than text token sequences, which dilutes per-token semantic density. The paper introduces factorized FSQ and a non-autoregressive audio LM head to enable low frame rates, then sweeps frame rates from 50 Hz down to 2.08 Hz with a frozen LLM backbone. Results show a consistent optimal regime at 4.17 Hz, combined with intermediate-layer representation alignment, for speech QA tasks.
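The length mismatch described above can be made concrete with some back-of-the-envelope arithmetic. The sketch below is illustrative only: it assumes one speech token per frame, and the speaking rate (`words_per_s`) and text tokenization ratio (`text_tokens_per_word`) are hypothetical round numbers, not values from the paper. Only the three frame rates come from the summary.

```python
# Illustrative arithmetic only. The frame rates are those quoted in the summary;
# the speaking rate and tokens-per-word defaults are ASSUMED round numbers.

FRAME_RATES_HZ = [50.0, 4.17, 2.08]  # sweep endpoints and the reported optimum


def speech_tokens(duration_s: float, frame_rate_hz: float) -> int:
    """Speech tokens for a clip, assuming one token per frame."""
    return round(duration_s * frame_rate_hz)


def length_ratio(
    duration_s: float,
    frame_rate_hz: float,
    words_per_s: float = 2.5,           # assumed conversational speaking rate
    text_tokens_per_word: float = 1.3,  # assumed subword tokens per word
) -> float:
    """Ratio of speech-token count to text-token count for the same utterance."""
    text_tokens = duration_s * words_per_s * text_tokens_per_word
    return speech_tokens(duration_s, frame_rate_hz) / text_tokens


if __name__ == "__main__":
    clip_s = 10.0
    for hz in FRAME_RATES_HZ:
        n = speech_tokens(clip_s, hz)
        r = length_ratio(clip_s, hz)
        print(f"{hz:>5.2f} Hz: {n:>3d} speech tokens, {r:.2f}x text length")
```

Under these assumed rates, a 10-second clip yields 500 speech tokens at 50 Hz but only 42 at 4.17 Hz and 21 at 2.08 Hz, so the lower rates bring the speech sequence to roughly the same length as the text. This only illustrates the scale of the mismatch; it does not by itself explain why 4.17 Hz performed best in the study.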
