paper
Which Speech Representation Better Matches Text-Native Reasoning? A Study of Speech-Text Alignment on Frame Rate and Representation
paper · active · provisional
which-speech-representation-better-matches-text-native-reasoning-a-study-of-speech-text-alignment-on-frame-rate-and-representation-ddf19a40 · 1 event · first seen 6d ago
Aliases: Which Speech Representation Better Matches Text-Native Reasoning? A Study of Speech-Text Alignment on Frame Rate and Representation
Co-occurring entities
More like this (12)
Acoustic Cue Alignment in Audio Language Models for Speech Emotion Recognition
Reasoning Language Models
Multi-Faceted Interactivity Alignment in Full-Duplex Speech Models
Does Reasoning Preserve Alignment? On the Trustworthiness of Large Reasoning Models
Speaker Group Encoding in Self-supervised Speech Recognition Models
Reasoning over Grammar: Can Synthetic Linguistic Reasoning Traces Enhance Low-Resource Machine Translation?
Contrastive-Difference CKA Reveals Concept-Specific Structural Alignment Across Language Model Architectures
The Neutral Mask: How RLHF Provides Shallow Alignment while Leaving Partisan Structure Intact in a Large Language Model
Leveraging Audio-LLMs to Filter Speech-to-Speech Training Data
Representational Similarity Analysis
Exploring Adversarial Robustness and Safety Alignment in Multilingual Multi-Modal Large Language Models
Speech-to-Speech
Recent events (1)
Study finds optimal speech token frame rate for aligning speech with text-native LLM reasoning
Researchers identify a temporal-granularity mismatch as a key cause of reasoning degradation in spoken dialogue models: for the same semantic content, a speech token sequence is far longer than its text counterpart, which dilutes the semantic density of each token. The paper introduces factorized FSQ (finite scalar quantization) and a non-autoregressive audio LM head to make low frame rates feasible, then sweeps frame rates from 50 Hz down to 2.08 Hz with a frozen LLM backbone. On speech QA tasks, the results show a consistent optimum at 4.17 Hz with intermediate-layer representation alignment.
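
To make the length mismatch concrete, the following is a minimal arithmetic sketch, not the paper's code. It assumes the reported rates of 4.17 Hz and 2.08 Hz correspond to dividing the 50 Hz base rate by 12 and 24 (the summary does not say how they are derived), and it uses a hypothetical text length of 30 tokens for a 10-second utterance. The function names are illustrative.

```python
import math

BASE_RATE_HZ = 50.0  # highest frame rate in the reported sweep


def frame_rate(downsample_factor: int, base_rate_hz: float = BASE_RATE_HZ) -> float:
    """Frame rate after downsampling the base rate by an integer factor.

    Assumption: the reported 4.17 Hz and 2.08 Hz are 50/12 and 50/24.
    """
    return base_rate_hz / downsample_factor


def num_speech_tokens(duration_s: float, rate_hz: float) -> int:
    """Number of speech frames (tokens) needed to cover an utterance."""
    return math.ceil(duration_s * rate_hz)


def length_ratio(duration_s: float, rate_hz: float, text_tokens: int) -> float:
    """Speech-to-text sequence length ratio for the same content."""
    return num_speech_tokens(duration_s, rate_hz) / text_tokens


if __name__ == "__main__":
    duration_s = 10.0   # hypothetical utterance length
    text_tokens = 30    # hypothetical text token count for that utterance
    for factor in (1, 12, 24):
        rate = frame_rate(factor)
        n = num_speech_tokens(duration_s, rate)
        ratio = length_ratio(duration_s, rate, text_tokens)
        print(f"{rate:6.2f} Hz -> {n:4d} speech tokens, ratio {ratio:.2f}")
```

Under these assumed numbers, the 50 Hz sequence is more than an order of magnitude longer than the text, while the rate near 4.17 Hz brings the two lengths to the same scale. This matches the direction of the mismatch described above, but the specific ratios depend entirely on the assumed text length.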