paper
Interleaved Speech Language Models Latently Work In Text
paper · active · provisional
interleaved-speech-language-models-latently-work-in-text-9c14b352 · 1 event · first seen 40h ago
Aliases: Interleaved Speech Language Models Latently Work In Text
More like this (12)
Latent Context Language Models
Leveraging Audio-LLMs to Filter Speech-to-Speech Training Data
CapSpeech-TTS
Cross-Modal Masking for Robust Silent Speech Synthesis Using sEMG and Lipreading
Multi-Faceted Interactivity Alignment in Full-Duplex Speech Models
Reinforcement Learning for Language Models
Transformer Language Models
Which Speech Representation Better Matches Text-Native Reasoning? A Study of Speech-Text Alignment on Frame Rate and Representation
Speaker Group Encoding in Self-supervised Speech Recognition Models
Speech-to-Speech
Tapered Language Models
multi-turn language models
Recent events (1)
Interleaved speech-text LMs implicitly transcribe speech in intermediate layers before predicting in text space
A new arXiv paper uses the logit lens to analyze the internal mechanisms of interleaved speech-text language models. It finds that these models pass through an implicit transcription phase in intermediate layers, where the text token of a spoken word becomes decodable despite no explicit speech recognition training. The transcribed word appears as a top candidate for up to 77% of the data. After this phase, the model predicts the next word in text space before converting back to speech. The findings illuminate how the speech and text modalities interact in the latent space of speech language models (SLMs) and have implications for optimizing SLM training.
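
For readers unfamiliar with the logit lens, the following is a minimal sketch of the general technique, not the paper's code. It projects each layer's hidden state through the output (unembedding) matrix and reads off the top-ranked token per layer. The toy vocabulary, the identity unembedding matrix, the parameter-free layer norm, and the helper names `logit_lens` and `first_layer_in_top_k` are all assumptions made for illustration. The hidden states are hand-built so that the four layers mimic the reported pattern: speech unit, then the text token of the spoken word, then the next word in text, then a speech unit again.

```python
import numpy as np

def layer_norm(h, eps=1e-5):
    """Parameter-free layer norm over the last axis (stand-in for a model's final norm)."""
    mu = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    return (h - mu) / np.sqrt(var + eps)

def logit_lens(hidden_states, unembed, k=1):
    """Decode every layer's hidden state with the output projection.

    hidden_states: (num_layers, d_model) vectors at one sequence position.
    unembed:       (d_model, vocab_size) output projection matrix.
    Returns a (num_layers, k) array of top-k token ids per layer, best first.
    """
    logits = layer_norm(hidden_states) @ unembed
    # argsort is ascending, so reverse the last axis and keep the first k columns.
    return np.argsort(logits, axis=-1)[:, ::-1][:, :k]

def first_layer_in_top_k(top_ids, token_id):
    """Index of the first layer whose top-k contains token_id, or None."""
    hits = np.where((top_ids == token_id).any(axis=-1))[0]
    return int(hits[0]) if hits.size else None

# Hypothetical toy vocabulary: two speech units and two text tokens.
vocab = ["<speech_17>", "cat", "sat", "<speech_42>"]

# Assumption: identity unembedding, so dimension i of the hidden state maps to token i.
unembed = np.eye(4)

# Hand-built hidden states, one row per layer (illustrative, not measured data).
hidden_states = np.array([
    [1.0, 0.0, 0.0, 0.0],  # layer 0: still looks like the input speech unit
    [0.0, 1.0, 0.0, 0.0],  # layer 1: text token of the spoken word ("cat")
    [0.0, 0.0, 1.0, 0.0],  # layer 2: next word predicted in text space ("sat")
    [0.0, 0.0, 0.0, 1.0],  # layer 3: converted back to a speech unit
])

top1 = logit_lens(hidden_states, unembed, k=1)[:, 0]
decoded = [vocab[i] for i in top1]
transcription_layer = first_layer_in_top_k(top1[:, None], vocab.index("cat"))
print(decoded, transcription_layer)
```

On a real model, the hidden states would come from the model's per-layer outputs and the projection from its own final norm and output head. A statistic like the reported 77% would then be the fraction of spoken words whose text token reaches the top candidates at some intermediate layer.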