Interleaved Speech Language Models Latently Work In Text

Recent events (1)

arXiv · cs.CL · 40h ago

Interleaved speech-text LMs implicitly transcribe speech in intermediate layers before predicting in text space

A new arXiv paper uses the logit lens to analyze the internal mechanisms of interleaved speech-text language models. It finds that these models go through an implicit transcription phase in intermediate layers, where the text token of a spoken word becomes decodable even though the models receive no explicit speech recognition training. This transcription appears as a top candidate word for up to 77% of the data. The model then predicts the next word in text space before converting back to speech. The findings illuminate how the speech and text modalities interact in the latent space of speech language models (SLMs) and have implications for optimizing SLM training.
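
For readers unfamiliar with the logit lens, the sketch below shows the general idea on toy data: each layer's hidden state is normalized and projected through the output embedding, and the top-ranked tokens are read off per layer. This is an illustration only, not the paper's code. The function names (`logit_lens_topk`, `first_decodable_layer`), the parameter-free layer norm, and the 4-token one-hot vocabulary are all assumptions made for the example.

```python
import math

def layer_norm(vec, eps=1e-5):
    """Parameter-free layer norm; a stand-in for a model's final norm (assumption)."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def logit_lens_topk(hidden_states, unembed, k=5):
    """Return the top-k token ids decoded from each layer's hidden state.

    hidden_states: list of per-layer vectors, each of length d_model.
    unembed: list of vocab_size rows; row t is the output embedding of token t.
    """
    per_layer = []
    for h in hidden_states:
        h = layer_norm(h)
        # Logit of token t is the dot product of the hidden state with row t.
        logits = [sum(a * b for a, b in zip(h, row)) for row in unembed]
        ranked = sorted(range(len(logits)), key=lambda t: -logits[t])
        per_layer.append(ranked[:k])
    return per_layer

def first_decodable_layer(hidden_states, unembed, target_id, k=5):
    """Index of the first layer whose top-k contains target_id, else None."""
    for layer, top in enumerate(logit_lens_topk(hidden_states, unembed, k)):
        if target_id in top:
            return layer
    return None

# Toy data (hypothetical): 4-token vocabulary with one-hot output embeddings.
unembed = [[1.0 if i == t else 0.0 for i in range(4)] for t in range(4)]
hidden = [
    [5.0, 0.0, 0.0, 0.0],  # layer 0: closest to token 0
    [0.0, 0.0, 5.0, 0.0],  # layer 1: closest to token 2
    [0.0, 0.0, 5.0, 0.0],  # layer 2: closest to token 2
]

print(logit_lens_topk(hidden, unembed, k=1))           # [[0], [2], [2]]
print(first_decodable_layer(hidden, unembed, 2, k=1))  # 1
```

In an analysis like the one summarized above, `target_id` would be the text token of the spoken word, and the share of examples for which `first_decodable_layer` returns a layer index would correspond to a "decodable" percentage. How the paper defines its 77% figure exactly is not stated here.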