How Do Instructions Shape Speech? Cross-Attention Attribution for Style-Captioned Text-to-Speech

Co-occurring entities

DAAM
CapSpeech-TTS

More like this (12)

Connecting Speech to Words through Images
CapSpeech-TTS
instructable text-to-speech
Block-Compositional Caption Supervision
Speech-to-Speech
FlowEdit: Associative Memory for Lifelong Pronunciation Adaptation in Flow-Matching TTS
Which Speech Representation Better Matches Text-Native Reasoning? A Study of Speech-Text Alignment on Frame Rate and Representation
Text Generation Inference
Explaining Attention with Program Synthesis
Modeling Complex Behaviors: Multi-Personality Composition and Dynamic Switching in Vision-Language Models
Decomposing Factual Sycophancy in Language Models: How Size and Instruction Tuning Shape Robustness
Visual Instruction Tuning Aligns Modalities through Abstraction

Recent events (1)

arXiv · cs.AI · 2d ago

Cross-attention attribution reveals how natural language instructions shape speech diffusion model outputs

Researchers adapt the DAAM cross-attention attribution framework to speech diffusion models, applying it to CapSpeech-TTS to analyze how individual caption tokens influence the acoustic output. The study covers 3,600 style-caption/transcript combinations across 25 layers and 24 ODE steps and produces per-token heatmaps. Key findings: style tokens exhibit lower temporal variance than content tokens; style attention correlates with F0 and energy; and style conditioning peaks in early diffusion steps and in deep layers. The work is presented as the first interpretability study of natural language conditioning in speech diffusion models.
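The summary does not say how attention is aggregated into heatmaps, so the following is only a minimal sketch. It assumes that cross-attention weights are available as a nested list indexed `[step][layer][head][frame][token]`, and that a per-token heatmap is the plain average over ODE steps, layers, and heads. The function names, the toy weights, and the choice of averaging are all hypothetical illustrations, not the paper's implementation. The sketch also shows one way the "temporal variance" comparison could be computed: the variance of each token's heatmap along the frame axis.

```python
from statistics import pvariance


def aggregate_heatmaps(attn):
    """Average cross-attention over steps, layers, and heads.

    attn[step][layer][head][frame][token] -> heatmaps[token][frame]
    (assumed layout; a real model would need hooks to collect these weights).
    """
    n_steps = len(attn)
    n_layers = len(attn[0])
    n_heads = len(attn[0][0])
    n_frames = len(attn[0][0][0])
    n_tokens = len(attn[0][0][0][0])

    heatmaps = [[0.0] * n_frames for _ in range(n_tokens)]
    for step in attn:
        for layer in step:
            for head in layer:
                for f, row in enumerate(head):
                    for t, weight in enumerate(row):
                        heatmaps[t][f] += weight

    scale = n_steps * n_layers * n_heads
    return [[value / scale for value in row] for row in heatmaps]


def temporal_variance(heatmaps):
    """Population variance of each token's heatmap along the frame axis."""
    return [pvariance(row) for row in heatmaps]


# Toy weights (made up): 4 audio frames attending over 3 caption tokens.
# Token 0 plays the role of a style token with flat attention over time;
# tokens 1 and 2 play content tokens whose attention is localized in time.
frame_rows = [
    [0.2, 0.7, 0.1],
    [0.2, 0.7, 0.1],
    [0.2, 0.1, 0.7],
    [0.2, 0.1, 0.7],
]
N_STEPS, N_LAYERS, N_HEADS = 2, 3, 2  # tiny stand-ins for 24 steps, 25 layers
attn = [
    [[frame_rows for _ in range(N_HEADS)] for _ in range(N_LAYERS)]
    for _ in range(N_STEPS)
]

heatmaps = aggregate_heatmaps(attn)
variances = temporal_variance(heatmaps)
print("temporal variance per token:", variances)
```

In this toy setup the flat "style" token ends up with a much smaller temporal variance than the localized "content" tokens, which is the kind of contrast the summary reports. A step-wise or layer-wise analysis would average over fewer axes instead of collapsing all three.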
