How Do Instructions Shape Speech? Cross-Attention Attribution for Style-Captioned Text-to-Speech

arXiv · cs.AI

Cross-attention attribution reveals how natural language instructions shape speech diffusion model outputs

Researchers adapt the DAAM cross-attention attribution framework to speech diffusion models for the first time, applying it to CapSpeech-TTS to analyze how individual caption tokens influence acoustic output. The study analyzes 3,600 style-caption/transcript combinations across 25 layers and 24 ODE steps, producing per-token heatmaps. Key findings include that style tokens exhibit lower temporal variance than content tokens, style attention correlates with F0 and energy, and style conditioning peaks in early diffusion steps and deep layers. This is the first interpretability study of natural language conditioning in speech diffusion models.
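The aggregation behind such per-token heatmaps can be illustrated with a minimal sketch. It is not the authors' code: it assumes the cross-attention weights have already been collected into one array of shape [steps, layers, heads, frames, tokens] with the same number of frames at every layer (so no rescaling is needed), and the function names `token_heatmaps` and `temporal_variance` are hypothetical. Summing over steps, layers, and heads follows the general DAAM idea; the variance measure is one plausible way to compare how flat or peaked a token's attention is over time, not necessarily the metric used in the paper.

```python
import numpy as np

def token_heatmaps(attn):
    """Sum cross-attention over steps, layers, and heads.

    attn: array of shape [steps, layers, heads, frames, tokens]
          (assumed layout, not taken from the paper).
    Returns an array of shape [tokens, frames], one heatmap per caption token.
    """
    attn = np.asarray(attn, dtype=np.float64)
    if attn.ndim != 5:
        raise ValueError("expected [steps, layers, heads, frames, tokens]")
    heat = attn.sum(axis=(0, 1, 2))  # -> [frames, tokens]
    return heat.T                    # -> [tokens, frames]

def temporal_variance(heatmaps):
    """Variance over frames of each token's heatmap, after normalizing
    each heatmap to sum to 1 so tokens with different total mass compare fairly."""
    totals = heatmaps.sum(axis=1, keepdims=True)
    normalized = heatmaps / np.maximum(totals, 1e-12)
    return normalized.var(axis=1)

# Synthetic example: 2 steps, 3 layers, 2 heads, 8 frames, 2 tokens.
attn = np.zeros((2, 3, 2, 8, 2))
attn[..., 0] = 0.5      # token 0: uniform attention across all frames
attn[..., 3, 1] = 1.0   # token 1: attention concentrated on frame 3

heat = token_heatmaps(attn)
var = temporal_variance(heat)
print(heat.shape, var)
```

In this synthetic input the uniformly attended token has lower temporal variance than the token attended at a single frame, which mirrors the direction of the reported style-versus-content contrast without reproducing any of the paper's data.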