Cross-attention attribution reveals how natural language instructions shape speech diffusion model outputs
Researchers adapt the DAAM cross-attention attribution framework to speech diffusion models for the first time, applying it to CapSpeech-TTS to analyze how individual caption tokens influence acoustic output. The study analyzes 3,600 style-caption/transcript combinations across 25 layers and 24 ODE steps, producing per-token heatmaps. Key findings include that style tokens exhibit lower temporal variance than content tokens, style attention correlates with F0 and energy, and style conditioning peaks in early diffusion steps and deep layers. This is the first interpretability study of natural language conditioning in speech diffusion models.
Related guides (2)
Related events (8)
AGDO: Attention-guided denoising and optimization framework improves diffusion language model reasoning
Researchers propose AGDO, a framework that replaces random masking in diffusion large language models (dLLMs) with attention-guided denoising order and token weighting during fine-tuning and reinforcement learning. The work is motivated by an empirical finding that tokens with stronger attention to unmasked context are more stable and critical for reasoning. Experiments on math and coding benchmarks show AGDO outperforms existing post-training methods for dLLMs, advancing the case for attention-aware training in parallel-decoding language models.
ADAS: Attention-Discounted Adaptive Sampler improves parallel decoding for masked diffusion language models
Researchers propose ADAS, a training-free reranking rule for masked diffusion language model decoding that addresses token interaction failures in parallel token commitment. The method greedily penalizes candidates that attend strongly to already-selected uncertain positions, using attention weights as soft marginal penalties rather than hard constraints. Evaluated on LLaDA-8B-Base and Dream-7B-Base across GSM8K, MATH500, HumanEval, and MBPP, ADAS improves low-NFE performance by 9–10 percentage points on average when plugged into existing samplers with only 3.1% runtime overhead.
Trajectory Analysis of Masked Diffusion LMs for Graph-to-Text Generation with Lambda-Scaled Structural Decoding
This paper presents the first systematic study of masked diffusion language models (MDLMs) for graph-to-text generation, analyzing the order in which tokens are unmasked during iterative decoding. The authors find MDLMs naturally unmask entities first, then relational/function words, then structural tokens—a pattern disrupted by supervised fine-tuning, which prematurely anchors structural tokens and causes hallucination or omission. They propose lambda-scaled structural decoding, a training-free inference-time fix that recovers +9.4 BLEU-4, and introduce Graph-LLaDA, which integrates a Graph Transformer encoder into LLaDA's decoding process. Cross-dataset evaluation on the LAGRANGE benchmark shows prior baselines overfit to dataset-specific patterns while MDLM-based approaches generalize better.
SimSD: Speculative Decoding Adapted for Diffusion Language Models
SimSD introduces a training-free speculative decoding algorithm for diffusion large language models (dLLMs), which previously could not use standard token-level speculative decoding due to their bidirectional attention and masked language modeling formulation. The method uses a plug-and-play masking strategy that introduces reference tokens from a draft model and a custom attention mask, enabling valid logit computation for drafted tokens in a single forward pass. Evaluated on SDAR-family dLLMs across four benchmarks, SimSD achieves up to 7.46x decoding throughput improvement while maintaining or improving generation quality. The approach is compatible with other acceleration techniques such as KV cache and blockwise decoding.
Representation-Conditioned Diffusion Models for Controllable Image Generation
This paper explores conditioning diffusion models on representations from pre-trained self-supervised models as an alternative to text prompts or semantic maps, which require large annotated datasets. The self-conditioning mechanism improves unconditional image generation quality and provides a controllable representation space. The authors identify directions of variation in this space and demonstrate smoothness and disentanglement properties, suggesting potential for fine-grained generative control without heavy annotation overhead.
DirectAudioEdit: Training-free, inversion-free text-guided audio editing via diffusion prediction contrast
Researchers introduce DirectAudioEdit, the first training-free and inversion-free method for text-guided audio editing using diffusion denoising dynamics. The approach constructs a source-to-target editing path without requiring DDPM inversion, reducing macro-averaged FAD and KL divergence by ~16% compared to inversion-based baselines while achieving up to 64.5% speedup. Experiments span music and event-level benchmarks across two backbone architectures.
Knowledge editing via locate-then-edit transferred to masked diffusion language models, revealing multi-token failure mode
A new arXiv paper investigates whether locate-then-edit knowledge editing methods, developed for autoregressive models, transfer to masked diffusion language models (MDMs) such as LLaDA and Dream. The authors find that causal tracing identifies the same early-to-mid-layer MLP location in both paradigms, but MDMs degrade systematically on multi-token edits due to partially unmasked intermediate states that the edit was never optimized for. A correction targeting these intermediate states substantially restores multi-token editing performance. The work is the first systematic comparison of knowledge editing across autoregressive and diffusion-based language model paradigms.
Acoustic cue alignment tokens improve speech emotion recognition in audio language models
Researchers study whether instruction-following audio language models (ALMs) use explicit acoustic cues in a grounded way when raw audio is already available. They derive six interpretable acoustic concept tokens from the eGeMAPS feature set and append them to text prompts, testing on FAU-Aibo and IEMOCAP benchmarks. Aligned tokens improve unweighted average recall while shuffled or corrupted tokens degrade performance, but models don't fully collapse under perturbation, indicating partial anchoring to the audio signal. The work offers a practical probing method for interpretability and robustness in affective computing with ALMs.

