4 · arXiv cs.CL (Computation and Language) · 12d ago

DirectAudioEdit: Training-free, inversion-free text-guided audio editing via diffusion prediction contrast

Researchers introduce DirectAudioEdit, the first training-free and inversion-free method for text-guided audio editing using diffusion denoising dynamics. The approach constructs a source-to-target editing path without requiring DDPM inversion, reducing macro-averaged FAD and KL divergence by ~16% compared to inversion-based baselines while achieving a speedup of up to 64.5%. Experiments span music and event-level benchmarks across two backbone architectures.

Multimodal Progress DirectAudioEdit DirectAudioEdit: Inversion-Free Text-Guided Audio Editing via Diffusion Prediction Contrast

Related guides (1)

Multimodal Progress · Topic guide

Multimodal Progress: How AI Learned to See, Hear, and Act


Related events (8)

4 · Hugging Face Blog · 1mo ago

AudioLDM 2, but faster ⚡️

Hugging Face published a blog post on AudioLDM 2, a latent diffusion model for audio generation, with a focus on inference speed improvements. The post likely covers integration into the Diffusers library and optimization techniques for faster audio synthesis. AudioLDM 2 supports text-to-audio, text-to-music, and text-to-speech generation tasks.

Inference Economics Agent and Tool Ecosystem latent diffusion model AudioLDM 2 Hugging Face +2 more

5 · arXiv · cs.AI · 19d ago

TunerDiT: Training-free Progressive Steering of Diffusion Transformers for Multi-Event Video Generation

TunerDiT is a training-free method for steering video diffusion transformers (DiTs) to generate long-horizon videos containing multiple sequential events. The approach identifies intrinsic turning points in the DiT denoising trajectory where text conditioning shifts from global layout to fine-grained detail, then applies two steering mechanisms: Event-Partitioned Masking and Cross-Event Prompt Fusion. The authors also introduce Meve, a benchmark prompt suite for multi-event video generation, and report state-of-the-art results across 8 metrics, along with improved scaling of text alignment as the event count grows.

Evaluation and Benchmarking Inference Economics Meve TunerDiT Event-Partitioned Masking +3 more

5 · arXiv · cs.CL · 9d ago

AGDO: Attention-guided denoising and optimization framework improves diffusion language model reasoning

Researchers propose AGDO, a framework that replaces random masking in diffusion large language models (dLLMs) with an attention-guided denoising order and attention-guided token weighting during fine-tuning and reinforcement learning. The work is motivated by an empirical finding that tokens with stronger attention to unmasked context are more stable and more critical for reasoning. Experiments on math and coding benchmarks show AGDO outperforms existing post-training methods for dLLMs, advancing the case for attention-aware training in parallel-decoding language models.

Alignment and RLHF AGDO Beyond Fully Random Masking: Attention-Guided Denoising and Optimization for Diffusion Language Models

5 · arXiv · cs.CL · 11d ago

ADAS: Attention-Discounted Adaptive Sampler improves parallel decoding for masked diffusion language models

Researchers propose ADAS, a training-free reranking rule for masked diffusion language model decoding that addresses token interaction failures in parallel token commitment. The method greedily penalizes candidates that attend strongly to already-selected uncertain positions, using attention weights as soft marginal penalties rather than hard constraints. Evaluated on LLaDA-8B-Base and Dream-7B-Base across GSM8K, MATH500, HumanEval, and MBPP, ADAS improves low-NFE performance by 9–10 percentage points on average when plugged into existing samplers with only 3.1% runtime overhead.

Frontier Model Releases Inference Economics LLaDA-8B-Base MATH500 EB-Sampler +6 more

5 · arXiv · cs.AI · 46h ago

FlowEdit: Lifelong pronunciation adaptation for flow-matching TTS via associative memory

FlowEdit is a new framework enabling lifelong pronunciation correction in frozen flow-matching text-to-speech systems without retraining model weights. Corrections are stored as token-level perturbations in text embedding space within a Modern Hopfield Network, retrieved at inference via soft attention with fuzzy morphological matching. On a curated benchmark of 312 multilingual proper nouns across 18 language families, the method reduces target-word Phoneme Error Rate by 92.7% relative to the zero-shot baseline, with each correction completing in ~15 seconds on a single GPU.

Inference Economics Enterprise Deployment Patterns Modern Hopfield Network FlowEdit FlowEdit: Associative Memory for Lifelong Pronunciation Adaptation in Flow-Matching TTS

5 · arXiv · cs.CL · 4d ago

ASRD: Training-free anchor-guided revocable decoding for diffusion LLMs improves accuracy and throughput

A new arXiv preprint introduces ASRD (Anchor Supervised Revocable Decoding), a training-free framework for improving decoding quality in diffusion large language models. The method addresses error propagation and local error reinforcement in revocable decoding by separating trusted 'anchor tokens' (identified via temporal consistency) from uncertain candidates, then applying anchor-guided generation and anchor-perturbed verification. Experiments on math and coding benchmarks show up to 6.4% accuracy improvement and 7.2× inference throughput gains over remasking baselines.

Inference Economics ASRD Follow the Latent Roadmap: Navigating Revocable Decoding for Diffusion LLMs with Anchor Tokens

5 · arXiv · cs.CL · 12d ago

MMAE: First comprehensive benchmark for instruction-based audio editing across 7 modalities

Researchers introduce MMAE, a 2,000-sample benchmark for evaluating general-purpose instruction-based audio editing systems, covering 7 audio modalities (sound, speech, music, and mixtures) and 6 levels of task complexity. The benchmark uses a rubric-based evaluation framework decomposing tasks into 17,741 verifiable criteria to assess instruction following and context consistency. Evaluation of leading models reveals severe limitations: Exact Match Rate falls below 5% overall and hits 0% on complex mixed-modality tasks, exposing fundamental gaps in current audio editing systems.

Evaluation and Benchmarking Multimodal Progress MMAE Gemini Omni Nano Banana 2

5 · arXiv · cs.CL · 17d ago

Knowledge editing via locate-then-edit transferred to masked diffusion language models, revealing multi-token failure mode

A new arXiv paper investigates whether locate-then-edit knowledge editing methods, developed for autoregressive models, transfer to masked diffusion language models (MDMs) such as LLaDA and Dream. The authors find that causal tracing identifies the same early-to-mid-layer MLP location in both paradigms, but MDMs degrade systematically on multi-token edits due to partially unmasked intermediate states that the edit was never optimized for. A correction targeting these intermediate states substantially restores multi-token editing performance. The work is the first systematic comparison of knowledge editing across autoregressive and diffusion-based language model paradigms.

Evaluation and Benchmarking Open Weights Progress Knowledge Editing in Masked Diffusion Language Models Qwen Llama +2 more