AudioLDM 2, but faster ⚡️
Hugging Face published a blog post on AudioLDM 2, a latent diffusion model for audio generation, with a focus on inference speed improvements. The post likely covers integration into the Diffusers library and optimization techniques for faster audio synthesis. AudioLDM 2 supports text-to-audio, text-to-music, and text-to-speech generation tasks.
Related events (8)
Towards Speed-of-Light Text Generation with Nemotron-Labs Diffusion Language Models
In a Hugging Face blog post, NVIDIA's Nemotron-Labs introduces diffusion-based language models aimed at extremely fast text generation. The piece covers the use of diffusion processes for language modeling as an alternative to autoregressive generation, with a focus on inference speed. It continues the push by NVIDIA's research arm into non-autoregressive generation paradigms.
DirectAudioEdit: Training-free, inversion-free text-guided audio editing via diffusion prediction contrast
Researchers introduce DirectAudioEdit, the first training-free and inversion-free method for text-guided audio editing using diffusion denoising dynamics. The approach constructs a source-to-target editing path without requiring DDPM inversion, reducing macro-averaged FAD and KL divergence by ~16% compared to inversion-based baselines while achieving up to 64.5% speedup. Experiments span music and event-level benchmarks across two backbone architectures.
What's new in Diffusers? — Hugging Face Diffusers Library Second Month Update
Hugging Face published a blog post summarizing new features and updates added to the Diffusers library in its second month of development. The post covers new pipelines, model integrations, and tooling improvements for diffusion-based generative image models. This represents an early-stage ecosystem update for one of the primary open-source libraries supporting text-to-image and related diffusion model workflows.
Diffusers welcomes Stable Diffusion 3
Hugging Face's Diffusers library adds support for Stable Diffusion 3, enabling users to run Stability AI's latest text-to-image model through the standard Diffusers API. The post covers integration details, usage patterns, and memory optimization techniques for running SD3 locally. This marks the open-weights availability of SD3 through a major ML tooling ecosystem.
Stable Diffusion with 🧨 Diffusers
Hugging Face published a blog post introducing Stable Diffusion integration with their Diffusers library, covering the model's architecture and how to run it using the open-source tooling. The post appeared at the time of Stable Diffusion's public release in August 2022, marking a significant moment in accessible text-to-image generation. It served as both a technical introduction and a practical guide for the community to adopt the model.
Audio Interaction Model: Unified Streaming LALM with Always-On Perceive-Decide-Respond Loop
Researchers introduce the Audio Interaction Model framework and a concrete implementation called Audio-Interaction, a unified streaming Large Audio Language Model that handles both offline tasks and real-time audio interaction through a continuous perceive-decide-respond loop. The system is built on SoundFlow, a framework covering data construction, training, and asynchronous low-latency inference. The authors also release StreamAudio-2M, a 2.6M-item streaming corpus spanning 28 sub-tasks, and Proactive-Sound-Bench for evaluating proactive audio intervention. Evaluated across 8 benchmarks, the model preserves competitive offline performance while enabling real-time ASR, streaming instruction following, and proactive response capabilities not available in prior offline LALMs.
DeepMind announces DiffusionGemma with 4x faster text generation
DeepMind published a blog post introducing DiffusionGemma, a diffusion-based variant of the Gemma model family claiming 4x faster text generation. The announcement suggests a departure from standard autoregressive decoding in favor of diffusion-based generation. If the claims hold, this could represent a meaningful inference efficiency advance for the Gemma line.
LESS: Adaptive mutual-stability sampling cuts diffusion LLM decoding steps by 72%
Researchers introduce LESS, a training-free adaptive sampler for diffusion large language models that treats token commitment as an online stopping problem. The method uses a joint stability rule combining confidence, persistence, and distributional stability to decide when to unmask tokens, avoiding wasted computation on already-stable positions. Evaluated on Dream-7B, LLaDA-8B, and LLaDA-1.5-8B across seven benchmarks, LESS reduces reverse denoising steps by 72.1% versus fixed-budget decoding while improving accuracy over prior adaptive samplers. The step reductions translate directly to fewer Transformer forward passes and lower wall-clock latency.
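To make the idea of a joint stability rule concrete, here is a minimal sketch of how confidence, persistence, and distributional stability might be combined into a single commit decision for one masked position. It is not the LESS implementation: the function names, the thresholds (`conf_min`, `window`, `kl_max`), and the choice of KL divergence as the stability measure are all assumptions made for illustration.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists of floats."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def should_commit(history, conf_min=0.9, window=3, kl_max=0.01):
    """Decide whether a masked position looks stable enough to unmask.

    history: per-step probability distributions for one position, oldest first.
    All thresholds are hypothetical values chosen for this sketch.
    """
    if len(history) < max(window, 2):
        return False  # not enough denoising steps observed yet

    recent = history[-window:]
    top_tokens = [max(range(len(d)), key=d.__getitem__) for d in recent]

    confident = max(recent[-1]) >= conf_min           # confidence
    persistent = len(set(top_tokens)) == 1            # same argmax over the window
    settled = kl_divergence(history[-1], history[-2]) <= kl_max  # distribution stopped moving

    return confident and persistent and settled


if __name__ == "__main__":
    steps = [
        [0.50, 0.30, 0.20],
        [0.93, 0.05, 0.02],
        [0.94, 0.04, 0.02],
        [0.94, 0.04, 0.02],
    ]
    print(should_commit(steps))
```

In a sketch like this, positions that pass the check would be unmasked and skipped in later steps, which is where the saved forward passes would come from.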



