4·Hugging Face Blog·1mo ago

AudioLDM 2, but faster ⚡️

Hugging Face published a blog post on speeding up inference for AudioLDM 2, a latent diffusion model for audio generation. The post covers the model's integration into the Diffusers library and optimization techniques for faster audio synthesis. AudioLDM 2 supports text-to-audio, text-to-music, and text-to-speech generation.

Inference Economics Agent and Tool Ecosystem Multimodal Progress latent diffusion model AudioLDM 2 Hugging Face Diffusers
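
As a rough illustration of what running AudioLDM 2 through Diffusers with speed-oriented settings could look like, the sketch below loads the pipeline in half precision, swaps in a multistep scheduler, and cuts the number of denoising steps. The summary above does not name the specific optimizations, so the choices here (float16, `DPMSolverMultistepScheduler`, 20 steps), the `cvssp/audioldm2` checkpoint, and the `FastAudioSettings` and `generate_audio` names are assumptions for illustration, not necessarily the blog post's recipe.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FastAudioSettings:
    """Hypothetical bundle of speed-related knobs (names are illustrative)."""
    model_id: str = "cvssp/audioldm2"  # assumed checkpoint on the Hugging Face Hub
    num_inference_steps: int = 20      # assumed; fewer steps trade quality for speed
    audio_length_in_s: float = 10.0    # length of the generated clip in seconds
    half_precision: bool = True        # use float16 when a GPU is available


def generate_audio(prompt: str,
                   settings: FastAudioSettings = FastAudioSettings(),
                   seed: int = 0):
    """Generate one waveform for `prompt`; requires torch and diffusers."""
    # Heavy imports are deferred so the settings can be inspected without them.
    import torch
    from diffusers import AudioLDM2Pipeline, DPMSolverMultistepScheduler

    device = "cuda" if torch.cuda.is_available() else "cpu"
    # float16 is only used on GPU; CPU inference stays in float32.
    use_fp16 = settings.half_precision and device == "cuda"
    dtype = torch.float16 if use_fp16 else torch.float32

    pipe = AudioLDM2Pipeline.from_pretrained(settings.model_id, torch_dtype=dtype)
    pipe = pipe.to(device)
    # Swap the default scheduler for a multistep solver that tolerates fewer steps.
    pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

    # A seeded generator makes the output reproducible across runs.
    generator = torch.Generator(device).manual_seed(seed)
    result = pipe(
        prompt,
        num_inference_steps=settings.num_inference_steps,
        audio_length_in_s=settings.audio_length_in_s,
        generator=generator,
    )
    return result.audios[0]  # first (and only) generated waveform
```

Calling `generate_audio("Techno music with a strong, upbeat tempo")` downloads the weights on first use and returns a waveform array. Timing the call with different `num_inference_steps` values is a simple way to see how much of the latency comes from the denoising loop.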

Related guides (4)

Hugging Face

Hugging Face: The Home of Open-Source AI

Multimodal Progress·Topic guide

Multimodal Progress: How AI Learned to See, Hear, and Act

Agent and Tool Ecosystem·Topic guide

Agent and Tool Ecosystem: How the Infrastructure Layer Around LLMs Is Consolidating

Inference Economics·Topic guide

Inference Economics: The Cost Structure of Running AI Models in Production

Related events (8)

5·Hugging Face Blog·28d ago

Towards Speed-of-Light Text Generation with Nemotron-Labs Diffusion Language Models

NVIDIA's Nemotron-Labs introduces diffusion-based language models targeting extremely fast text generation, published as a Hugging Face blog post. The piece covers the approach of using diffusion processes for language modeling as an alternative to autoregressive generation, with a focus on inference speed. This represents a continued push by NVIDIA's research arm into non-autoregressive generation paradigms.

Frontier Model Releases Inference Economics Diffusion Language Models NVIDIA Hugging Face +3 more

4·arXiv cs.CL·12d ago

DirectAudioEdit: Training-free, inversion-free text-guided audio editing via diffusion prediction contrast

Researchers introduce DirectAudioEdit, the first training-free and inversion-free method for text-guided audio editing using diffusion denoising dynamics. The approach constructs a source-to-target editing path without requiring DDPM inversion, reducing macro-averaged FAD and KL divergence by ~16% compared to inversion-based baselines while achieving up to 64.5% speedup. Experiments span music and event-level benchmarks across two backbone architectures.

Multimodal Progress DirectAudioEdit DirectAudioEdit: Inversion-Free Text-Guided Audio Editing via Diffusion Prediction Contrast

4·Hugging Face Blog·1mo ago

What's new in Diffusers? — Hugging Face Diffusers Library Second Month Update

Hugging Face published a blog post summarizing new features and updates added to the Diffusers library in its second month of development. The post covers new pipelines, model integrations, and tooling improvements for diffusion-based generative image models. This represents an early-stage ecosystem update for one of the primary open-source libraries supporting text-to-image and related diffusion model workflows.

Agent and Tool Ecosystem Multimodal Progress Hugging Face Diffusers

6·Hugging Face Blog·1mo ago

Diffusers welcomes Stable Diffusion 3

Hugging Face's Diffusers library adds support for Stable Diffusion 3, enabling users to run Stability AI's latest text-to-image model through the standard Diffusers API. The post covers integration details, usage patterns, and memory optimization techniques for running SD3 locally. This marks the open-weights availability of SD3 through a major ML tooling ecosystem.

Open Weights Progress Agent and Tool Ecosystem Stable Diffusion 3 Hugging Face Stability AI +2 more

7·Hugging Face Blog·1mo ago

Stable Diffusion with 🧨 Diffusers

Hugging Face published a blog post introducing Stable Diffusion integration with their Diffusers library, covering the model's architecture and how to run it using the open-source tooling. The post appeared at the time of Stable Diffusion's public release in August 2022, marking a significant moment in accessible text-to-image generation. It served as both a technical introduction and a practical guide for the community to adopt the model.

Open Weights Progress Agent and Tool Ecosystem Stable Diffusion 3 Hugging Face Stability AI +2 more

6·arXiv cs.AI·16d ago

Audio Interaction Model: Unified Streaming LALM with Always-On Perceive-Decide-Respond Loop

Researchers introduce the Audio Interaction Model framework and a concrete implementation called Audio-Interaction, a unified streaming Large Audio Language Model that handles both offline tasks and real-time audio interaction through a continuous perceive-decide-respond loop. The system is built on SoundFlow, a framework covering data construction, training, and asynchronous low-latency inference. The authors also release StreamAudio-2M, a 2.6M-item streaming corpus spanning 28 sub-tasks, and Proactive-Sound-Bench for evaluating proactive audio intervention. Evaluated across 8 benchmarks, the model preserves competitive offline performance while enabling real-time ASR, streaming instruction following, and proactive response capabilities not available in prior offline LALMs.

Frontier Model Releases Multimodal Progress Proactive-Sound-Bench Audio Interaction Model StreamAudio-2M +1 more

7·Google DeepMind Blog·10d ago

DeepMind announces DiffusionGemma with 4x faster text generation

DeepMind published a blog post introducing DiffusionGemma, a diffusion-based variant of the Gemma model family claiming 4x faster text generation. The announcement suggests a departure from standard autoregressive decoding in favor of diffusion-based generation. If the claims hold, this could represent a meaningful inference efficiency advance for the Gemma line.

Frontier Model Releases Inference Economics DiffusionGemma Gemma Google DeepMind

5·arXiv cs.CL·4d ago

LESS: Adaptive mutual-stability sampling cuts diffusion LLM decoding steps by 72%

Researchers introduce LESS, a training-free adaptive sampler for diffusion large language models that treats token commitment as an online stopping problem. The method uses a joint stability rule combining confidence, persistence, and distributional stability to decide when to unmask tokens, avoiding wasted computation on already-stable positions. Evaluated on Dream-7B, LLaDA-8B, and LLaDA-1.5-8B across seven benchmarks, LESS reduces reverse denoising steps by 72.1% versus fixed-budget decoding while improving accuracy over prior adaptive samplers. The step reductions translate directly to fewer Transformer forward passes and lower wall-clock latency.

Frontier Model Releases Inference Economics LESS: Mutual-Stability Sampling for Diffusion Language Models Jensen-Shannon divergence LLaDA-1.5-8B +2 more
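
The speed claims in these summaries use different units (a percentage of denoising steps removed for LESS, a multiplier for DiffusionGemma), so a small conversion helps compare them. The sketch below assumes that every denoising step costs the same and that the sampler itself adds no overhead, neither of which the summaries state; the function names are illustrative.

```python
def speedup_from_reduction(reduction: float) -> float:
    """Convert a fractional cut in steps (0.721 for 72.1%) to a speedup multiplier."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    # Removing a fraction r of equal-cost steps leaves (1 - r) of the work.
    return 1.0 / (1.0 - reduction)


def reduction_from_speedup(speedup: float) -> float:
    """Convert a speedup multiplier (4.0 for '4x faster') to the fraction of work removed."""
    if speedup < 1.0:
        raise ValueError("speedup must be at least 1")
    return 1.0 - 1.0 / speedup


if __name__ == "__main__":
    # LESS: 72.1% fewer reverse denoising steps than fixed-budget decoding.
    print(f"LESS: {speedup_from_reduction(0.721):.2f}x")           # LESS: 3.58x
    # DiffusionGemma: claimed 4x faster text generation.
    print(f"DiffusionGemma: {reduction_from_speedup(4.0):.0%}")    # DiffusionGemma: 75%
```

Under these assumptions the LESS step reduction corresponds to roughly a 3.6x ideal speedup, which is in the same range as the 4x DiffusionGemma claim. The baselines differ (fixed-budget diffusion decoding in one case, presumably autoregressive decoding in the other), so the two figures are not directly comparable.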