Concept guide · Beginner

Diffusion Models: How AI Learns to Paint by Unpainting

TL;DR: Diffusion models are the engine behind today's AI image generators. They work by learning to reverse a process of adding noise, gradually sculpting a crisp picture from static. The technique has spread far beyond images into audio, video, and even scientific fields like materials discovery, and researchers are now racing to make it faster without sacrificing quality.

Key takeaways

DALL-E 2 (2022) was an early landmark: it used a diffusion decoder paired with CLIP text-image embeddings to generate images from text prompts.
The main practical drawback is speed: standard diffusion sampling takes many steps, while Consistency Models (OpenAI, 2023) cut that to one or two.
Continuous-time Consistency Models (sCMs, Oct 2024) matched leading diffusion quality in just two steps, addressing earlier instability in the approach.
Diffusion models are now used in scientific domains: a 2026 review covers their application to discovering new crystalline materials.
Active research is exposing failure modes — a 2026 paper shows popular posterior samplers can systematically hallucinate or mis-weight outputs, a useful diagnostic for practitioners.

What a diffusion model is

A diffusion model is a type of AI that learns to generate new things — images, audio, video, even molecular structures — by mastering the art of undoing destruction. Here's the core idea: take a real image and gradually bury it in random noise until it looks like TV static. Do this thousands of times with thousands of images. The AI watches all of it, and learns the reverse trick: given a noisy mess, what's the most likely cleaner version one step back? Chain enough of those steps together, and you can start from pure static and arrive at a sharp, coherent image.

This "learn to denoise" approach turns out to be surprisingly powerful and flexible — which is why it became the engine inside most of today's AI image generators.

Why you should care

If you've used an AI image tool in the last few years, you've almost certainly used a diffusion model under the hood. They're the reason you can type "a fox in a library, oil painting style" and get a convincing result in seconds. Beyond images, the same technique is being applied to audio, video, and — perhaps most surprisingly — science. A 2026 survey found diffusion models being used to help discover new crystalline materials, generating candidate structures for researchers to test in the lab.

How it works (the simple version)

Imagine a two-phase training process:

1. Forward pass (adding noise): Take a real image. Add a tiny bit of random noise. Add more. Keep going until the image is unrecognizable static. Record every step.
2. Reverse pass (learning to denoise): Train the AI to predict, at each step, what the slightly-less-noisy version looked like. After seeing millions of examples, it gets very good at this.
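
To make the noise-adding phase concrete, here is a minimal sketch in Python with NumPy. It assumes a DDPM-style linear noise schedule; the names `T`, `betas`, `alpha_bars`, and `add_noise` are illustrative choices rather than anything defined in this guide, and the 8x8 random array is only a stand-in for a real image. In this common formulation, the network is trained to guess the `noise` that was mixed in, given the noisy result.

```python
import numpy as np

# Assumed DDPM-style linear noise schedule; T and the beta range are illustrative.
T = 1000
betas = np.linspace(1e-4, 0.02, T)      # how much noise each step adds
alpha_bars = np.cumprod(1.0 - betas)    # squared weight of the original image left at step t


def add_noise(x0, t, rng):
    """Jump straight to noise level t by mixing the clean image with Gaussian noise."""
    noise = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * noise
    return xt, noise


rng = np.random.default_rng(0)
image = rng.uniform(-1.0, 1.0, size=(8, 8))  # stand-in for a real image, pixels in [-1, 1]

for t in (0, 250, 500, 999):
    noisy, _ = add_noise(image, t, rng)
    corr = np.corrcoef(image.ravel(), noisy.ravel())[0, 1]
    print(f"step {t:4d}: signal weight {np.sqrt(alpha_bars[t]):.3f}, "
          f"correlation with original {corr:+.2f}")
```

The signal weight shrinks toward zero as `t` grows, which is the "unrecognizable static" end of the process.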

At generation time, you skip the forward pass entirely. You hand the AI pure noise and let it run the reverse process — step by step, it sculpts something coherent out of nothing.
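
The generation loop can be sketched the same way. This is a toy under stated assumptions, not a real image generator: the "dataset" is just numbers drawn from a bell curve (`MU`, `SIGMA`), which lets `predict_noise` use an exact formula in place of a trained neural network. The schedule and the update rule follow the DDPM-style sampler; all names are hypothetical.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # assumed linear noise schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

MU, SIGMA = 3.0, 0.5                 # toy "dataset": numbers drawn from N(3, 0.5^2)


def predict_noise(xt, t):
    """Stand-in for the trained network: the exact best noise guess for Gaussian data."""
    var_t = alpha_bars[t] * SIGMA**2 + (1.0 - alpha_bars[t])
    return np.sqrt(1.0 - alpha_bars[t]) * (xt - np.sqrt(alpha_bars[t]) * MU) / var_t


def sample(n, rng):
    x = rng.standard_normal(n)                       # start from pure noise
    for t in reversed(range(T)):                     # walk the noise levels backwards
        eps_hat = predict_noise(x, t)
        # Remove the predicted noise for this step and rescale.
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        if t > 0:                                    # re-inject a little noise, except at the end
            x = x + np.sqrt(betas[t]) * rng.standard_normal(n)
    return x


samples = sample(5000, np.random.default_rng(0))
print(f"generated: mean {samples.mean():.2f}, std {samples.std():.2f} "
      f"(toy data: mean {MU}, std {SIGMA})")
```

Starting from nothing but random numbers, the loop ends up with values that look like the toy data. A real system does the same thing with images and a learned `predict_noise`.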

For text-to-image systems like DALL-E 2, there's an extra ingredient: a model called CLIP that translates your text prompt into a numerical "meaning vector." The diffusion process is then guided by that vector, steering the image toward what you described.
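
A rough way to picture a "meaning vector": CLIP is trained so that a caption and a matching image land close together in the same vector space. The sketch below uses made-up four-number vectors purely for illustration; real CLIP embeddings are much longer and come from trained encoders, and `cosine_similarity` is just a standard closeness measure, not a CLIP API.

```python
import numpy as np


def cosine_similarity(a, b):
    """Near 1.0 means the vectors point the same way; near 0.0 means they are orthogonal."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Invented embeddings for illustration only. Real CLIP vectors have
# hundreds of dimensions and come from trained text and image encoders.
text_fox_in_library = np.array([0.9, 0.8, 0.1, 0.0])
image_fox_reading = np.array([0.8, 0.9, 0.2, 0.1])
image_city_at_night = np.array([0.0, 0.1, 0.9, 0.8])

match = cosine_similarity(text_fox_in_library, image_fox_reading)
mismatch = cosine_similarity(text_fox_in_library, image_city_at_night)
print(f"prompt vs. matching image:  {match:.2f}")
print(f"prompt vs. unrelated image: {mismatch:.2f}")
```

Guidance then amounts to nudging each denoising step toward images whose vector sits close to the prompt's vector.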

The speed problem — and the fix

The main catch is that "step by step" can mean a lot of steps (sometimes dozens or hundreds), which makes generation slow. OpenAI introduced Consistency Models in 2023 to tackle this: instead of learning one denoising step at a time, the model learns to jump directly from any noisy version to the final clean image. That means you can generate in one or two steps, rather than a hundred.
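
The "jump directly" idea can be illustrated with a toy where the answer is known exactly. The sketch assumes the data are numbers drawn from a bell curve (`MU`, `SIGMA`) and that noise is added as clean value plus `t` times random noise, with `T_MAX` as an assumed largest noise level. For that special case the jump has a closed form along the smooth denoising path (the probability-flow trajectory); a real consistency model is a neural network trained to approximate such a map. All names are hypothetical.

```python
import numpy as np

MU, SIGMA = 3.0, 0.5      # toy "dataset": numbers drawn from N(3, 0.5^2)
T_MAX = 80.0              # assumed largest noise level


def consistency_fn(xt, t):
    """Map a noisy value at noise level t straight to its clean endpoint.

    Closed form for Gaussian toy data, where noisy = clean + t * noise.
    A real consistency model learns an approximation of this map.
    """
    return MU + (xt - MU) * SIGMA / np.sqrt(SIGMA**2 + t**2)


def point_on_trajectory(x0, t):
    """The noisy value at level t that lies on the same denoising path as x0."""
    return MU + (x0 - MU) * np.sqrt(SIGMA**2 + t**2) / SIGMA


# Every point along one path is sent to the same clean value.
x0 = 3.4
for t in (0.5, 5.0, 80.0):
    recovered = consistency_fn(point_on_trajectory(x0, t), t)
    print(f"noise level {t:5.1f} -> {recovered:.3f}")

# One-step generation: draw pure noise, apply the function once.
rng = np.random.default_rng(0)
one_step = consistency_fn(T_MAX * rng.standard_normal(5000), T_MAX)
print(f"one-step samples: mean {one_step.mean():.2f}, std {one_step.std():.2f}")
```

The key property is the first loop: no matter how noisy the starting point, the function lands on the same clean value, which is what makes single-step generation possible.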

A follow-up called continuous-time Consistency Models (sCMs), also from OpenAI, refined the approach further — achieving image quality comparable to the best multi-step diffusion models while needing only two sampling steps. Earlier versions of consistency models had training instability issues; sCMs addressed those directly.

Controlling what gets generated

Researchers are also working on finer-grained control over outputs. One line of work conditions the diffusion process on representations from self-supervised models (rather than text prompts), allowing smooth, disentangled control over image properties without needing large annotated datasets. Another approach uses Multimodal Large Language Models to jointly encode both a text description and a reference image, helping the model preserve a specific subject's identity across generated images.

Where things can go wrong

Diffusion models aren't infallible. A 2026 paper introduced a theoretical framework for understanding when diffusion-based "posterior samplers" — used in tasks like reconstructing medical images from incomplete data — fail. The finding: popular approximations can systematically over- or under-estimate uncertainty, leading to hallucinated features. This matters most in high-stakes domains like medical imaging, and the framework serves as a diagnostic tool for catching these issues.

A related technique, KLIP, uses diffusion model priors to flag when an input image is out-of-distribution — for example, detecting a liver tumor in a CT scan that looks unlike anything in the training data — without needing calibration data.
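
KLIP's score is described as a KL-divergence between the diffusion model's prior and the posterior. The sketch below is not the paper's algorithm; it only shows what a KL-divergence measures, using the textbook formula for two one-dimensional bell curves. The per-patch numbers, the direction of the divergence, and `THRESHOLD` are all invented for illustration.

```python
import math


def kl_gaussian(mean_p, std_p, mean_q, std_q):
    """KL(p || q) for two 1-D Gaussians, in nats."""
    return (math.log(std_q / std_p)
            + (std_p**2 + (mean_p - mean_q)**2) / (2.0 * std_q**2)
            - 0.5)


# Hypothetical per-patch summaries: what the prior expects vs. what the
# measurement-informed posterior says. All numbers are invented.
prior = (0.0, 1.0)
patches = {
    "healthy tissue": (0.1, 0.9),
    "unusual region": (2.5, 0.4),
}
THRESHOLD = 1.0  # hypothetical cut-off, not taken from the KLIP paper

scores = {name: kl_gaussian(m, s, *prior) for name, (m, s) in patches.items()}
for name, score in scores.items():
    flag = "possibly out-of-distribution" if score > THRESHOLD else "looks in-distribution"
    print(f"{name}: KL = {score:.2f} -> {flag}")
```

A large divergence means the data pulled the answer far from what the model expected, which is the intuition behind flagging a region as unfamiliar.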

The bigger picture

Diffusion models have moved from a research curiosity to production infrastructure in just a few years. The current frontier is threefold: making them faster (consistency models), making them more controllable (representation conditioning, multimodal guidance), and making them trustworthy in scientific and medical settings (failure-mode analysis, out-of-distribution detection). Each of these threads is active and moving quickly.

[Figure: How a diffusion model generates an image]

FAQ

What is a diffusion model, in plain terms?

Think of it like teaching an AI to restore a shredded photo. You show it thousands of examples of photos being progressively buried in static, then train it to run that process in reverse — starting from pure noise and gradually revealing a clean image.

Why does it take so many steps, and is that changing?

Each step removes a little noise, so quality traditionally required dozens or hundreds of passes. Consistency Models (2023) cut generation to one or two steps, and their continuous-time successors (sCMs, 2024) match leading diffusion quality in just two, making generation much faster.

Is this only useful for making pictures?

No — the same core idea applies to audio, video, and scientific problems. Researchers are using diffusion models to help discover new materials by generating candidate crystal structures.

Can diffusion models make mistakes or hallucinate?

Yes. A 2026 study showed that common methods for solving 'inverse problems' (like reconstructing a medical image) can systematically over- or under-estimate uncertainty, sometimes hallucinating features that aren't really there.

How does text-to-image work with diffusion?

A system like DALL-E 2 first converts your text prompt into a numerical 'meaning vector' using a model called CLIP, then uses a diffusion process to paint an image that matches that meaning — bridging language and pixels.

More on Diffusion Models (6)

arXiv · cs.LG

Representation-Conditioned Diffusion Models for Controllable Image Generation

This paper explores conditioning diffusion models on representations from pre-trained self-supervised models as an alternative to text prompts or semantic maps, which require large annotated datasets. The self-conditioning mechanism improves unconditional image generation quality and provides a controllable representation space. The authors identify directions of variation in this space and demonstrate smoothness and disentanglement properties, suggesting potential for fine-grained generative control without heavy annotation overhead.


arXiv · cs.LG

Finite-Sample Lens for Understanding Diffusion Posterior Sampler Failures

This paper introduces a finite-sample theoretical framework for analyzing diffusion model posterior samplers used in imaging inverse problems. The authors show that popular likelihood approximations at intermediate timesteps systematically under- or over-estimate posterior spread, leading to failure modes including sensitivity to early stopping, incorrect weighting of posterior modes, and hallucination of prior or likelihood modes. Crucially, they demonstrate these failures can arise from a multimodal prior alone, without requiring nonlinear measurement models or multimodal posteriors. The framework is model-agnostic and can serve as a diagnostic tool for evaluating existing and future posterior samplers.


arXiv · cs.LG

KLIP: Localized OOD Detection in Inverse Problems via KL-Divergence with Diffusion Priors

KLIP proposes an out-of-distribution detection metric for computational imaging that computes KL-divergence between a diffusion model prior and the posterior distribution. Unlike prior approaches, it requires no calibration data or knowledge of the shifted distribution, and can both flag whole images and localize OOD patches within images. The method is validated on medical imaging tasks such as detecting liver tumors in CT scans and generalizes across diffusion model architectures, datasets, and inverse problem types.


OpenAI Blog

Simplifying, Stabilizing, and Scaling Continuous-Time Consistency Models

OpenAI has published research advancing continuous-time consistency models (sCMs), achieving sample quality comparable to leading diffusion models while requiring only two sampling steps. The work addresses prior instability and complexity issues in consistency model training. This represents a significant efficiency improvement for generative image synthesis, potentially enabling faster inference pipelines.


OpenAI Blog

Consistency Models

OpenAI introduces Consistency Models, a new generative modeling framework designed to address the slow iterative sampling process inherent in diffusion models. The approach aims to enable faster single-step or few-step generation for image, audio, and video synthesis. The post appears to be a research announcement or blog summary of the underlying technique.


arXiv · cs.LG

Review: Generative Models, Multimodal Learning, and Closed-Loop Workflows in Inverse Materials Design

This arXiv review surveys recent advances in generative modeling for inverse materials design, covering variational autoencoders, normalizing flows, autoregressive models, and diffusion models applied to crystalline solid discovery. It examines how multimodal learning fuses crystal structures, thermodynamic data, spectroscopy, microscopy, and scientific text into transferable chemical-space representations. The paper also reviews closed-loop design pipelines integrating conditional generation with Bayesian optimization, reinforcement learning, and active learning, and identifies recurring failure modes including surrogate exploitation, diversity collapse, and the stability-synthesizability gap.
