DDPM - Review | Hun Tae Kim

This time, I’m looking at “Denoising Diffusion Probabilistic Models”, often referred to as DDPM. The paper presented impressive image generation results using diffusion models, a class of models inspired by non-equilibrium thermodynamics.

The core idea is quite interesting: train a model to reverse a gradual noising process. You start with structured data (like an image), progressively add Gaussian noise over many steps until only noise remains, and then train a neural network to reverse this process, starting from noise and gradually denoising it step-by-step to generate a sample.

Key concepts

Forward process (diffusion)

This is a fixed, non-learned Markov chain defined by a variance schedule β_1, ..., β_T. At each step t, the state from step t-1 is scaled by √(1-β_t) and Gaussian noise with variance β_t is added. This gradually corrupts the input data until, after T steps, it is very close to a standard Gaussian. A neat property is that you can sample the noisy state x_t directly from the original data x_0 in closed form. With α_t = 1 - β_t and ᾱ_t the cumulative product of α_1 through α_t (a product of the α values, not of the variances themselves), x_t = √ᾱ_t x_0 + √(1-ᾱ_t) ε with ε ~ N(0, I), so no iteration over the intermediate steps is needed.
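Here is a minimal sketch of that closed-form sampling in plain Python. It assumes the linear schedule from the paper's experiments (β from 1e-4 to 0.02, T = 1000) and uses a flat list of floats as a toy "image". The function names `linear_betas`, `alpha_bars`, and `q_sample` are my own, not from the paper.

```python
import math
import random

def linear_betas(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced variances beta_1..beta_T (assumed schedule)."""
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + i * step for i in range(T)]

def alpha_bars(betas):
    """Cumulative products of alpha_t = 1 - beta_t."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out

def q_sample(x0, t, abar, rng):
    """Draw x_t ~ q(x_t | x_0) in one shot; t is 1-indexed.

    Returns (x_t, eps) so eps can later serve as a regression target.
    """
    a = abar[t - 1]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
    return xt, eps

betas = linear_betas()
abar = alpha_bars(betas)
rng = random.Random(0)
x0 = [0.5, -0.25, 1.0, -1.0]          # toy data, scaled to [-1, 1]
x_mid, eps_mid = q_sample(x0, 500, abar, rng)
x_end, eps_end = q_sample(x0, 1000, abar, rng)
print(abar[0], abar[-1])              # near 1 at t = 1, near 0 at t = T
```

With this schedule ᾱ_T is tiny, so x_T carries almost no trace of x_0, which is what lets the reverse process start from pure noise.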

Reverse process (denoising)

This is the learned part. It is also a Markov chain, aiming to reverse the forward process. Starting from pure noise x_T, it iteratively predicts the distribution of the previous, less noisy state x_{t-1} given the current state x_t. Each step p_θ(x_{t-1}|x_t) is parameterized as a Gaussian. In DDPM only the mean is learned, by a neural network (a U-Net in the paper) that takes x_t and the timestep t as input. The variance is not learned: it is fixed to a time-dependent constant σ_t², and the paper reports similar results for σ_t² = β_t and for the forward-posterior variance β̃_t.
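To make the reverse chain concrete, here is a rough sketch of ancestral sampling in the spirit of the paper's Algorithm 2, with the fixed choice σ_t² = β_t. There is no trained network here. As a stand-in I use a hypothetical `oracle_eps` that returns the exact noise for a toy "dataset" in which every data point equals the constant `C`. That setup is my assumption, chosen because the correct answer is known.

```python
import math
import random

T = 1000
# Assumed linear schedule, as in the paper's experiments.
betas = [1e-4 + i * (0.02 - 1e-4) / (T - 1) for i in range(T)]
abar, prod = [], 1.0
for b in betas:
    prod *= 1.0 - b
    abar.append(prod)

C = 0.3  # toy dataset: every x_0 equals C in every dimension

def oracle_eps(xt, t):
    """Hypothetical stand-in for the network: exact noise for the toy dataset."""
    a = abar[t - 1]
    return [(x - math.sqrt(a) * C) / math.sqrt(1.0 - a) for x in xt]

def p_sample_step(xt, t, eps_model, rng):
    """One reverse step x_t -> x_{t-1}; t is 1-indexed."""
    beta, a = betas[t - 1], abar[t - 1]
    eps_hat = eps_model(xt, t)
    coef = beta / math.sqrt(1.0 - a)
    # Mean of p_theta(x_{t-1} | x_t), written in terms of the predicted noise.
    mean = [(x - coef * e) / math.sqrt(1.0 - beta) for x, e in zip(xt, eps_hat)]
    if t == 1:
        return mean                      # no noise is added on the final step
    sigma = math.sqrt(beta)              # fixed variance sigma_t^2 = beta_t
    return [m + sigma * rng.gauss(0.0, 1.0) for m in mean]

def sample(n_dims, eps_model, rng):
    """Run the full chain from x_T ~ N(0, I) down to x_0."""
    x = [rng.gauss(0.0, 1.0) for _ in range(n_dims)]
    for t in range(T, 0, -1):
        x = p_sample_step(x, t, eps_model, rng)
    return x

x0_gen = sample(4, oracle_eps, random.Random(0))
print(x0_gen)  # with the oracle, every entry should land on C
```

In a real model `oracle_eps` would be replaced by the trained U-Net. The loop itself would stay the same.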

Training objective (ELBO and simplified loss)

Training optimizes a variational bound (ELBO) on the data log-likelihood, similar to VAEs. The bound can be written as a sum of KL divergence terms, each comparing a reverse step p_θ(x_{t-1}|x_t) to the tractable forward-process posterior q(x_{t-1}|x_t, x_0). Because both are Gaussian, each KL term reduces to a weighted squared error between their means. The authors then reparameterize: instead of directly predicting the mean of the previous state x_{t-1}, the network predicts the noise ε that was mixed into x_0 to produce x_t. Dropping the per-timestep weights gives the simplified objective, a mean squared error between the true noise ε and the predicted noise ε_θ(x_t, t), which the authors found works very well. This connects diffusion models to denoising score matching and Langevin dynamics.
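For reference, the two objectives described above can be written out as follows. This is my transcription of the paper's decomposition of the bound and of L_simple, using the same symbols as the rest of this post.

```latex
% Variational bound on the negative log-likelihood, split into KL terms
L = \mathbb{E}_q\Big[
      \underbrace{D_{\mathrm{KL}}\big(q(x_T \mid x_0) \,\|\, p(x_T)\big)}_{L_T}
    + \sum_{t>1} \underbrace{D_{\mathrm{KL}}\big(q(x_{t-1} \mid x_t, x_0) \,\|\, p_\theta(x_{t-1} \mid x_t)\big)}_{L_{t-1}}
    \underbrace{-\, \log p_\theta(x_0 \mid x_1)}_{L_0}
  \Big]

% Simplified objective: unweighted MSE on the noise
L_{\mathrm{simple}}(\theta) = \mathbb{E}_{t,\, x_0,\, \epsilon}\Big[
    \big\| \epsilon - \epsilon_\theta\big(\sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon,\; t\big) \big\|^2
  \Big]
```

L_T has no learnable parameters because the forward process is fixed, so only the L_{t-1} and L_0 terms matter for training.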

What I learned

Predicting noise, not the image directly

Initially, the idea of training the network to predict noise (ε_θ) was confusing. Why predict random noise? The insight is that it is not predicting arbitrary noise, but the specific noise vector ε that was sampled and mixed into a specific x_0 at a specific step t to produce x_t. Given x_t and t, knowing ε is equivalent to knowing x_0, so a noise estimate is really an image estimate in disguise. Across all timesteps and data points, the network learns the relationship between noise patterns and image structure at different noise levels.
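A tiny sketch of why a noise estimate pins down the image: inverting the forward mix recovers x_0 from x_t and ε. The value of `abar_t` and the helper name `x0_from_eps` are made up for illustration.

```python
import math
import random

rng = random.Random(0)
abar_t = 0.5                               # assumed value of the cumulative product at some step t
x0 = [0.5, -0.25, 1.0, -1.0]               # toy data point
eps = [rng.gauss(0.0, 1.0) for _ in x0]    # the specific noise draw

# Forward: mix signal and noise.
xt = [math.sqrt(abar_t) * x + math.sqrt(1.0 - abar_t) * e for x, e in zip(x0, eps)]

def x0_from_eps(xt, eps_hat, abar_t):
    """Invert the forward mix: a noise estimate implies an x_0 estimate."""
    return [(x - math.sqrt(1.0 - abar_t) * e) / math.sqrt(abar_t)
            for x, e in zip(xt, eps_hat)]

x0_rec = x0_from_eps(xt, eps, abar_t)      # the exact noise gives back x_0
print(x0_rec)
```

With an imperfect noise estimate the same formula gives an approximate x_0, which is one way to read what the network is doing at each step.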

The simplified loss works surprisingly well

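The paper proposes a simplified objective (L_simple, Eq. 14) that drops the per-timestep weighting terms of the full variational bound. It is just mean squared error on predicted noise, with t sampled uniformly. The authors point out that, relative to the true bound, this down-weights the terms at small t, where very little noise has to be removed, so the network focuses on the harder denoising tasks at larger t. It felt like this was offloading a lot of theoretical complexity onto the neural network’s learning capacity, but empirically it led to the best sample quality. This practical simplification was a major takeaway.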
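As a rough sketch, a one-sample Monte Carlo estimate of L_simple could look like the following. The schedule is the assumed linear one, `zero_model` is a hypothetical untrained predictor, and averaging over dimensions instead of summing is my implementation choice (it only rescales the loss).

```python
import math
import random

T = 1000
# Assumed linear schedule, as in the paper's experiments.
betas = [1e-4 + i * (0.02 - 1e-4) / (T - 1) for i in range(T)]
abar, prod = [], 1.0
for b in betas:
    prod *= 1.0 - b
    abar.append(prod)

def simple_loss(x0, eps_model, rng):
    """One-sample estimate of L_simple for a single data point."""
    t = rng.randint(1, T)                        # uniform timestep, 1-indexed
    a = abar[t - 1]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]      # regression target
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
    eps_hat = eps_model(xt, t)
    # Mean over dimensions; the paper writes a squared norm (a sum).
    return sum((e - h) ** 2 for e, h in zip(eps, eps_hat)) / len(x0)

def zero_model(xt, t):
    """Hypothetical untrained predictor: always guesses zero noise."""
    return [0.0 for _ in xt]

rng = random.Random(0)
x0 = [0.5, -0.25, 1.0, -1.0]
losses = [simple_loss(x0, zero_model, rng) for _ in range(2000)]
avg_loss = sum(losses) / len(losses)
print(avg_loss)  # roughly 1 in expectation: the variance of unit Gaussian noise
```

A real training loop would backpropagate this loss through the network. The point here is only that the target is the sampled ε and that no per-timestep weight appears.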

Discrete data needs care at the end

The diffusion process operates in continuous space by adding Gaussian noise, but image data is discrete: integer pixel values in 0-255, which the paper scales linearly to [-1, 1]. The final step of the reverse process therefore needs special handling. Equation 13 describes how to get discrete log-likelihoods by integrating the final continuous Gaussian output over the bin around each discrete pixel value, with the two outermost bins extended to infinity. It is a necessary bridge between continuous modeling and discrete data.
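Here is a small sketch of that binning idea for a single pixel, assuming pixels are scaled to [-1, 1] so that neighbouring values are 2/255 apart. The function names and the 1e-12 floor inside the log are my own choices, not from the paper.

```python
import math

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def discrete_log_prob(pixel, mean, sigma):
    """Log-probability of an integer pixel (0..255) under N(mean, sigma^2).

    mean and sigma live on the [-1, 1] scale. Each pixel owns a bin of
    half-width 1/255; the first and last bins extend to infinity.
    """
    x = pixel / 127.5 - 1.0
    upper = 1.0 if pixel == 255 else normal_cdf((x + 1.0 / 255.0 - mean) / sigma)
    lower = 0.0 if pixel == 0 else normal_cdf((x - 1.0 / 255.0 - mean) / sigma)
    return math.log(max(upper - lower, 1e-12))   # floor avoids log(0) in far tails

mean, sigma = 0.2, 0.01                          # made-up decoder output for one pixel
probs = [math.exp(discrete_log_prob(p, mean, sigma)) for p in range(256)]
total = sum(probs)
best = max(range(256), key=lambda p: probs[p])
print(total, best)  # bins tile the real line, so the total should be about 1
```

Because the bins tile the whole real line, the per-pixel probabilities form a proper distribution over the 256 values.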

Good samples, okay likelihood

The paper notes that while DDPMs achieved excellent sample quality, with a state-of-the-art FID score on unconditional CIFAR-10 at the time, their log-likelihoods were not competitive with other likelihood-based models such as flows or autoregressive models. This suggests an interesting trade-off: the diffusion process seems especially good for perceptual quality, even if it does not assign the highest probability density to held-out data.

Math is dense, intuition is key

The paper is mathematically dense, especially the derivations connecting the ELBO to the noise prediction objective. Following every step was challenging. But the core intuition is clear: reverse a gradual noising process by learning to predict the noise added at each step.

Connection to score matching and Langevin dynamics

The noise prediction parameterization ε_θ explicitly connects DDPM training to denoising score matching across multiple noise scales, and the sampling process resembles annealed Langevin dynamics. This provides a link to other areas of generative modeling and physics-inspired methods.
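The link can be made explicit with a short derivation. This is a standard result that I am writing out from the closed-form forward distribution, not a quote from the paper.

```latex
% Closed-form forward distribution
q(x_t \mid x_0) = \mathcal{N}\big(x_t;\ \sqrt{\bar\alpha_t}\, x_0,\ (1 - \bar\alpha_t)\, I\big),
\qquad x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon

% Its score (gradient of the log-density with respect to x_t)
\nabla_{x_t} \log q(x_t \mid x_0)
  = -\frac{x_t - \sqrt{\bar\alpha_t}\, x_0}{1 - \bar\alpha_t}
  = -\frac{\epsilon}{\sqrt{1 - \bar\alpha_t}}

% So a noise predictor is a rescaled score estimate
\epsilon_\theta(x_t, t) \approx -\sqrt{1 - \bar\alpha_t}\ \nabla_{x_t} \log q(x_t)
```

So regressing on ε at each noise level amounts to denoising score matching at that level, up to a scale factor that depends only on t.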

Summary

DDPMs learn to generate by reversing a fixed diffusion process. The key idea is to predict the noise added at each step, using a simplified objective that works surprisingly well.

The theory is dense, but the core mechanism is intuitive. Diffusion models trade off likelihood quality for perceptual sample quality in a way that turned out to be extremely useful.