From DDPM to DDIM — Accelerate Sampling without Extra Training

1.  Introduction

Although Denoising Diffusion Probabilistic Models (DDPM) have shown great success in generative tasks, their sampling process is slow because it requires many iterative steps (e.g., 1000), each of which needs a forward pass through the neural network.

To address this issue, researchers from Stanford University proposed Denoising Diffusion Implicit Models (DDIM), which accelerate sampling from a pre-trained DDPM without any additional training. DDIM achieve this by introducing a non-Markovian diffusion process that allows far fewer sampling steps while maintaining high-quality outputs.

2.  Key Concepts

DDIM follow a similar notation to DDPM, where $\mathrm x_t$ represents the noisy data at time step $t$, and $\epsilon_\theta(\mathrm x_t, t)$ is the neural network predicting the noise component. The key difference lies in the sampling process.

$$p_\theta(\mathrm x_0) = \int p_\theta(\mathrm x_{0:T}) \, \mathrm d\mathrm x_{1:T}$$

where

$$p_\theta(\mathrm x_{0:T}) = p(\mathrm x_T) \prod_{t=1}^{T} p_\theta(\mathrm x_{t-1}|\mathrm x_t)$$

DDPM use a special property of the forward process:

$$q(\mathrm x_{t}|\mathrm x_0) = \mathcal N\left(\mathrm x_t; \sqrt{\bar \alpha_t} \mathrm x_0, (1-\bar \alpha_t) \mathbf I\right)$$

so that $\mathrm x_t$ can be reparameterized as:

$$\mathrm x_t = \sqrt{\bar \alpha_t} \mathrm x_0 + \sqrt{1-\bar \alpha_t} \epsilon, \quad \epsilon \sim \mathcal N(0, \mathbf I)$$

Therefore, the posterior distribution (which the model learns to predict) can be derived as:

$$\begin{aligned} q(\mathrm x_{t-1}|\mathrm x_t,\mathrm x_0) &= \frac{q(\mathrm x_t|\mathrm x_{t-1}, \mathrm x_0)\, q(\mathrm x_{t-1}|\mathrm x_0)}{q(\mathrm x_t|\mathrm x_0)} \\ &= \mathcal N\left(\mathrm x_{t-1}; \tilde \mu_t(\mathrm x_t, \mathrm x_0), \tilde \beta_t \mathbf I\right) \end{aligned}$$

The derivation of $\tilde \mu_t(\mathrm x_t, \mathrm x_0)$ and $\tilde \beta_t$ relies on the Markov property of the forward process and the reparameterization of $\mathrm x_t$.
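For reference, carrying that derivation through yields the familiar DDPM posterior parameters:

$$\tilde \mu_t(\mathrm x_t, \mathrm x_0) = \frac{\sqrt{\bar \alpha_{t-1}}\,\beta_t}{1-\bar \alpha_t}\,\mathrm x_0 + \frac{\sqrt{\alpha_t}\,(1-\bar \alpha_{t-1})}{1-\bar \alpha_t}\,\mathrm x_t, \qquad \tilde \beta_t = \frac{1-\bar \alpha_{t-1}}{1-\bar \alpha_t}\,\beta_t$$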

The key idea of DDIM is to generalize this posterior distribution to a non-Markovian form with arbitrary transition steps.

3.  Forward Process

The forward process of DDPM is defined as:

$$q(\mathrm x_{1:T}|\mathrm x_0) = \prod_{t=1}^{T} q(\mathrm x_t|\mathrm x_{t-1})$$

where

$$q(\mathrm x_t|\mathrm x_{t-1}) = \mathcal N\left(\mathrm x_t; \sqrt{1-\beta_t}\, \mathrm x_{t-1},\ \beta_t \mathbf I\right)$$

What we are going to use is the property that we can directly sample $\mathrm x_t$ from $\mathrm x_0$:

$$q(\mathrm x_t|\mathrm x_0) = \mathcal N\left(\mathrm x_t; \sqrt{\bar \alpha_t} \mathrm x_0, (1-\bar \alpha_t) \mathbf I\right)$$
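As a concrete illustration, the snippet below draws $\mathrm x_t$ directly from $\mathrm x_0$ using this property. It is a minimal NumPy sketch; the linear `betas` schedule and the number of steps are illustrative assumptions, not values taken from the text.

```python
import numpy as np

# Illustrative DDPM noise schedule (assumption): 1000 linearly spaced betas.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)  # \bar{alpha}_t, indexed 0..T-1

def q_sample(x0, t, rng=np.random):
    """Draw x_t ~ q(x_t | x_0) = N(sqrt(abar_t) * x_0, (1 - abar_t) * I)."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
```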

Consider an arbitrary pair of time steps $k, s \in \{0, 1, \ldots, T\}$ with $s \leq k - 1$. We want to find the distribution $q(\mathrm x_s|\mathrm x_k, \mathrm x_0)$, so that we can skip steps during sampling.

However, unlike the Markovian case, we cannot derive this distribution directly. Instead, we assume that it follows a Gaussian distribution:

$$q(\mathrm x_s |\mathrm x_k, \mathrm x_0) = \mathcal N\left(a\,\mathrm x_0 + m\,\mathrm x_k, \sigma^2 \mathbf I\right)$$

where $a$, $m$, and $\sigma$ are coefficients to be determined (we write the $\mathrm x_0$-coefficient as $a$ to avoid clashing with the time index $k$). We now solve for them.

$$\begin{aligned} \mathrm x_s &= a\, \mathrm x_0 + m\, \mathrm x_k + \sigma \epsilon, \quad \epsilon \sim \mathcal N(0, \mathbf I) \\ &= a\, \mathrm x_0 + m\left(\sqrt{\bar \alpha_k} \mathrm x_0 + \sqrt{1-\bar \alpha_k}\, \epsilon'\right) + \sigma \epsilon, \quad \epsilon' \sim \mathcal N(0, \mathbf I) \\ &= \left(a + m \sqrt{\bar \alpha_k}\right) \mathrm x_0 + m \sqrt{1-\bar \alpha_k}\, \epsilon' + \sigma \epsilon \\ &\sim \mathcal N\left(\left(a + m \sqrt{\bar \alpha_k}\right) \mathrm x_0, \left(m^2 (1-\bar \alpha_k) + \sigma^2\right) \mathbf I\right) \end{aligned}$$

To ensure consistency with the forward process, we need to match the mean and variance with those of $q(\mathrm x_s|\mathrm x_0)$:

$$q(\mathrm x_s|\mathrm x_0) = \mathcal N\left(\sqrt{\bar \alpha_s} \mathrm x_0, (1-\bar \alpha_s) \mathbf I\right)$$

This gives us two equations:

$$\begin{aligned} a + m \sqrt{\bar \alpha_k} &= \sqrt{\bar \alpha_s} \\ m^2 (1-\bar \alpha_k) + \sigma^2 &= 1-\bar \alpha_s \end{aligned}$$

We treat $\sigma$ as a hyperparameter to control the stochasticity of the sampling process. By solving the above equations, we can obtain:

$$\begin{aligned} m &= \sqrt{\frac{(1-\bar \alpha_s) - \sigma^2}{1-\bar \alpha_k}} \\ a &= \sqrt{\bar \alpha_s} - m \sqrt{\bar \alpha_k} \end{aligned}$$
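As a quick numerical sanity check of these formulas, the sketch below solves for $a$ and $m$ and verifies that the implied marginal of $\mathrm x_s$ matches $q(\mathrm x_s|\mathrm x_0)$; the particular $\bar\alpha$ and $\sigma$ values are arbitrary illustrative choices.

```python
import numpy as np

def ddim_coeffs(abar_s, abar_k, sigma):
    """Solve a + m*sqrt(abar_k) = sqrt(abar_s) and m^2*(1-abar_k) + sigma^2 = 1-abar_s."""
    m = np.sqrt(((1.0 - abar_s) - sigma**2) / (1.0 - abar_k))
    a = np.sqrt(abar_s) - m * np.sqrt(abar_k)
    return a, m

# s < k, so abar_s > abar_k (the cumulative product decreases with t); values are illustrative.
abar_k, abar_s, sigma = 0.30, 0.70, 0.1
a, m = ddim_coeffs(abar_s, abar_k, sigma)
assert np.isclose(a + m * np.sqrt(abar_k), np.sqrt(abar_s))        # mean matches
assert np.isclose(m**2 * (1.0 - abar_k) + sigma**2, 1.0 - abar_s)  # variance matches
```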

With these coefficients, we can define the DDIM sampling step from $\mathrm x_k$ to $\mathrm x_s$ as:

$$\begin{aligned} q(\mathrm x_s |\mathrm x_k, \mathrm x_0) &= \mathcal N\left(\mathrm x_s; a\, \mathrm x_0 + m\, \mathrm x_k, \sigma^2 \mathbf I\right)\\ &= \mathcal N\left(\mathrm x_s; \sqrt{\bar \alpha_s} \mathrm x_0 + \sqrt{(1-\bar \alpha_s) - \sigma^2} \cdot \frac{\mathrm x_k - \sqrt{\bar \alpha_k} \mathrm x_0}{\sqrt{1-\bar \alpha_k}}, \sigma^2 \mathbf I\right) \end{aligned}$$

The magnitude of $\sigma$ controls how stochastic the forward process is. When $\sigma \to 0$, we reach an extreme case where, as long as we observe $\mathrm x_0$ and $\mathrm x_t$ for some $t$, $\mathrm x_{t-1}$ becomes known and fixed.

DDIM thus changes the forward process from a Markovian one to a non-Markovian one and assumes a particular form for the transition distribution. This raises the question of whether the learned model can still work well under the new forward process.

However, what DDPM truly relies on is the reparameterization $\mathrm x_t = \sqrt{\bar \alpha_t} \mathrm x_0 + \sqrt{1-\bar \alpha_t} \epsilon$, and this property still holds in DDIM because its forward process leaves the marginals $q(\mathrm x_t|\mathrm x_0)$ unchanged. The learned model can therefore be reused for DDIM sampling, as we explain in the next section.

4.  Sampling Process

As we have trained a DDPM model to predict $\epsilon_\theta(\mathrm x_t, t)$, we can use it to estimate $\mathrm x_{t-1}$ from $\mathrm x_t$:

$$\mathrm x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(\mathrm x_t - \frac{1-\alpha_t}{\sqrt{1-\bar \alpha_t}} \epsilon_\theta(\mathrm x_t, t)\right) + \sigma_t \epsilon, \quad \epsilon \sim \mathcal N(0, \mathbf I)$$
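In code, one DDPM reverse step might look like the following sketch, where `eps_model` stands in for the trained network $\epsilon_\theta$ and the schedule arrays (`alphas`, `alpha_bars`, per-step `sigmas`) are assumed to be given:

```python
import numpy as np

def ddpm_step(x_t, t, eps_model, alphas, alpha_bars, sigmas, rng=np.random):
    """One ancestral sampling step x_t -> x_{t-1} of DDPM."""
    eps_hat = eps_model(x_t, t)
    mean = (x_t - (1.0 - alphas[t]) / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
    noise = rng.standard_normal(x_t.shape) if t > 0 else 0.0  # no noise on the final step
    return mean + sigmas[t] * noise
```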

Essentially, what we have learned in DDPM is to predict $\mathrm x_{t-1}$ from $\mathrm x_t$:

$$p_\theta(\mathrm x_{t-1}|\mathrm x_t) = \mathcal N\left(\mathrm x_{t-1}; \tilde \mu_\theta(\mathrm x_t, t), \sigma_t^2 \mathbf I\right)$$

The goal of DDPM is to minimize the difference between $p_\theta(\mathrm x_{t-1}|\mathrm x_t)$ and $q(\mathrm x_{t-1}|\mathrm x_t,\mathrm x_0)$. In doing so, DDPM reparameterize $\mathrm x_t$ in terms of $\mathrm x_0$ and $\epsilon$:

$$\mathrm x_t = \sqrt{\bar \alpha_t} \mathrm x_0 + \sqrt{1-\bar \alpha_t} \epsilon, \quad \epsilon \sim \mathcal N(0, \mathbf I)$$

Thus, in essence, DDPM learn the noise $\epsilon$, from which $\mathrm x_0$ can be recovered from $\mathrm x_t$:

$$\mathrm x_0 = \frac{\mathrm x_t - \sqrt{1-\bar \alpha_t}\, \epsilon}{\sqrt{\bar \alpha_t}}$$

With the learned noise predictor, during the sampling process of DDIM we can substitute the predicted noise $\epsilon_\theta(\mathrm x_k, k)$ for $\epsilon$ to obtain an estimate of $\mathrm x_0$:

$$\hat{\mathrm x}_0(\mathrm x_k, k) = \frac{\mathrm x_k - \sqrt{1-\bar \alpha_k}\, \epsilon_\theta(\mathrm x_k, k)}{\sqrt{\bar \alpha_k}}$$
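As a helper function (a sketch under the same assumptions, with `eps_model` again the trained noise predictor):

```python
import numpy as np

def predict_x0(x_k, k, eps_model, alpha_bars):
    """Estimate x_0 from x_k via the predicted noise epsilon_theta(x_k, k)."""
    return (x_k - np.sqrt(1.0 - alpha_bars[k]) * eps_model(x_k, k)) / np.sqrt(alpha_bars[k])
```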

Then we can generalize this to predict $\mathrm x_s$ from $\mathrm x_k$:

$$\begin{aligned} \mathrm x_s &= \sqrt{\bar \alpha_s}\, \hat{\mathrm x}_0 + \sqrt{(1-\bar \alpha_s) - \sigma^2} \cdot \frac{\mathrm x_k - \sqrt{\bar \alpha_k}\, \hat{\mathrm x}_0}{\sqrt{1-\bar \alpha_k}} + \sigma \epsilon, \quad \epsilon \sim \mathcal N(0, \mathbf I)\\ &= \sqrt{\bar \alpha_s} \left(\frac{\mathrm x_k - \sqrt{1-\bar \alpha_k}\, \epsilon_\theta(\mathrm x_k, k)}{\sqrt{\bar \alpha_k}}\right) + \sqrt{(1-\bar \alpha_s) - \sigma^2} \cdot \frac{\mathrm x_k - \sqrt{\bar \alpha_k} \left(\frac{\mathrm x_k - \sqrt{1-\bar \alpha_k}\, \epsilon_\theta(\mathrm x_k, k)}{\sqrt{\bar \alpha_k}}\right)}{\sqrt{1-\bar \alpha_k}} + \sigma \epsilon\\ &= \sqrt{\bar \alpha_s} \left(\frac{\mathrm x_k - \sqrt{1-\bar \alpha_k}\, \epsilon_\theta(\mathrm x_k, k)}{\sqrt{\bar \alpha_k}}\right) + \sqrt{(1-\bar \alpha_s) - \sigma^2} \cdot \sqrt{1-\bar \alpha_k} \cdot \frac{\epsilon_\theta(\mathrm x_k, k)}{\sqrt{1-\bar \alpha_k}} + \sigma \epsilon\\ &= \sqrt{\bar \alpha_s} \left(\frac{\mathrm x_k - \sqrt{1-\bar \alpha_k}\, \epsilon_\theta(\mathrm x_k, k)}{\sqrt{\bar \alpha_k}}\right) + \sqrt{(1-\bar \alpha_s) - \sigma^2} \cdot \epsilon_\theta(\mathrm x_k, k) + \sigma \epsilon \end{aligned}$$

By iteratively applying this process with a reduced number of steps, we can efficiently generate high-quality samples from the diffusion model without the need for additional training.
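Putting the pieces together, a minimal DDIM sampling loop could look like the sketch below. All names are assumptions for illustration: `eps_model` is the trained noise predictor, `alpha_bars` the cumulative schedule, and `timesteps` a decreasing subsequence of step indices chosen by the user.

```python
import numpy as np

def ddim_sample(x_T, timesteps, eps_model, alpha_bars, sigma=0.0, rng=np.random):
    """Iterate the DDIM update x_k -> x_s over a sparse, decreasing list of timesteps.

    timesteps : decreasing step indices, e.g. [999, 899, ..., 99, 0].
    sigma     : per-step stochasticity; 0.0 gives the deterministic sampler.
    """
    x = x_T
    for k, s in zip(timesteps[:-1], timesteps[1:]):
        abar_k, abar_s = alpha_bars[k], alpha_bars[s]
        eps_hat = eps_model(x, k)                                         # predicted noise
        x0_hat = (x - np.sqrt(1.0 - abar_k) * eps_hat) / np.sqrt(abar_k)  # predicted x_0
        x = (np.sqrt(abar_s) * x0_hat
             + np.sqrt(1.0 - abar_s - sigma**2) * eps_hat
             + sigma * rng.standard_normal(x.shape))
    return x
```

With, say, 50 evenly spaced timesteps instead of 1000, this loop calls the network only 50 times while reusing the DDPM-trained $\epsilon_\theta$ unchanged.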

5.  The Choice of σ

When $\sigma = 0$, the sampling process becomes deterministic, meaning that for a given initial noise input, the output will always be the same. This can be beneficial in scenarios where reproducibility is important, or when we want to ensure that the generated samples are consistent.

On the other hand, when $\sigma > 0$, the sampling process introduces stochasticity, allowing for more diverse outputs from the same initial noise input. This can be advantageous in creative applications where variability is desired, such as image generation and other generative tasks.

When $\sigma = \sqrt{\frac{1-\bar \alpha_{t-1}}{1-\bar \alpha_{t}}\beta_t}$, the sampling process of DDIM is equivalent to that of DDPM.

To prove this, we can substitute this value of $\sigma$ into the DDIM sampling equation (taking adjacent steps, i.e., $k = t$ and $s = t-1$):

$$\begin{aligned} \mathrm x_{t-1} &= \sqrt{\bar \alpha_{t-1}} \left(\frac{\mathrm x_t - \sqrt{1-\bar \alpha_t}\, \epsilon_\theta(\mathrm x_t, t)}{\sqrt{\bar \alpha_t}}\right) + \sqrt{(1-\bar \alpha_{t-1}) - \sigma^2} \cdot \epsilon_\theta(\mathrm x_t, t) + \sigma \epsilon \\ &= \frac{\sqrt{\bar \alpha_{t-1}}}{\sqrt{\bar \alpha_t}} \mathrm x_t - \frac{\sqrt{\bar \alpha_{t-1}} \sqrt{1-\bar \alpha_t}}{\sqrt{\bar \alpha_t}} \epsilon_\theta(\mathrm x_t, t) + \sqrt{(1-\bar \alpha_{t-1}) - \frac{1-\bar \alpha_{t-1}}{1-\bar \alpha_{t}}\beta_t} \cdot \epsilon_\theta(\mathrm x_t, t) + \sqrt{\frac{1-\bar \alpha_{t-1}}{1-\bar \alpha_{t}}\beta_t}\, \epsilon \\ &= \frac{1}{\sqrt{\alpha_t}}\left(\mathrm x_t - \frac{1-\alpha_t}{\sqrt{1-\bar \alpha_t}} \epsilon_\theta(\mathrm x_t, t)\right) + \sqrt{\frac{1-\bar \alpha_{t-1}}{1-\bar \alpha_{t}}\beta_t}\, \epsilon \\ &= \tilde \mu_\theta(\mathrm x_t, t) + \sigma_t \epsilon \end{aligned}$$

where $\sigma_t = \sqrt{\frac{1-\bar \alpha_{t-1}}{1-\bar \alpha_{t}}\beta_t}$, so that $\sigma_t^2$ is exactly the posterior variance $\tilde \beta_t = \frac{1-\bar \alpha_{t-1}}{1-\bar \alpha_{t}}\beta_t$ used in DDPM.
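A quick numerical check of this equivalence (a sketch with an illustrative linear schedule; it compares the coefficient of $\epsilon_\theta$ in the DDIM update against the one in the DDPM mean):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
abar = np.cumprod(alphas)

t = 500  # arbitrary interior step
sigma2 = (1.0 - abar[t - 1]) / (1.0 - abar[t]) * betas[t]
# Coefficient of eps_theta in the DDIM step with this sigma ...
ddim_coeff = (-np.sqrt(abar[t - 1]) * np.sqrt(1.0 - abar[t]) / np.sqrt(abar[t])
              + np.sqrt((1.0 - abar[t - 1]) - sigma2))
# ... versus the coefficient of eps_theta in the DDPM posterior mean.
ddpm_coeff = -(1.0 - alphas[t]) / (np.sqrt(alphas[t]) * np.sqrt(1.0 - abar[t]))
assert np.isclose(ddim_coeff, ddpm_coeff)
```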

DDIM introduce a hyperparameter $\eta$ to control the level of stochasticity in the sampling process. The relationship between $\sigma$ and $\eta$ is defined as:

$$\sigma_t = \eta \sqrt{\frac{1-\bar \alpha_{t-1}}{1-\bar \alpha_{t}}\beta_t}$$
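For adjacent timesteps this is a one-liner (a sketch; `betas` and `alpha_bars` are the schedule arrays from the earlier snippets; note that for non-adjacent steps the DDIM paper replaces $\beta_t$ with $1 - \bar\alpha_k/\bar\alpha_s$):

```python
import numpy as np

def ddim_sigma(t, eta, betas, alpha_bars):
    """sigma_t = eta * sqrt((1 - abar_{t-1}) / (1 - abar_t) * beta_t), for adjacent steps."""
    return eta * np.sqrt((1.0 - alpha_bars[t - 1]) / (1.0 - alpha_bars[t]) * betas[t])
```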

When $\eta = 0$, the sampling process is deterministic, while when $\eta = 1$, it becomes equivalent to the stochastic sampling of DDPM. By adjusting $\eta$, users can control the trade-off between sample diversity and fidelity according to their specific needs.

References
1. Denoising Diffusion Probabilistic Models, Jonathan Ho, Ajay Jain, Pieter Abbeel, 2020.
2. Denoising Diffusion Implicit Models, Jiaming Song, Chenlin Meng, Stefano Ermon, CoRR, 2020.

Oct. 28, 2025