From DDPM to DDIM — Accelerate Sampling without Extra Training

1.  Introduction

Although Denoising Diffusion Probabilistic Models (DDPM) have shown great success in generative tasks, their sampling process is slow because it requires many iterative steps (e.g., 1000), each of which needs a forward pass through the neural network.

To address this issue, researchers from Stanford University proposed Denoising Diffusion Implicit Models (DDIM), which accelerate sampling from a pre-trained DDPM without any additional training. DDIM achieve this by introducing a non-Markovian diffusion process that allows far fewer sampling steps while maintaining high-quality outputs.

2.  Key Concepts

DDIM follow a similar notation to DDPM, where $\mathrm x_t$ represents the noisy data at time step $t$, and $\epsilon_\theta(\mathrm x_t, t)$ is the neural network predicting the noise component. The key difference lies in the sampling process.

$$p_\theta(\mathrm x_0) = \int p_\theta(\mathrm x_{0:T}) \, \mathrm d\mathrm x_{1:T}$$

where

$$p_\theta(\mathrm x_{0:T}) = p(\mathrm x_T) \prod_{t=1}^{T} p_\theta(\mathrm x_{t-1}|\mathrm x_t)$$

DDPM use a special property of the forward process:

$$q(\mathrm x_{t}|\mathrm x_0) = \mathcal N\left(\mathrm x_t; \sqrt{\bar \alpha_t} \mathrm x_0, (1-\bar \alpha_t) \mathbf I\right)$$

so that $\mathrm x_t$ can be reparameterized as:

$$\mathrm x_t = \sqrt{\bar \alpha_t} \mathrm x_0 + \sqrt{1-\bar \alpha_t} \epsilon, \quad \epsilon \sim \mathcal N(0, \mathbf I)$$

Therefore, the posterior distribution (which the model learns to predict) can be derived as:

$$\begin{aligned} q(\mathrm x_{t-1}|\mathrm x_t,\mathrm x_0) &= \frac{q(\mathrm x_t|\mathrm x_{t-1}, \mathrm x_0)\, q(\mathrm x_{t-1}|\mathrm x_0)}{q(\mathrm x_t|\mathrm x_0)} \\ &= \mathcal N\left(\mathrm x_{t-1}; \tilde \mu_t(\mathrm x_t, \mathrm x_0), \tilde \beta_t \mathbf I\right) \end{aligned}$$

The derivation of $\tilde \mu_t(\mathrm x_t, \mathrm x_0)$ and $\tilde \beta_t$ relies on the Markov property of the forward process and the reparameterization of $\mathrm x_t$.
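For reference, carrying that derivation through yields the familiar DDPM posterior parameters:

$$\tilde \mu_t(\mathrm x_t, \mathrm x_0) = \frac{\sqrt{\bar \alpha_{t-1}}\,\beta_t}{1-\bar \alpha_t}\,\mathrm x_0 + \frac{\sqrt{\alpha_t}\,(1-\bar \alpha_{t-1})}{1-\bar \alpha_t}\,\mathrm x_t, \qquad \tilde \beta_t = \frac{1-\bar \alpha_{t-1}}{1-\bar \alpha_t}\,\beta_t$$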

The key idea of DDIM is to generalize this posterior distribution to a non-Markovian form with arbitrary transition steps.

3.  Forward Process

The forward process of DDPM is defined as:

$$q(\mathrm x_{1:T}|\mathrm x_0) = \prod_{t=1}^{T} q(\mathrm x_t|\mathrm x_{t-1})$$

where

$$q(\mathrm x_t|\mathrm x_{t-1}) = \mathcal N\left(\mathrm x_t; \sqrt{1-\beta_t}\, \mathrm x_{t-1},\ \beta_t \mathbf I\right)$$

What we are going to use is the property that we can directly sample $\mathrm x_t$ from $\mathrm x_0$:

$$q(\mathrm x_t|\mathrm x_0) = \mathcal N\left(\mathrm x_t; \sqrt{\bar \alpha_t} \mathrm x_0, (1-\bar \alpha_t) \mathbf I\right)$$
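As a concrete illustration, the snippet below draws $\mathrm x_t$ directly from $\mathrm x_0$ using this property. It is a minimal NumPy sketch; the linear `betas` schedule and the number of steps are illustrative assumptions, not values taken from the text.

```python
import numpy as np

# Illustrative DDPM noise schedule (assumption): 1000 linearly spaced betas.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)  # \bar{alpha}_t, indexed 0..T-1

def q_sample(x0, t, rng=np.random):
    """Draw x_t ~ q(x_t | x_0) = N(sqrt(abar_t) * x_0, (1 - abar_t) * I)."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
```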

Consider an arbitrary pair of time steps $k, s \in \{0, 1, \ldots, T\}$ with $s \leq k - 1$. We want to find the distribution $q(\mathrm x_s|\mathrm x_k, \mathrm x_0)$, so that we can skip steps during sampling.

However, unlike the Markovian case, we cannot derive this distribution directly. Instead, we assume that it follows a Gaussian distribution:

$$q(\mathrm x_s |\mathrm x_k, \mathrm x_0) = \mathcal N\left(a\,\mathrm x_0 + m\,\mathrm x_k, \sigma^2 \mathbf I\right)$$

where $a$, $m$, and $\sigma$ are coefficients to be determined (we write the $\mathrm x_0$-coefficient as $a$ to avoid clashing with the time index $k$). We now solve for them.

$$\begin{aligned} \mathrm x_s &= a\, \mathrm x_0 + m\, \mathrm x_k + \sigma \epsilon, \quad \epsilon \sim \mathcal N(0, \mathbf I) \\ &= a\, \mathrm x_0 + m\left(\sqrt{\bar \alpha_k} \mathrm x_0 + \sqrt{1-\bar \alpha_k}\, \epsilon'\right) + \sigma \epsilon, \quad \epsilon' \sim \mathcal N(0, \mathbf I) \\ &= \left(a + m \sqrt{\bar \alpha_k}\right) \mathrm x_0 + m \sqrt{1-\bar \alpha_k}\, \epsilon' + \sigma \epsilon \\ &\sim \mathcal N\left(\left(a + m \sqrt{\bar \alpha_k}\right) \mathrm x_0, \left(m^2 (1-\bar \alpha_k) + \sigma^2\right) \mathbf I\right) \end{aligned}$$

To ensure consistency with the forward process, we need to match the mean and variance with those of $q(\mathrm x_s|\mathrm x_0)$:

$$q(\mathrm x_s|\mathrm x_0) = \mathcal N\left(\sqrt{\bar \alpha_s} \mathrm x_0, (1-\bar \alpha_s) \mathbf I\right)$$

This gives us two equations:

$$\begin{aligned} a + m \sqrt{\bar \alpha_k} &= \sqrt{\bar \alpha_s} \\ m^2 (1-\bar \alpha_k) + \sigma^2 &= 1-\bar \alpha_s \end{aligned}$$

We treat $\sigma$ as a hyperparameter to control the stochasticity of the sampling process. By solving the above equations, we can obtain:

$$\begin{aligned} m &= \sqrt{\frac{(1-\bar \alpha_s) - \sigma^2}{1-\bar \alpha_k}} \\ a &= \sqrt{\bar \alpha_s} - m \sqrt{\bar \alpha_k} \end{aligned}$$
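As a quick numerical sanity check of these formulas, the sketch below solves for $a$ and $m$ and verifies that the implied marginal of $\mathrm x_s$ matches $q(\mathrm x_s|\mathrm x_0)$; the particular $\bar\alpha$ and $\sigma$ values are arbitrary illustrative choices.

```python
import numpy as np

def ddim_coeffs(abar_s, abar_k, sigma):
    """Solve a + m*sqrt(abar_k) = sqrt(abar_s) and m^2*(1-abar_k) + sigma^2 = 1-abar_s."""
    m = np.sqrt(((1.0 - abar_s) - sigma**2) / (1.0 - abar_k))
    a = np.sqrt(abar_s) - m * np.sqrt(abar_k)
    return a, m

# s < k, so abar_s > abar_k (the cumulative product decreases with t); values are illustrative.
abar_k, abar_s, sigma = 0.30, 0.70, 0.1
a, m = ddim_coeffs(abar_s, abar_k, sigma)
assert np.isclose(a + m * np.sqrt(abar_k), np.sqrt(abar_s))        # mean matches
assert np.isclose(m**2 * (1.0 - abar_k) + sigma**2, 1.0 - abar_s)  # variance matches
```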

With these coefficients, we can define the DDIM sampling step from $\mathrm x_k$ to $\mathrm x_s$ as:

$$\begin{aligned} q(\mathrm x_s |\mathrm x_k, \mathrm x_0) &= \mathcal N\left(\mathrm x_s; a\, \mathrm x_0 + m\, \mathrm x_k, \sigma^2 \mathbf I\right)\\ &= \mathcal N\left(\mathrm x_s; \sqrt{\bar \alpha_s} \mathrm x_0 + \sqrt{(1-\bar \alpha_s) - \sigma^2} \cdot \frac{\mathrm x_k - \sqrt{\bar \alpha_k} \mathrm x_0}{\sqrt{1-\bar \alpha_k}}, \sigma^2 \mathbf I\right) \end{aligned}$$

The magnitude of $\sigma$ controls how stochastic the forward process is. When $\sigma \to 0$, we reach an extreme case where, as long as we observe $\mathrm x_0$ and $\mathrm x_t$ for some $t$, $\mathrm x_{t-1}$ becomes known and fixed.

DDIM thus changes the forward process from a Markovian one to a non-Markovian one and assumes a particular form for the transition distribution. This raises the question of whether the learned model can still work well under the new forward process.

However, what DDPM truly relies on is the reparameterization $\mathrm x_t = \sqrt{\bar \alpha_t} \mathrm x_0 + \sqrt{1-\bar \alpha_t} \epsilon$, and this property still holds in DDIM because its forward process leaves the marginals $q(\mathrm x_t|\mathrm x_0)$ unchanged. The learned model can therefore be reused for DDIM sampling, as we explain in the next section.

4.  Sampling Process

As we have trained a DDPM model to predict $\epsilon_\theta(\mathrm x_t, t)$, we can use it to estimate $\mathrm x_{t-1}$ from $\mathrm x_t$:

$$\mathrm x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(\mathrm x_t - \frac{1-\alpha_t}{\sqrt{1-\bar \alpha_t}} \epsilon_\theta(\mathrm x_t, t)\right) + \sigma_t \epsilon, \quad \epsilon \sim \mathcal N(0, \mathbf I)$$
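In code, one DDPM reverse step might look like the following sketch, where `eps_model` stands in for the trained network $\epsilon_\theta$ and the schedule arrays (`alphas`, `alpha_bars`, per-step `sigmas`) are assumed to be given:

```python
import numpy as np

def ddpm_step(x_t, t, eps_model, alphas, alpha_bars, sigmas, rng=np.random):
    """One ancestral sampling step x_t -> x_{t-1} of DDPM."""
    eps_hat = eps_model(x_t, t)
    mean = (x_t - (1.0 - alphas[t]) / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
    noise = rng.standard_normal(x_t.shape) if t > 0 else 0.0  # no noise on the final step
    return mean + sigmas[t] * noise
```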

Essentially, what we have learned in DDPM is to predict $\mathrm x_{t-1}$ from $\mathrm x_t$:

$$p_\theta(\mathrm x_{t-1}|\mathrm x_t) = \mathcal N\left(\mathrm x_{t-1}; \tilde \mu_\theta(\mathrm x_t, t), \sigma_t^2 \mathbf I\right)$$

The goal of DDPM is to minimize the difference between $p_\theta(\mathrm x_{t-1}|\mathrm x_t)$ and $q(\mathrm x_{t-1}|\mathrm x_t,\mathrm x_0)$. In doing so, DDPM reparameterize $\mathrm x_t$ in terms of $\mathrm x_0$ and $\epsilon$:

$$\mathrm x_t = \sqrt{\bar \alpha_t} \mathrm x_0 + \sqrt{1-\bar \alpha_t} \epsilon, \quad \epsilon \sim \mathcal N(0, \mathbf I)$$

Thus, in essence, DDPM learn the noise $\epsilon$, from which $\mathrm x_0$ can be recovered from $\mathrm x_t$:

$$\mathrm x_0 = \frac{\mathrm x_t - \sqrt{1-\bar \alpha_t}\, \epsilon}{\sqrt{\bar \alpha_t}}$$

With the learned noise predictor, during the sampling process of DDIM we can substitute the predicted noise $\epsilon_\theta(\mathrm x_k, k)$ for $\epsilon$ to obtain an estimate of $\mathrm x_0$:

$$\hat{\mathrm x}_0(\mathrm x_k, k) = \frac{\mathrm x_k - \sqrt{1-\bar \alpha_k}\, \epsilon_\theta(\mathrm x_k, k)}{\sqrt{\bar \alpha_k}}$$
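As a helper function (a sketch under the same assumptions, with `eps_model` again the trained noise predictor):

```python
import numpy as np

def predict_x0(x_k, k, eps_model, alpha_bars):
    """Estimate x_0 from x_k via the predicted noise epsilon_theta(x_k, k)."""
    return (x_k - np.sqrt(1.0 - alpha_bars[k]) * eps_model(x_k, k)) / np.sqrt(alpha_bars[k])
```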

Then we can generalize this to predict $\mathrm x_s$ from $\mathrm x_k$:

$$\begin{aligned} \mathrm x_s &= \sqrt{\bar \alpha_s}\, \hat{\mathrm x}_0 + \sqrt{(1-\bar \alpha_s) - \sigma^2} \cdot \frac{\mathrm x_k - \sqrt{\bar \alpha_k}\, \hat{\mathrm x}_0}{\sqrt{1-\bar \alpha_k}} + \sigma \epsilon, \quad \epsilon \sim \mathcal N(0, \mathbf I)\\ &= \sqrt{\bar \alpha_s} \left(\frac{\mathrm x_k - \sqrt{1-\bar \alpha_k}\, \epsilon_\theta(\mathrm x_k, k)}{\sqrt{\bar \alpha_k}}\right) + \sqrt{(1-\bar \alpha_s) - \sigma^2} \cdot \frac{\mathrm x_k - \sqrt{\bar \alpha_k} \left(\frac{\mathrm x_k - \sqrt{1-\bar \alpha_k}\, \epsilon_\theta(\mathrm x_k, k)}{\sqrt{\bar \alpha_k}}\right)}{\sqrt{1-\bar \alpha_k}} + \sigma \epsilon\\ &= \sqrt{\bar \alpha_s} \left(\frac{\mathrm x_k - \sqrt{1-\bar \alpha_k}\, \epsilon_\theta(\mathrm x_k, k)}{\sqrt{\bar \alpha_k}}\right) + \sqrt{(1-\bar \alpha_s) - \sigma^2} \cdot \sqrt{1-\bar \alpha_k} \cdot \frac{\epsilon_\theta(\mathrm x_k, k)}{\sqrt{1-\bar \alpha_k}} + \sigma \epsilon\\ &= \sqrt{\bar \alpha_s} \left(\frac{\mathrm x_k - \sqrt{1-\bar \alpha_k}\, \epsilon_\theta(\mathrm x_k, k)}{\sqrt{\bar \alpha_k}}\right) + \sqrt{(1-\bar \alpha_s) - \sigma^2} \cdot \epsilon_\theta(\mathrm x_k, k) + \sigma \epsilon \end{aligned}$$

By iteratively applying this process with a reduced number of steps, we can efficiently generate high-quality samples from the diffusion model without the need for additional training.
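Putting the pieces together, a minimal DDIM sampling loop could look like the sketch below. All names are assumptions for illustration: `eps_model` is the trained noise predictor, `alpha_bars` the cumulative schedule, and `timesteps` a decreasing subsequence of step indices chosen by the user.

```python
import numpy as np

def ddim_sample(x_T, timesteps, eps_model, alpha_bars, sigma=0.0, rng=np.random):
    """Iterate the DDIM update x_k -> x_s over a sparse, decreasing list of timesteps.

    timesteps : decreasing step indices, e.g. [999, 899, ..., 99, 0].
    sigma     : per-step stochasticity; 0.0 gives the deterministic sampler.
    """
    x = x_T
    for k, s in zip(timesteps[:-1], timesteps[1:]):
        abar_k, abar_s = alpha_bars[k], alpha_bars[s]
        eps_hat = eps_model(x, k)                                         # predicted noise
        x0_hat = (x - np.sqrt(1.0 - abar_k) * eps_hat) / np.sqrt(abar_k)  # predicted x_0
        x = (np.sqrt(abar_s) * x0_hat
             + np.sqrt(1.0 - abar_s - sigma**2) * eps_hat
             + sigma * rng.standard_normal(x.shape))
    return x
```

With, say, 50 evenly spaced timesteps instead of 1000, this loop calls the network only 50 times while reusing the DDPM-trained $\epsilon_\theta$ unchanged.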

5.  The Choice of σ

When $\sigma = 0$, the sampling process becomes deterministic, meaning that for a given initial noise input, the output will always be the same. This can be beneficial in scenarios where reproducibility is important, or when we want to ensure that the generated samples are consistent.

On the other hand, when $\sigma > 0$, the sampling process introduces stochasticity, allowing for more diverse outputs from the same initial noise input. This can be advantageous in creative applications where variability is desired, such as image generation and other generative tasks.

When $\sigma = \sqrt{\frac{1-\bar \alpha_{t-1}}{1-\bar \alpha_{t}}\beta_t}$, the sampling process of DDIM is equivalent to that of DDPM.

To prove this, we can substitute this value of $\sigma$ into the DDIM sampling equation (taking adjacent steps, i.e., $k = t$ and $s = t-1$):

$$\begin{aligned} \mathrm x_{t-1} &= \sqrt{\bar \alpha_{t-1}} \left(\frac{\mathrm x_t - \sqrt{1-\bar \alpha_t}\, \epsilon_\theta(\mathrm x_t, t)}{\sqrt{\bar \alpha_t}}\right) + \sqrt{(1-\bar \alpha_{t-1}) - \sigma^2} \cdot \epsilon_\theta(\mathrm x_t, t) + \sigma \epsilon \\ &= \frac{\sqrt{\bar \alpha_{t-1}}}{\sqrt{\bar \alpha_t}} \mathrm x_t - \frac{\sqrt{\bar \alpha_{t-1}} \sqrt{1-\bar \alpha_t}}{\sqrt{\bar \alpha_t}} \epsilon_\theta(\mathrm x_t, t) + \sqrt{(1-\bar \alpha_{t-1}) - \frac{1-\bar \alpha_{t-1}}{1-\bar \alpha_{t}}\beta_t} \cdot \epsilon_\theta(\mathrm x_t, t) + \sqrt{\frac{1-\bar \alpha_{t-1}}{1-\bar \alpha_{t}}\beta_t}\, \epsilon \\ &= \frac{1}{\sqrt{\alpha_t}}\left(\mathrm x_t - \frac{1-\alpha_t}{\sqrt{1-\bar \alpha_t}} \epsilon_\theta(\mathrm x_t, t)\right) + \sqrt{\frac{1-\bar \alpha_{t-1}}{1-\bar \alpha_{t}}\beta_t}\, \epsilon \\ &= \tilde \mu_\theta(\mathrm x_t, t) + \sigma_t \epsilon \end{aligned}$$

where $\sigma_t = \sqrt{\frac{1-\bar \alpha_{t-1}}{1-\bar \alpha_{t}}\beta_t}$, so that $\sigma_t^2$ is exactly the posterior variance $\tilde \beta_t = \frac{1-\bar \alpha_{t-1}}{1-\bar \alpha_{t}}\beta_t$ used in DDPM.
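A quick numerical check of this equivalence (a sketch with an illustrative linear schedule; it compares the coefficient of $\epsilon_\theta$ in the DDIM update against the one in the DDPM mean):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
abar = np.cumprod(alphas)

t = 500  # arbitrary interior step
sigma2 = (1.0 - abar[t - 1]) / (1.0 - abar[t]) * betas[t]
# Coefficient of eps_theta in the DDIM step with this sigma ...
ddim_coeff = (-np.sqrt(abar[t - 1]) * np.sqrt(1.0 - abar[t]) / np.sqrt(abar[t])
              + np.sqrt((1.0 - abar[t - 1]) - sigma2))
# ... versus the coefficient of eps_theta in the DDPM posterior mean.
ddpm_coeff = -(1.0 - alphas[t]) / (np.sqrt(alphas[t]) * np.sqrt(1.0 - abar[t]))
assert np.isclose(ddim_coeff, ddpm_coeff)
```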

DDIM introduce a hyperparameter $\eta$ to control the level of stochasticity in the sampling process. The relationship between $\sigma$ and $\eta$ is defined as:

$$\sigma_t = \eta \sqrt{\frac{1-\bar \alpha_{t-1}}{1-\bar \alpha_{t}}\beta_t}$$
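For adjacent timesteps this is a one-liner (a sketch; `betas` and `alpha_bars` are the schedule arrays from the earlier snippets; note that for non-adjacent steps the DDIM paper replaces $\beta_t$ with $1 - \bar\alpha_k/\bar\alpha_s$):

```python
import numpy as np

def ddim_sigma(t, eta, betas, alpha_bars):
    """sigma_t = eta * sqrt((1 - abar_{t-1}) / (1 - abar_t) * beta_t), for adjacent steps."""
    return eta * np.sqrt((1.0 - alpha_bars[t - 1]) / (1.0 - alpha_bars[t]) * betas[t])
```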

When $\eta = 0$, the sampling process is deterministic, while when $\eta = 1$, it becomes equivalent to the stochastic sampling of DDPM. By adjusting $\eta$, users can control the trade-off between sample diversity and fidelity according to their specific needs.

References
1. Denoising Diffusion Probabilistic Models, Jonathan Ho, Ajay Jain, Pieter Abbeel, 2020.
2. Denoising Diffusion Implicit Models, Jiaming Song, Chenlin Meng, Stefano Ermon, CoRR, 2020.

Oct. 28, 2025