MicDZ's Blog

Flow Matching and Diffusion Models

Flow Matching and Diffusion Models


The Big Picture

The final goal of flow matching models and diffusion models is to generate some new items, such as images, text, or audio. In this context, “generation” refers to the process of sampling from a distribution.

We use a probabilistic model to represent the distribution of the data, denoted as p(z)p(z), where zz is the data point. However, we do not have direct access to this distribution. What we have is a set of samples from this distribution, denoted as z1,z2,,zNz_1, z_2, \ldots, z_N, which we call the training set.

So, the generation task includes two main steps:

  1. Learning the Distribution: We need to learn the distribution p(z)p(z) from the training set. This is often done by estimating the parameters of a probabilistic model, such as a neural network.
  2. Sampling from the Distribution: Once we have learned the distribution, we can sample from it to generate new data points. This is often done by using a sampling algorithm, such as Markov Chain Monte Carlo (MCMC) or Langevin dynamics.

Let’s take some specific examples to illustrate the goal of generation.

  • Example 1: Generating Images
    Images: zRH×W×Cz\in\mathbb{R}^{H\times W\times C}, where HH is the height, WW is the width, and CC is the number of channels (e.g., 3 for RGB images).
  • Example 2: Generating Text
    Text: zRLz\in\mathbb{R}^{L}, where LL is the length of the text sequence.
  • Example 3: Generating Videos
    Videos: zRT×H×W×Cz\in\mathbb{R}^{T\times H\times W\times C}, where TT is the number of time steps, HH is the height, WW is the width, and CC is the number of channels (e.g., 3 for RGB videos).

Application of Flow Matching and Diffusion Models

The most famous application of flow matching and diffusion models is DDPM (Denoising Diffusion Probabilistic Models)[2]. It is a generative model that can generate high-quality images by iteratively refining a noisy image.

Ordinary Differential Equations (ODEs)

First, we need to understand the mathematical framework behind flow matching and diffusion models. The key idea is to model the evolution of a data point over time using differential equations.

Here are some necessary mathematical concepts:

Trajectory: A trajectory is a path traced by a moving point in space. In the context of flow matching and diffusion models, it refers to the path taken by the data point as it evolves over time, denoted as XX.

X:[0,T]RdX: [0, T] \to \mathbb{R}^d

Vector Field: A vector field is a function that assigns a vector to each point in space. In the context of flow matching and diffusion models, it refers to the function that describes the evolution of the data point over time, denoted as uu.

u:Rd×[0,T]Rdu: \mathbb{R}^d \times [0, T] \to \mathbb{R}^d

Ordinary differential equations (ODEs) describe the evolution of the data point over time.

An ODE is typically written in the form:

{ddtX(t)=ut(X(t))X(0)=x0\left\{ \begin{aligned} \frac{\mathrm d}{\mathrm dt}X(t) &= u_t(X(t)) \\ X(0) &= x_0 \end{aligned} \right.

where X(t)X(t) is the data point at time tt, ut(X(t))u_t(X(t)) is the vector field that describes the evolution of the data point, and x0x_0 is the initial condition.

When utu_t is continuously differentiable, we can use the Picard-Lindelöf theorem to guarantee the existence and uniqueness of the solution to the ODE. In the practice of machine learning, we implicitly assume that the vector field utu_t is continuously differentiable.

Flow: A set of solution trajectories of the ODE. The flow is denoted as ψ\psi, which represents the data point at time tt given the initial condition x0x_0.

ψ:[0,T]×RdRd\psi: [0, T] \times \mathbb{R}^d \to \mathbb{R}^d

A Simple Example

Simple vector field:

ut(x)=θxu_t(x) = -\theta x

Flow can be computed as:

ψ(t,x0)=exp(θt)x0\psi(t, x_0) = \exp (-\theta t) x_0

Proof:

  1. Initial condition: ψ(0,x0)=x0\psi(0, x_0) = x_0.
  2. Differential equation:

ddtψ(t,x0)=θexp(θt)x0=θψ(t,x0)=ut(ψ(t,x0))\frac{\mathrm d}{\mathrm dt} \psi(t, x_0) = -\theta \exp (-\theta t) x_0 = -\theta \psi(t, x_0)= u_t(\psi(t, x_0))

Euler Method

\begin{algorithm}
\caption{Euler Method}
\begin{algorithmic}
\REQUIRE Vector field $u_t$, initial condition $x_0$, number of steps $n$
\STATE Set $t = 0$
\STATE Set step size $h = \frac{1}{n}$
\STATE Set $X_0 = x_0$
\FOR{$i = 1, \ldots, n-1$}
    \STATE $X_{t+h} = X_t + h u_t(X_t)$  \COMMENT{Small step forward}
    \STATE Update $t \gets t + h$
\ENDFOR
\RETURN $X_0, X_h, X_{2h}, \ldots, X_1$ \COMMENT{Return the trajectory}
\end{algorithmic}
\end{algorithm}

Flow Model

Neural Network of Flow Model:

utθ(x):Rd×[0,T]Rdu_t^\theta(x): \mathbb{R}^d\times [0, T] \to \mathbb{R}^d

where θ\theta are the parameters of the neural network.

Random init:

X0pinitX_0 \sim p_{\text{init}}

ODE:

ddtXt=utθ(Xt)\frac{\mathrm d}{\mathrm dt}X_t = u_t^\theta(X_t)

Goal:

XTpdataX_T \sim p_{\text{data}}

With the utθu_t^\theta trained, we can use the Euler method to sample from the flow model. The Euler method is a simple numerical method for solving ordinary differential equations (ODEs) by approximating the solution at discrete time steps.

\begin{algorithm}
\caption{Sampling from a Flow Model with Euler method}
\begin{algorithmic}
\Require Neural network vector field $u_{t}^{\theta}$, number of steps $n$
\State Set $t = 0$
\State Set step size $h = \frac{1}{n}$
\State Draw a sample $X_0 \sim p_{\text{init}}$ \Comment{Random initialization!}
\For{$i = 1, \ldots, n-1$}
    \State $X_{t+h} = X_t + h u_{t}^{\theta}(X_t)$
    \State Update $t \gets t + h$
\EndFor
\State \Return $X_1$ \Comment{Return final point}
\end{algorithmic}
\end{algorithm}

Stochastic Differential Equations (SDEs)

In stochastic process, we talk about random variables that evolve over time. A stochastic process is a collection of random variables indexed by time.

Vector Field:

u:Rd×[0,T]Rdu: \mathbb{R}^d\times [0, T]\to \mathbb{R}^d

Diffusion Coefficient:

σ:[0,T]Rd\sigma: [0, T] \to \mathbb{R}^d

SDE:

{dXt=ut(Xt)dt+σtdWtX0=x0\left\{ \begin{aligned} \mathrm d X_t &= u_t(X_t)\mathrm dt + \sigma_t \mathrm dW_t \\ X_0 &= x_0 \end{aligned} \right.

where WtW_t is the standard Wiener process (Brownian motion).

Brownian Motion

Stochastic process WtW_t is called Brownian motion if it satisfies the following properties:

  1. W0=0W_0 = 0.
  2. WtW_t has independent increments: for any 0t1<t2<<tnT0 \leq t_1 < t_2 < \ldots < t_n \leq T, the increments Wt2Wt1,Wt3Wt2,,WtnWtn1W_{t_2} - W_{t_1}, W_{t_3} - W_{t_2}, \ldots, W_{t_n} - W_{t_{n-1}} are independent random variables.
  3. Gaussian increments: for any 0t1<t2T0 \leq t_1 < t_2 \leq T, Wt2Wt1N(0,t2t1)W_{t_2} - W_{t_1} \sim \mathcal{N}(0, t_2 - t_1).

When in ODE, we can do following equivalence:

ddtXt=ut(Xt)Xt+h=Xt+hut(Xt)+o(h)(limh0o(h)=0)\frac{\mathrm d}{\mathrm dt}X_t = u_t(X_t) \Leftrightarrow X_{t+h} = X_t + h u_t(X_t) + o(h) (\lim_{h\to 0} o(h) = 0)

When in SDE:

dXt=ut(Xt)dt+σtdWtXt+h=Xt+hut(Xt)+σtΔWt+o(h)(limh0o(h)=0)\mathrm dX_t = u_t(X_t)\mathrm dt + \sigma_t \mathrm dW_t \Leftrightarrow X_{t+h} = X_t + h u_t(X_t) + \sigma_t \Delta W_t + o(h) (\lim_{h\to 0} o(h) = 0)

Reason of doing this equivalence is that the random variable WtW_t is not differentiable, so we can not use the derivative to describe the evolution of the random variable. Instead, we use the increment ΔWt=Wt+hWt\Delta W_t = W_{t+h} - W_t to describe the evolution of the random variable.

Same as ODEs, SDEs have the Picard-Lindelöf theorem to guarantee the existence and uniqueness of the solution to the SDE. In the practice of machine learning, we implicitly assume that the vector field utu_t and diffusion coefficient σt\sigma_t are continuously differentiable.

\begin{algorithm}
\caption{Sampling from a SDE (Euler-Maruyama method)}
\begin{algorithmic}
\Require Vector field $u_t$, number of steps $n$, diffusion coefficient $\sigma_t$
\State Set $t = 0$
\State Set step size $h = \frac{1}{n}$
\State Set $X_0 = x_0$
\For{$i = 1, \ldots, n - 1$}
    \State Draw a sample $\epsilon \sim \mathcal{N}(0, I_d)$
    \State $X_{t+h} = X_t + h u_t(X_t) + \sigma_t \sqrt{h} \epsilon$  \COMMENT{Add additional noise with var=h scaled by diffusion coefficient $\sigma_t$}
    \State Update $t \leftarrow t + h$
\EndFor
\State \textbf{return} $X_0, X_h, X_{2h}, X_{3h}, \ldots, X_1$
\end{algorithmic}
\end{algorithm}

Using h\sqrt{h} is because (Wt+hWt)N(0,h)(W_{t+h} - W_t) \sim \mathcal{N}(0, h).

We can visualize the simple example in 2.1.

Deterministic solution

Stochastic trajectories

Diffusion Model

Compared to flow models, diffusion models use SDE to model the evolution of the data point over time. The key difference is that diffusion models add noise to the data point at each time step, which allows them to generate more diverse samples.

\begin{algorithm}
\caption{Sampling from a Diffusion Model (Euler-Maruyama method)}
\begin{algorithmic}
\Require Neural network $u_t^\theta$, number of steps $n$, diffusion coefficient $\sigma_t$
\State Set $t = 0$
\State Set step size $h = \frac{1}{n}$
\State Draw a sample $X_0 \sim p_{\text{init}}$
\For{$i = 1, \ldots, n - 1$}
    \State Draw a sample $\epsilon \sim \mathcal{N}(0, I_d)$
    \State $X_{t+h} = X_t + h u_t^\theta(X_t) + \sigma_t \sqrt{h} \epsilon$
    \State Update $t \leftarrow t + h$
\EndFor
\State \Return $X_1$
\end{algorithmic}
\end{algorithm}

Training Target of Flow Matching and Diffusion Models

Conditional and Marginal Probability Paths

Key terminology:

  • “Conditional”: Per single data point
  • “Marginal”: Across distribution of data points

Dirac Distribution:

zRdRdz\in\mathbb{R}^d\to\mathbb{R}^d

xδzx=zx\sim \delta_z \Leftrightarrow x = z

  • Conditional Probability Path: pt(z)p_t(\cdot|z)

Example: Conditional Gaussian Probability Path

  1. pt(z)p_t(\cdot|z) is a distribution over Rd\mathbb{R}^d for each t[0,T]t\in[0, T] and zRdz\in\mathbb{R}^d.
  2. p0(z)=pinitp_0(\cdot|z) = p_{\text{init}}, pt(z)=δzp_t(\cdot|z)=\delta_z

pt(z)=N(αtz,βt2Id)p_t(\cdot|z) = \mathcal{N}(\alpha_t z, \beta_t^2 \mathbf{I}_d)

where αt,βt\alpha_t, \beta_t are called the noise schedule. They have the following properties:

  1. α0=0,β0=1\alpha_0 = 0, \beta_0 = 1
  2. αT=1,βT=0\alpha_T = 1, \beta_T = 0

Time:

  • Marginal Probability Path: zdata,xt(z)xt()z\sim _{\text{data}}, x\sim _t(\cdot|z) \Rightarrow x\sim _t(\cdot)

{pt(x)=pt(xz)pdata(z)dzp0=pinit\left\{ \begin{aligned} p_t(x)& = \int p_t(x|z) p_{\text{data}}(z) \mathrm dz\\ p_0 &= p_{\text{init}} \end{aligned} \right.

Conditional and Marginal Vector Fields

Conditional Vector Field: uttarget(xz)u_t^{\text{target}}(x|z)

X0pinit,ddtXt=uttarget(Xtz)XTpt(z)X_0\sim p_{\text{init}}, \frac{\mathrm d}{\mathrm dt} X_t = u_t^{\text{target}}(X_t|z)\Rightarrow X_T\sim p_t(\cdot|z)

Example: Conditional Gaussian Vector Field

uttarget(xz)=(α˙tβ˙tβt)z+β˙tβtxu_t^{\text{target}}(x|z) = (\dot\alpha_t - \frac{\dot\beta_t}{\beta_t})z+\frac{\dot\beta_t}{\beta_t}x

where α˙t,β˙t\dot\alpha_t, \dot\beta_t are the time derivatives of αt,βt\alpha_t, \beta_t.

Marginal Vector Field:

uttarget(x)=uttarget(xz)pt(xz)pdata(z)pt(x)dzu_t^{\text{target}}(x) = \int u_t^{\text{target}}(x|z)\frac{p_t(x|z)p_{\text{data}}(z)}{p_t(x)} \mathrm dz

satisfies:

X0pinit,ddtXt=uttarget(Xt)XtPt()XTpdataX_0\sim p_{\text{init}}, \frac{\mathrm d}{\mathrm dt} X_t = u_t^{\text{target}}(X_t)\Rightarrow X_t\sim P_t(\cdot)\Rightarrow X_T\sim p_{\text{data}}

Continuity Equation

Follow probability path:Xtptddtpt(x)=div(ptut)(x)\text{Follow probability path:} X_t\sim p_t \Leftrightarrow \frac{\mathrm d}{\mathrm d t}p_t(x) = -\text{div}(p_t u_t)(x)

The equation iondicates that the change in probability density at point xx over time is equal to the inflow of probablity mass from uu.

Proof of the formula of the marginal vector field:

ddtpt(x)=ddtpt(xz)pdata(z)dz=div(pt(xz)uttarget(xz))pdata(z)dz=div(pt(xz)uttarget(xz)pdata(z)dz)=div(pt(x)uttarget(x))=div(pt(x)uttarget(xz)pt(xz)pdata(z)pt(x)dz)=div(pt(x)uttarget(x))\begin{aligned} \frac{\mathrm d}{\mathrm dt}p_t(x) &= \int \frac{\mathrm d}{\mathrm dt}p_t(x|z)p_{\text{data}}(z) \mathrm dz \\ &= \int -\text{div}(p_t(x|z)u_t^{\text{target}}(x|z))p_{\text{data}}(z) \mathrm dz \\ &= -\text{div}(\int p_t(x|z)u_t^{\text{target}}(x|z)p_{\text{data}}(z) \mathrm dz) \\ &= -\text{div}(p_t(x)u_t^{\text{target}}(x)) \\ &= -\text{div}(p_t(x)\int u_t^{\text{target}}(x|z)\frac{p_t(x|z)p_{\text{data}}(z)}{p_t(x)} \mathrm dz) \\ &= -\text{div}(p_t(x)u_t^{\text{target}}(x)) \end{aligned}

Conditional and Marginal Score Function

Conditional Score Function: xlogpt(xz)\nabla_x\log p_t(x|z)

Example: Conditional Gaussian Score Function

xlogpt(xz)=xlogexp(12βt2(xαtz)2)=xαtzβt2\nabla_x\log p_t(x|z) = \nabla_x\log \exp\left(-\frac{1}{2\beta_t^2}(x-\alpha_t z)^2\right) = -\frac{x-\alpha_t z}{\beta_t^2}

Marginal Score Function: xlogpt(x)\nabla_x\log p_t(x)

xlogpt(x)=xpt(x)pt(x)=xpt(xz)pdata(z)dzpt(x)=xlogpt(xz)pt(xz)pdata(z)pt(x)dz\nabla_x\log p_t(x) = \frac{\nabla_x p_t(x)}{p_t(x)} = \frac{\int \nabla_x p_t(x|z)p_{\text{data}}(z) \mathrm dz}{p_t(x)} = \int \nabla_x\log p_t(x|z)\frac{p_t(x|z)p_{\text{data}}(z)}{p_t(x)} \mathrm dz

Summary of Key Concepts

Conditional Prob. Path, Vector Field, and Score Function:

Notation Key property Gaussian example
Conditional Probability Path pt(z)p_t(\cdot|z) Interpolates pinitp_{\text{init}} and a data point zz N(αtz,βt2Id)\mathcal{N}(\alpha_t z, \beta_t^2 I_d)
Conditional Vector Field uttarget(xz)u_t^{\text{target}}(x|z) ODE follows conditional path (α˙tβ˙tβtαt)z+β˙tβtx\left( \dot{\alpha}_t - \frac{\dot{\beta}_t}{\beta_t} \alpha_t \right) z + \frac{\dot{\beta}_t}{\beta_t} x
Conditional Score Function logpt(xz)\nabla \log p_t(x|z) Gradient of log-likelihood xαtzβt2\frac{x - \alpha_t z}{\beta_t^2}

Marginal Prob. Path, Vector Field, Score Function:

Description Notation Key property Formula
Marginal Probability Path ptp_t Interpolates pinitp_{\text{init}} and pdatap_{\text{data}} pt(xz)pdata(z)dz\int p_t(x \mid z) p_{\text{data}}(z) \, \mathrm dz
Marginal Vector Field uttarget(x)u_t^{\text{target}}(x) ODE follows marginal path uttarget(xz)pt(xz)pdata(z)pt(x)dz\int u_t^{\text{target}}(x \mid z) \frac{p_t(x \mid z) p_{\text{data}}(z)}{p_t(x)} \, \mathrm dz
Marginal Score Function logpt(x)\nabla \log p_t(x) Can be used to convert ODE target to SDE logpt(xz)pt(xz)pdata(z)pt(x)dz\int \nabla \log p_t(x \mid z) \frac{p_t(x \mid z) p_{\text{data}}(z)}{p_t(x)} \, \mathrm dz

This video demonstrates the evolution of distribution from original noise distribution to the target distribution.

From ODEs to SDEs

For any σt0\sigma_t\geq 0, we can convert the ODEs to SDEs by adding a noise term:

X0pinit,dXt=[uttarget(Xt)+σt22logpt(Xt)]dt+σtdWtX_0\sim p_{\text{init}}, \mathrm dX_t = [u_t^{\text{target}}(X_t) + \frac{\sigma_t^2}{2} \nabla \log p_t(X_t)] \mathrm dt + \sigma_t \mathrm dW_t

where dWtdW_t is a Wiener process.

This equation can be interpreted as a Fokker-Planck equation that describes the evolution of the probability density function pt(x)p_t(x) over time.

Fokker-Planck Equation:

dpt(x)dt=div(ptut)(x)+σt22Δpt(x)\frac{\mathrm d p_t(x)}{\mathrm d t} = -\text{div}(p_t u_t)(x) + \frac{\sigma_t^2}{2} \Delta p_t(x)

The equation indicates that the change in probability density at point xx over time is equal to the inflow of probability mass from uu plus the heat dispersion term σt22Δpt(x)\frac{\sigma_t^2}{2} \Delta p_t(x).

Train Flow Matching and Diffusion Models

Flow Matching

Model: utθ(x)u_t^\theta(x) (θ\theta is the parameter of the neural network)

Goal: utθ(x)uttarget(x)u_t^\theta(x) \approx u_t^{\text{target}}(x)

Flow Matching Loss:

Lfm(θ)=EtUnif(0,T),zpdata,xpt(z)[utθ(x)uttarget(x)2]\mathcal{L}_{\text{fm}}(\theta) = \mathbb{E}_{t\sim\text{Unif}(0, T), z\sim p_{\text{data}}, x\sim p_t(\cdot|z)}[\|u_t^\theta(x) - u_t^{\text{target}}(x)\|^2]

But this is not tractable, because for each step, we need to calculate the target vector field uttarget(x)u_t^{\text{target}}(x), which requires looping over all data points zpdataz\sim p_{\text{data}}. For large datasets, this is computationally expensive.

Conditional Flow Matching Loss:

Lcfm(θ)=EtUnif(0,T),zpdata,xpt(z)[utθ(x)uttarget(xz)2]\mathcal{L}_{\text{cfm}}(\theta) = \mathbb{E}_{t\sim\text{Unif}(0, T), z\sim p_{\text{data}}, x\sim p_t(\cdot|z)}[\|u_t^\theta(x) - u_t^{\text{target}}(x|z)\|^2]

Minimizing the conditional flow matching loss is surrogate to minimizing the flow matching loss, because:

Lcfm(θ)=Lfm(θ)+C\begin{aligned} \mathcal{L}_{\text{cfm}}(\theta) &= \mathcal{L}_{\text{fm}}(\theta) + C \\ \end{aligned}

where CC is a constant that does not depend on θ\theta.

Proof: The proof works by expanding the mean-squared error into three parts using the following identity:

ab2=a22aTb+b2|| a - b||^2 = ||a||^2 - 2 a^T b + ||b||^2

Lfm(θ)=EtUnif,xpt[utθ(x)uttarget(x)2]=EtUnif,xpt[utθ(x)22utθ(x)Tuttarget(x)+uttarget(x)2]=EtUnif,xpt[utθ(x)2]2EtUnif,xpt[utθ(x)Tuttarget(x)]+EtUnif,xpt[uttarget(x)2]=:C1=EtUnif,zpdata,xpt(z)[utθ(x)2]2EtUnif,xpt[utθ(x)Tuttarget(x)]+C1\begin{aligned} \mathcal{L}_{\text{fm}}(\theta) &= \mathbb{E}_{t \sim \text{Unif}, x \sim p_t} \left[ \| u_t^{\theta}(x) - u_t^{\text{target}}(x) \|^2 \right]\\ &= \mathbb{E}_{t \sim \text{Unif}, x \sim p_t} \left[ \| u_t^{\theta}(x) \|^2 - 2u_t^{\theta}(x)^T u_t^{\text{target}}(x) + \| u_t^{\text{target}}(x) \|^2 \right] \\ &= \mathbb{E}_{t \sim \text{Unif}, x \sim p_t} \left[ \| u_t^{\theta}(x) \|^2 \right] - 2 \mathbb{E}_{t \sim \text{Unif}, x \sim p_t} \left[ u_t^{\theta}(x)^T u_t^{\text{target}}(x) \right] \\ &\quad + \underbrace{\mathbb{E}_{t \sim \text{Unif}, x \sim p_t} \left[ \| u_t^{\text{target}}(x) \|^2 \right]}_{=: C_1}\\ &= \mathbb{E}_{t \sim \text{Unif}, z \sim p_{\text{data}}, x \sim p_t(\cdot | z)} \left[ \| u_t^{\theta}(x) \|^2 \right] - 2 \mathbb{E}_{t \sim \text{Unif}, x \sim p_t} \left[ u_t^{\theta}(x)^T u_t^{\text{target}}(x) \right] + C_1 \end{aligned}

Lcfm(θ)=EtUnif,zpdata,xpt(z)[utθ(x)uttarget(xz)2]=EtUnif,zpdata,xpt(z)[utθ(x)22utθ(x)Tuttarget(xz)+uttarget(xz)2]=EtUnif,zpdata,xpt(z)[utθ(x)2]2EtUnif,zpdata,xpt(z)[utθ(x)Tuttarget(xz)]+EtUnif,zpdata,xpt(z)[uttarget(xz)2]=:C2=EtUnif,zpdata,xpt(z)[utθ(x)2]2EtUnif,xpt[utθ(x)Tuttarget(x)]+C2\begin{aligned} \mathcal{L}_{\text{cfm}}(\theta) &= \mathbb{E}_{t \sim \text{Unif}, z \sim p_{\text{data}}, x \sim p_t(\cdot | z)} \left[ \| u_t^{\theta}(x) - u_t^{\text{target}}(x|z) \|^2 \right] \\ &= \mathbb{E}_{t \sim \text{Unif}, z \sim p_{\text{data}}, x \sim p_t(\cdot | z)} \left[ \| u_t^{\theta}(x) \|^2 - 2 u_t^{\theta}(x)^T u_t^{\text{target}}(x|z) + \| u_t^{\text{target}}(x|z) \|^2 \right] \\ &= \mathbb{E}_{t \sim \text{Unif}, z \sim p_{\text{data}}, x \sim p_t(\cdot | z)} \left[ \| u_t^{\theta}(x) \|^2 \right] - 2 \mathbb{E}_{t \sim \text{Unif}, z \sim p_{\text{data}}, x \sim p_t(\cdot | z)} \left[ u_t^{\theta}(x)^T u_t^{\text{target}}(x|z) \right] \\ &\quad + \underbrace{\mathbb{E}_{t \sim \text{Unif}, z \sim p_{\text{data}}, x \sim p_t(\cdot | z)} \left[ \| u_t^{\text{target}}(x|z) \|^2 \right]}_{=: C_2} \\ &= \mathbb{E}_{t \sim \text{Unif}, z \sim p_{\text{data}}, x \sim p_t(\cdot | z)} \left[ \| u_t^{\theta}(x) \|^2 \right] - 2 \mathbb{E}_{t \sim \text{Unif}, x \sim p_t} \left[ u_t^{\theta}(x)^T u_t^{\text{target}}(x) \right] + C_2 \end{aligned}

What we need to prove is the relationship between EtUnif,xpt[utθ(x)Tuttarget(x)]\mathbb{E}_{t \sim \text{Unif}, x \sim p_t} \left[ u_t^{\theta}(x)^T u_t^{\text{target}}(x) \right] and EtUnif,zpdata,xpt(z)[utθ(x)Tuttarget(xz)]\mathbb{E}_{t \sim \text{Unif}, z \sim p_{\text{data}}, x \sim p_t(\cdot | z)} \left[ u_t^{\theta}(x)^T u_t^{\text{target}}(x|z) \right].

EtUnif,xpt[utθ(x)Tuttarget(x)]=01pt(x)utθ(x)Tuttarget(x)dxdt=01pt(x)utθ(x)T[uttarget(xz)pt(xz)pdata(z)pt(x)dz]dxdt=01utθ(x)Tuttarget(xz)pt(xz)pdata(z)dzdxdt=EtUnif,zpdata,xpt(z)[utθ(x)Tuttarget(xz)]\begin{aligned} \mathbb{E}_{t\sim\text{Unif}, x \sim p_t}\left[u^{\theta}_t(x)^T u^{\text{target}}_t(x)\right] &= \int_0^1 \int p_t(x) u^{\theta}_t(x)^T u^{\text{target}}_t(x) \, dx \, dt \\ &= \int_0^1 \int p_t(x) u^{\theta}_t(x)^T \left[\int u^{\text{target}}_t(x|z) \frac{p_t(x|z) p_{\text{data}}(z)}{p_t(x)} \, dz \right] \, dx \, dt \\ &= \int_0^1 \int \int u^{\theta}_t(x)^T u^{\text{target}}_t(x|z) p_t(x|z) p_{\text{data}}(z) \, dz \, dx \, dt \\ &= \mathbb{E}_{t\sim\text{Unif}, z\sim p_{\text{data}}, x \sim p_t(\cdot|z)}\left[u^{\theta}_t(x)^T u^{\text{target}}_t(x|z)\right] \end{aligned}

So the two losses only has a C=C1C2C = C_1 - C_2 gap, which is a constant that does not depend on θ\theta, which means minimizing the conditional flow matching loss is equivalent to minimizing the flow matching loss.

Once utθ(x)u_t^{\theta}(x) is learned, we can use it to generate samples X1X_1 from the data distribution by reversing the flow, via e.g., algorithm 1.

dXt=utθ(Xt)dt+σtdWt\mathrm dX_t = -u_t^{\theta}(X_t) \mathrm dt + \sigma_t \mathrm dW_t

Example: Flow Matching for Gaussian Conditional Probability Path
Let us return to the example of Gaussian probability paths pt(z)=N(αtz,βt2Id)p_t(\cdot|z) = \mathcal{N}(\alpha_t z, \beta_t^2 I_d), we may sample from the conditional path via:

ϵN(0,Id)xt=αtz+βtϵN(αtz,βt2Id)\epsilon \sim \mathcal{N}(0, I_d)\Rightarrow x_t = \alpha_t z + \beta_t \epsilon \sim \mathcal{N}(\alpha_t z, \beta_t^2 I_d)

The conditional vector field is:

uttarget(xz)=(α˙tβ˙tβtαt)z+β˙tβtxu_t^{\text{target}}(x|z) = \left(\dot\alpha_t - \frac{\dot\beta_t}{\beta_t}\alpha_t\right) z + \frac{\dot\beta_t}{\beta_t} x

Plug in the conditional vector field into the flow matching loss:

Lcfm(θ)=EtUnif(0,T),zpdata,xpt(z)[utθ(αtz+βtϵ)(α˙tz+β˙tϵ)2]\mathcal{L}_{\text{cfm}}(\theta) = \mathbb{E}_{t\sim\text{Unif}(0, T), z\sim p_{\text{data}}, x\sim p_t(\cdot|z)}\left[\left\|u_t^\theta(\alpha_t z + \beta_t \epsilon) - (\dot\alpha_tz+\dot\beta_t\epsilon)\right\|^2\right]

The simplicity of Lcfm(θ)\mathcal{L}_{\text{cfm}}(\theta): We sample a data point zz, sample some noise ϵ\epsilon, and then we take a mean squared error.

Considering a specific case where αt=t\alpha_t = t and βt=1t\beta_t = 1-t, the corresponding probability path pt(xz)=N(tz,(1t)2Id)p_t(x|z)=\mathcal{N}(tz, (1-t)^2 I_d) is called CondOT Probability Path. So that conditional flow matching loss becomes:

Lcfm(θ)=EtUnif(0,T),zpdata,ϵN(0,Id)[utθ(tz+(1t)ϵ)(zϵ)2]\mathcal{L}_{\text{cfm}}(\theta) = \mathbb{E}_{t\sim\text{Unif}(0, T), z\sim p_{\text{data}}, \epsilon\sim\mathcal{N}(0, I_d)}\left[\left\|u_t^\theta(tz + (1-t)\epsilon) - (z-\epsilon)\right\|^2\right]

\begin{algorithm}
\caption{Flow Matching Training Procedure (here for Gaussian CondOT path $p_t(x|z) = \mathcal{N}(tz, (1-t)^2)$)}
\begin{algorithmic}
\Require A dataset of samples $z \sim p_{\text{data}}$, neural network $u_t^\theta$
\For {each mini-batch of data}
    \State Sample a data example $z$ from the dataset.
    \State Sample a random time $t \sim \text{Unif}[0,1]$.
    \State Sample noise $\epsilon \sim \mathcal{N}(0, I_d)$ 
    \Comment{(General case: $x \sim p_t(\cdot|z)$)}
    \State Set $x = tz + (1-t)\epsilon$
    \State Compute loss $\mathcal{L}(\theta) = \|u_t^\theta(x) - (z - \epsilon)\|^2$ 
    \Comment{General case: $\mathcal{L}(\theta) = \|u_t^\theta(x) - u_t^{\text{target}}(x|z)\|^2$}
\State Update the model parameters $\theta$ via gradient descent on $\mathcal{L}(\theta)$.

\EndFor
\end{algorithmic}
\end{algorithm}

Score Matching

Extending the algorithm from ODEs to SDEs, we got the score matching.

dXt=[uttarget(Xt)+σt22logpt(Xt)]dt+σtdWt\mathrm dX_t = [u_t^{\text{target}}(X_t) + \frac{\sigma_t^2}{2}\nabla \log p_t(X_t)] \mathrm dt + \sigma_t \mathrm dW_t

where uttargetu_t^{\text{target}} is the marginal vector field, and logpt(Xt)\nabla \log p_t(X_t) is the marginal score function, represented by the formula:

logpt(x)=logpt(xz)pt(Xtz)pdata(z)pt(x)dz\nabla \log p_t(x) = \int \nabla \log p_t(x|z) \frac{p_t(X_t|z)p_{\text{data}}(z)}{p_t(x)} \mathrm dz

We can define the score matching loss as:

Lsm(θ)=EtUnif,zpdata,xpt(z)[stθ(x)logpt(x)2]\mathcal{L}_{\text{sm}}(\theta) = \mathbb{E}_{t\sim\text{Unif}, z\sim p_{\text{data}}, x\sim p_t(\cdot|z)}\left[\left\|s_t^\theta(x) - \nabla \log p_t(x)\right\|^2\right]

Similar to the flow matching loss, we can use the conditional score function to approximate the marginal score function:

Lcsm(θ)=EtUnif,zpdata,xpt(z)[stθ(x)logpt(xz)2]\mathcal{L}_{\text{csm}}(\theta) = \mathbb{E}_{t\sim\text{Unif}, z\sim p_{\text{data}}, x\sim p_t(\cdot|z)}\left[\left\|s_t^\theta(x) - \nabla \log p_t(x|z)\right\|^2\right]

Proof: It is similar to the flow matching loss, where we can simply replace the target vector field uttarget(x)u_t^{\text{target}}(x) with the target score function logpt(x)\nabla \log p_t(x).

After training the score function stθ(x)s_t^\theta(x), we can use it to generate samples from the data distribution using arbitrary diffusion coefficients σt0\sigma_t\geq 0, by simulating the SDE:

X0pinit,dXt=[utθ(Xt)+σt22stθ(Xt)]dt+σtdWtX_0\sim p_{\text{init}}, \mathrm dX_t = [u_t^\theta(X_t) + \frac{\sigma_t^2}{2}s_t^\theta(X_t)] \mathrm dt + \sigma_t \mathrm dW_t

In theory, every σt\sigma_t should give samples X1pdataX_1\sim p_{\text{data}} at perfect training, but in practice, we encounter two issues:

  1. numerical erros by simulating the SDE
  2. training errors (the model utθu_t^\theta is not exactly equal to uttargetu_t^\text{target})

At first sight, it seems that we have to learn both utθu_t^\theta and stθs_t^\theta, however, we can often directly regress utθu_t^\theta and stθs_t^\theta in a single network with two outputs. Further as we will see now for the special case of the Gaussian probability path, stθs_t^\theta and utθu_t^\theta may be converted into one another so that we don’t have to train them separately.

Example: Denoising Diffusion Models: Score Matching for Gaussian Probability Paths
The conditional score logpt(xz)\nabla \log p_t(x|z) can be expressed as follows:

logpt(xz)=xαtzβt2\nabla \log p_t(x|z) = -\frac{x-\alpha_t z}{\beta_t^2}

Pluging this formula into the conditional score matching loss, we get:

Lcsm(θ)=EtUnif,zpdata,xpt(z)[stθ(x)+xαtzβt22]=EtUnif,zpdata,xpt(z)[stθ(αtz+βtϵ)+βtϵβt22]=EtUnif,zpdata,ϵN(0,Id)[1βt2stθ(αtz+βtϵ)+ϵ2]\begin{aligned} \mathcal{L}_{\text{csm}}(\theta) &= \mathbb{E}_{t\sim\text{Unif}, z\sim p_{\text{data}}, x\sim p_t(\cdot|z)}\left[\left\|s_t^\theta(x) + \frac{x-\alpha_t z}{\beta_t^2}\right\|^2\right]\\ &= \mathbb{E}_{t\sim\text{Unif}, z\sim p_{\text{data}}, x\sim p_t(\cdot|z)}\left[\left\|s_t^\theta(\alpha_t z + \beta_t \epsilon) + \frac{\beta_t \epsilon}{\beta_t^2}\right\|^2\right]\\ &= \mathbb{E}_{t\sim\text{Unif}, z\sim p_{\text{data}}, \epsilon\sim\mathcal{N}(0, I_d)}\left[\frac{1}{\beta_t^2}\left\|s_t^\theta(\alpha_t z + \beta_t \epsilon) + \epsilon\right\|^2\right]\\ \end{aligned}

The network stθs_t^\theta essentially learns to predict the noise ϵ\epsilon that used to corrupt the data point zz. So this process is called denoising score matching.

It was soon realize that the above loss is numerically unstable when βt0\beta_t\to 0. In some of the first works on denoising diffusion models(like DDPM[2]), it was therefore proposed to use the following loss instead:

βtstθ(x)=ϵtθ(x)LDDPM(θ)=EtUnif,zpdata,xpt(z)[ϵtθ(αtz+βtϵ)ϵ2]-\beta_ts_t^\theta(x) = \epsilon_t^\theta(x) \Rightarrow \mathcal{L}_{\text{DDPM}}(\theta) = \mathbb{E}_{t\sim\text{Unif}, z\sim p_{\text{data}}, x\sim p_t(\cdot|z)}\left[\left\|\epsilon_t^\theta(\alpha_t z + \beta_t \epsilon) - \epsilon\right\|^2\right]

The network ϵtθ\epsilon_t^\theta learns to predict the noise ϵ\epsilon that used to corrupt the data point zz, but it is scaled by βt-\beta_t.

\begin{algorithm}
\caption{Score Matching Training Procedure for Gaussian probability path}
\begin{algorithmic}
\Require A dataset of samples $z \sim p_{\text{data}}$, score network $s_t^\theta$ or noise predictor $\epsilon_t^\theta$
\For{each mini-batch of data}
    \State Sample a data example $z$ from the dataset.
    \State Sample a random time $t \sim \text{Unif}[0,1]$.
    \State Sample noise $\epsilon \sim \mathcal{N}(0, I_d)$
    \State Set $x_t = \alpha_t z + \beta_t \epsilon$ 
    \Comment{(General case: $x_t \sim p_t(\cdot | z)$)}
    \State Compute loss $\mathcal{L}(\theta) = \left\lVert s_t^\theta(x_t) + \frac{\epsilon}{\beta_t} \right\rVert^2$ (Generally: = $\left\lVert s_t^\theta(x_t) - \nabla \log p_t(x_t|z)\right\rVert^2$ Alternatively: $\mathcal{L}(\theta) = \left\lVert \epsilon_t^\theta(x_t) - \epsilon \right\rVert^2$)
    \State Update the model parameters $\theta$ via gradient descent on $\mathcal{L}(\theta)$.
\EndFor
\end{algorithmic}
\end{algorithm}
References
, , , .

, , — Aug. 7, 2025