Flow Matching and Diffusion Models

1. The Big Picture

The final goal of flow matching models and diffusion models is to generate some new items, such as images, text, or audio. In this context, “generation” refers to the process of sampling from a distribution.

We use a probabilistic model to represent the distribution of the data, denoted as $p(z)$ , where $z$ is the data point. However, we do not have direct access to this distribution. What we have is a set of samples from this distribution, denoted as $z_1, z_2, \ldots, z_N$ , which we call the training set.

So, the generation task includes two main steps:

Learning the Distribution: We need to learn the distribution $p(z)$ from the training set. This is often done by estimating the parameters of a probabilistic model, such as a neural network.
Sampling from the Distribution: Once we have learned the distribution, we can sample from it to generate new data points. This is often done by using a sampling algorithm, such as Markov Chain Monte Carlo (MCMC) or Langevin dynamics.

Let’s take some specific examples to illustrate the goal of generation.

Example 1: Generating Images
Images: $z\in\mathbb{R}^{H\times W\times C}$ , where $H$ is the height, $W$ is the width, and $C$ is the number of channels (e.g., 3 for RGB images).
Example 2: Generating Text
Text: $z\in\mathbb{R}^{L}$ , where $L$ is the length of the text sequence.
Example 3: Generating Videos
Videos: $z\in\mathbb{R}^{T\times H\times W\times C}$ , where $T$ is the number of time steps, $H$ is the height, $W$ is the width, and $C$ is the number of channels (e.g., 3 for RGB videos).

2. Application of Flow Matching and Diffusion Models

One of the best-known diffusion models is DDPM (Denoising Diffusion Probabilistic Models) [2], a generative model that produces high-quality images by iteratively refining a noisy image.

3. Ordinary Differential Equations (ODEs)

First, we need to understand the mathematical framework behind flow matching and diffusion models. The key idea is to model the evolution of a data point over time using differential equations.

Here are some necessary mathematical concepts:

Trajectory: A trajectory is a path traced by a moving point in space. In the context of flow matching and diffusion models, it refers to the path taken by the data point as it evolves over time, denoted as $X$ .

X: [0, T] \to \mathbb{R}^d

Vector Field: A vector field is a function that assigns a vector to each point in space. In the context of flow matching and diffusion models, it refers to the function that describes the evolution of the data point over time, denoted as $u$ .

u: \mathbb{R}^d \times [0, T] \to \mathbb{R}^d

Ordinary differential equations (ODEs) describe the evolution of the data point over time.

An ODE is typically written in the form:

\left\{ \begin{aligned} \frac{\mathrm d}{\mathrm dt}X(t) &= u_t(X(t)) \\ X(0) &= x_0 \end{aligned} \right.

where $X(t)$ is the data point at time $t$ , $u_t(X(t))$ is the vector field that describes the evolution of the data point, and $x_0$ is the initial condition.

When $u_t$ is continuously differentiable with a bounded derivative (and hence Lipschitz continuous in $x$), the Picard-Lindelöf theorem guarantees the existence and uniqueness of the solution to the ODE. In machine learning practice, we implicitly assume that the vector field $u_t$ satisfies this condition.

Flow: A set of solution trajectories of the ODE. The flow is denoted as $\psi$ , which represents the data point at time $t$ given the initial condition $x_0$ .

\psi: [0, T] \times \mathbb{R}^d \to \mathbb{R}^d

3.1. A Simple Example

Simple vector field:

u_t(x) = -\theta x

Flow can be computed as:

\psi(t, x_0) = \exp (-\theta t) x_0

Proof:

Initial condition: $\psi(0, x_0) = x_0$ .
Differential equation:

\frac{\mathrm d}{\mathrm dt} \psi(t, x_0) = -\theta \exp (-\theta t) x_0 = -\theta \psi(t, x_0)= u_t(\psi(t, x_0))

3.2. Euler Method

\begin{algorithm}
\caption{Euler Method}
\begin{algorithmic}
\REQUIRE Vector field $u_t$, initial condition $x_0$, number of steps $n$
\STATE Set $t = 0$
\STATE Set step size $h = \frac{1}{n}$
\STATE Set $X_0 = x_0$
\FOR{$i = 1, \ldots, n$}
    \STATE $X_{t+h} = X_t + h u_t(X_t)$  \COMMENT{Small step forward}
    \STATE Update $t \gets t + h$
\ENDFOR
\RETURN $X_0, X_h, X_{2h}, \ldots, X_1$ \COMMENT{Return the trajectory}
\end{algorithmic}
\end{algorithm}
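As a minimal sketch (not part of the original post), the Euler method can be written in plain Python and checked against the closed-form flow from Section 3.1. The function name `euler` and the values $\theta = 1$, $x_0 = 1$ are illustrative assumptions. The loop takes $n$ steps of size $h = 1/n$ so that the final time is exactly $1$.

```python
import math

def euler(u, x0, n):
    """Simulate dX/dt = u(t, X) on [0, 1] with n Euler steps; return the trajectory."""
    h = 1.0 / n
    t, x = 0.0, x0
    trajectory = [x]
    for _ in range(n):
        x = x + h * u(t, x)   # small step forward along the vector field
        t += h
        trajectory.append(x)
    return trajectory

theta = 1.0   # illustrative value, not taken from the text
x0 = 1.0      # illustrative initial condition

def u(t, x):
    # Vector field of Section 3.1: u_t(x) = -theta * x
    return -theta * x

exact = math.exp(-theta) * x0   # closed-form flow psi(1, x0)
for n in (10, 100, 1000):
    approx = euler(u, x0, n)[-1]
    print(n, approx, abs(approx - exact))
```

The printed error shrinks as $n$ grows, which is the expected behavior of a first-order method.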

4. Flow Model

Neural Network of Flow Model:

u_t^\theta(x): \mathbb{R}^d\times [0, T] \to \mathbb{R}^d

where $\theta$ are the parameters of the neural network.

Random init:

X_0 \sim p_{\text{init}}

ODE:

\frac{\mathrm d}{\mathrm dt}X_t = u_t^\theta(X_t)

Goal:

X_T \sim p_{\text{data}}

With the $u_t^\theta$ trained, we can use the Euler method to sample from the flow model. The Euler method is a simple numerical method for solving ordinary differential equations (ODEs) by approximating the solution at discrete time steps.

\begin{algorithm}
\caption{Sampling from a Flow Model with Euler method}
\begin{algorithmic}
\Require Neural network vector field $u_{t}^{\theta}$, number of steps $n$
\State Set $t = 0$
\State Set step size $h = \frac{1}{n}$
\State Draw a sample $X_0 \sim p_{\text{init}}$ \Comment{Random initialization!}
\For{$i = 1, \ldots, n$}
    \State $X_{t+h} = X_t + h u_{t}^{\theta}(X_t)$
    \State Update $t \gets t + h$
\EndFor
\State \Return $X_1$ \Comment{Return final point}
\end{algorithmic}
\end{algorithm}

5. Stochastic Differential Equations (SDEs)

A stochastic process is a collection of random variables indexed by time: it describes a random quantity that evolves over time.

Vector Field:

u: \mathbb{R}^d\times [0, T]\to \mathbb{R}^d

Diffusion Coefficient:

\sigma: [0, T] \to \mathbb{R}_{\geq 0}

SDE:

\left\{ \begin{aligned} \mathrm d X_t &= u_t(X_t)\mathrm dt + \sigma_t \mathrm dW_t \\ X_0 &= x_0 \end{aligned} \right.

where $W_t$ is the standard Wiener process (Brownian motion).

Brownian Motion

A stochastic process $W = (W_t)_{0\leq t\leq T}$ is called a Brownian motion if it satisfies the following properties:

$W_0 = 0$ , and the trajectories $t \mapsto W_t$ are continuous.

$W_t$ has independent increments: for any $0 \leq t_1 < t_2 < \ldots < t_n \leq T$ , the increments $W_{t_2} - W_{t_1}, W_{t_3} - W_{t_2}, \ldots, W_{t_n} - W_{t_{n-1}}$ are independent random variables.

Gaussian increments: for any $0 \leq t_1 < t_2 \leq T$ , $W_{t_2} - W_{t_1} \sim \mathcal{N}(0, (t_2 - t_1) I_d)$ .
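These properties also give a recipe for simulating Brownian motion on a time grid: start at $0$ and add independent Gaussian increments of variance $h$. The following is a minimal one-dimensional sketch. The helper name `brownian_path`, the seed, and the grid size are assumptions made for illustration.

```python
import math
import random

def brownian_path(n_steps, rng, T=1.0):
    """Return [W_0, W_h, ..., W_T] for a one-dimensional Brownian motion on a grid."""
    h = T / n_steps
    w = [0.0]                                   # W_0 = 0
    for _ in range(n_steps):
        # independent Gaussian increment with mean 0 and variance h
        w.append(w[-1] + math.sqrt(h) * rng.gauss(0.0, 1.0))
    return w

rng = random.Random(0)                          # fixed seed, an arbitrary choice
paths = [brownian_path(50, rng) for _ in range(4000)]

# Empirical moments of the increment W_1 - W_0.5; theory says mean 0, variance 0.5.
incs = [p[50] - p[25] for p in paths]
mean = sum(incs) / len(incs)
var = sum((d - mean) ** 2 for d in incs) / len(incs)
print(mean, var)
```

The empirical variance should be close to $t_2 - t_1 = 0.5$, up to Monte Carlo error.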

For an ODE, the derivative can be rewritten as a one-step update:

\frac{\mathrm d}{\mathrm dt}X_t = u_t(X_t) \Leftrightarrow X_{t+h} = X_t + h u_t(X_t) + o(h) \quad (\lim_{h\to 0} o(h)/h = 0)

For an SDE, the analogous statement is:

\mathrm dX_t = u_t(X_t)\mathrm dt + \sigma_t \mathrm dW_t \Leftrightarrow X_{t+h} = X_t + h u_t(X_t) + \sigma_t \Delta W_t + o(h)

where the remainder $o(h)$ is now a random variable that vanishes faster than $h$ in the mean-square sense.

We use this update form because the trajectories of $W_t$ are not differentiable, so we cannot describe the evolution of $X_t$ with a derivative. Instead, we use the increment $\Delta W_t = W_{t+h} - W_t$ .

As with ODEs, an existence and uniqueness theorem holds for SDEs under analogous regularity conditions (for example, a vector field $u_t$ that is Lipschitz continuous in $x$ and a continuous diffusion coefficient $\sigma_t$). In machine learning practice, we implicitly assume that these conditions hold.

\begin{algorithm}
\caption{Sampling from a SDE (Euler-Maruyama method)}
\begin{algorithmic}
\Require Vector field $u_t$, number of steps $n$, diffusion coefficient $\sigma_t$
\State Set $t = 0$
\State Set step size $h = \frac{1}{n}$
\State Set $X_0 = x_0$
\For{$i = 1, \ldots, n$}
    \State Draw a sample $\epsilon \sim \mathcal{N}(0, I_d)$
    \State $X_{t+h} = X_t + h u_t(X_t) + \sigma_t \sqrt{h} \epsilon$  \Comment{Add noise with variance $h$, scaled by the diffusion coefficient $\sigma_t$}
    \State Update $t \leftarrow t + h$
\EndFor
\State \textbf{return} $X_0, X_h, X_{2h}, X_{3h}, \ldots, X_1$
\end{algorithmic}
\end{algorithm}

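The factor $\sqrt{h}$ appears because $W_{t+h} - W_t \sim \mathcal{N}(0, h I_d)$ , so the increment has the same distribution as $\sqrt{h}\epsilon$ with $\epsilon \sim \mathcal{N}(0, I_d)$ .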
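A minimal Python sketch of the Euler-Maruyama update, applied to the vector field $u_t(x) = -\theta x$ from the simple example with a constant diffusion coefficient. The function name `euler_maruyama`, the seed, and the parameter values are illustrative assumptions. The closed-form mean and variance used for comparison are the standard moments of this (Ornstein-Uhlenbeck) process.

```python
import math
import random

def euler_maruyama(u, sigma, x0, n, rng):
    """Simulate dX = u(t, X) dt + sigma(t) dW on [0, 1] with n steps; return X_1."""
    h = 1.0 / n
    t, x = 0.0, x0
    for _ in range(n):
        eps = rng.gauss(0.0, 1.0)                        # eps ~ N(0, 1)
        x = x + h * u(t, x) + sigma(t) * math.sqrt(h) * eps
        t += h
    return x

theta, sig, x0 = 1.0, 0.5, 1.0       # illustrative parameter values
rng = random.Random(0)               # fixed seed, an arbitrary choice
samples = [
    euler_maruyama(lambda t, x: -theta * x, lambda t: sig, x0, 200, rng)
    for _ in range(3000)
]

mean = sum(samples) / len(samples)
var = sum((x - mean) ** 2 for x in samples) / len(samples)

# Closed-form moments of X_1 for dX = -theta X dt + sig dW, X_0 = x0
exact_mean = x0 * math.exp(-theta)
exact_var = sig ** 2 / (2 * theta) * (1 - math.exp(-2 * theta))
print(mean, exact_mean)
print(var, exact_var)
```

With $\sigma = 0$ the same function reduces to the deterministic Euler method.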

We can visualize the simple example from Section 3.1 with an added noise term.

[Interactive figure: deterministic solution and stochastic trajectories, with default parameters θ (drift parameter) = 1.0, σ (diffusion coefficient) = 0.5, x₀ (initial condition) = 1.0.]

6. Diffusion Model

Compared to flow models, diffusion models use an SDE instead of an ODE to model the evolution of the data point over time. The key difference is that diffusion models inject noise at each time step, so sampling remains stochastic even after $X_0$ has been drawn. This can make the generated samples more diverse.

\begin{algorithm}
\caption{Sampling from a Diffusion Model (Euler-Maruyama method)}
\begin{algorithmic}
\Require Neural network $u_t^\theta$, number of steps $n$, diffusion coefficient $\sigma_t$
\State Set $t = 0$
\State Set step size $h = \frac{1}{n}$
\State Draw a sample $X_0 \sim p_{\text{init}}$
\For{$i = 1, \ldots, n$}
    \State Draw a sample $\epsilon \sim \mathcal{N}(0, I_d)$
    \State $X_{t+h} = X_t + h u_t^\theta(X_t) + \sigma_t \sqrt{h} \epsilon$
    \State Update $t \leftarrow t + h$
\EndFor
\State \Return $X_1$
\end{algorithmic}
\end{algorithm}

7. Training Target of Flow Matching and Diffusion Models

7.1. Conditional and Marginal Probability Paths

Key terminology:

“Conditional”: Per single data point
“Marginal”: Across distribution of data points

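Dirac Distribution: for $z\in\mathbb{R}^d$ , the Dirac distribution $\delta_z$ places all probability mass on $z$ , so sampling from it always returns $z$ :

x\sim \delta_z \Leftrightarrow x = z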

Conditional Probability Path: $p_t(\cdot|z)$

Example: Conditional Gaussian Probability Path

$p_t(\cdot|z)$ is a distribution over $\mathbb{R}^d$ for each $t\in[0, T]$ and $z\in\mathbb{R}^d$ .

$p_0(\cdot|z) = p_{\text{init}}$ , $p_T(\cdot|z)=\delta_z$

$p_t(\cdot|z) = \mathcal{N}(\alpha_t z, \beta_t^2 \mathbf{I}_d)$
where $\alpha_t, \beta_t$ are called the noise schedule. They have the following properties:

$\alpha_0 = 0, \beta_0 = 1$

$\alpha_T = 1, \beta_T = 0$

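[Interactive figure: conditional Gaussian probability path with a transforming grid map, controlled by the time $t$ . Initial values: $t = 0.00$ , $\alpha_t = 0.00$ , $\beta_t = 1.00$ .]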

Marginal Probability Path: $z\sim p_{\text{data}}, x\sim p_t(\cdot|z) \Rightarrow x\sim p_t(\cdot)$

\left\{ \begin{aligned} p_t(x)& = \int p_t(x|z) p_{\text{data}}(z) \mathrm dz\\ p_0 &= p_{\text{init}}\\ p_T &= p_{\text{data}} \end{aligned} \right.

7.2. Conditional and Marginal Vector Fields

Conditional Vector Field: $u_t^{\text{target}}(x|z)$

X_0\sim p_{\text{init}}, \frac{\mathrm d}{\mathrm dt} X_t = u_t^{\text{target}}(X_t|z)\Rightarrow X_t\sim p_t(\cdot|z)

Example: Conditional Gaussian Vector Field
$u_t^{\text{target}}(x|z) = \left(\dot\alpha_t - \frac{\dot\beta_t}{\beta_t}\alpha_t\right)z+\frac{\dot\beta_t}{\beta_t}x$
where $\dot\alpha_t, \dot\beta_t$ are the time derivatives of $\alpha_t, \beta_t$ .
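One way to see where this formula comes from is the following short derivation sketch. It assumes $p_{\text{init}} = \mathcal{N}(0, I_d)$ , and the conditional flow $\psi_t^{\text{target}}(x_0|z)$ is notation introduced here for illustration. Define a flow that maps $x_0$ to a point distributed according to the conditional path, differentiate it in time, and then express $x_0$ in terms of the current position $x$ :

\begin{aligned} \psi_t^{\text{target}}(x_0|z) &= \alpha_t z + \beta_t x_0, \qquad x_0\sim\mathcal{N}(0, I_d) \Rightarrow \psi_t^{\text{target}}(x_0|z)\sim\mathcal{N}(\alpha_t z, \beta_t^2 I_d) \\ \frac{\mathrm d}{\mathrm dt}\psi_t^{\text{target}}(x_0|z) &= \dot\alpha_t z + \dot\beta_t x_0 \\ x = \alpha_t z + \beta_t x_0 &\Rightarrow x_0 = \frac{x - \alpha_t z}{\beta_t} \\ u_t^{\text{target}}(x|z) &= \dot\alpha_t z + \dot\beta_t\frac{x - \alpha_t z}{\beta_t} = \left(\dot\alpha_t - \frac{\dot\beta_t}{\beta_t}\alpha_t\right) z + \frac{\dot\beta_t}{\beta_t} x \end{aligned}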

Marginal Vector Field:

u_t^{\text{target}}(x) = \int u_t^{\text{target}}(x|z)\frac{p_t(x|z)p_{\text{data}}(z)}{p_t(x)} \mathrm dz

satisfies:

X_0\sim p_{\text{init}}, \frac{\mathrm d}{\mathrm dt} X_t = u_t^{\text{target}}(X_t)\Rightarrow X_t\sim p_t(\cdot)\Rightarrow X_T\sim p_{\text{data}}

Continuity Equation
$\text{Follow probability path: } X_t\sim p_t \Leftrightarrow \frac{\mathrm d}{\mathrm d t}p_t(x) = -\text{div}(p_t u_t)(x)$
The equation indicates that the change in probability density at point $x$ over time equals the net inflow of probability mass under the vector field $u$ .

Proof of the formula of the marginal vector field:

\begin{aligned} \frac{\mathrm d}{\mathrm dt}p_t(x) &= \int \frac{\mathrm d}{\mathrm dt}p_t(x|z)p_{\text{data}}(z) \mathrm dz \\ &= \int -\text{div}(p_t(x|z)u_t^{\text{target}}(x|z))p_{\text{data}}(z) \mathrm dz \\ &= -\text{div}(\int p_t(x|z)u_t^{\text{target}}(x|z)p_{\text{data}}(z) \mathrm dz) \\ &= -\text{div}(p_t(x)\int u_t^{\text{target}}(x|z)\frac{p_t(x|z)p_{\text{data}}(z)}{p_t(x)} \mathrm dz) \\ &= -\text{div}(p_t(x)u_t^{\text{target}}(x)) \end{aligned}

7.3. Conditional and Marginal Score Function

Conditional Score Function: $\nabla_x\log p_t(x|z)$

Example: Conditional Gaussian Score Function
$\nabla_x\log p_t(x|z) = \nabla_x\log \exp\left(-\frac{1}{2\beta_t^2}\|x-\alpha_t z\|^2\right) = -\frac{x-\alpha_t z}{\beta_t^2}$

Marginal Score Function: $\nabla_x\log p_t(x)$

\nabla_x\log p_t(x) = \frac{\nabla_x p_t(x)}{p_t(x)} = \frac{\int \nabla_x p_t(x|z)p_{\text{data}}(z) \mathrm dz}{p_t(x)} = \int \nabla_x\log p_t(x|z)\frac{p_t(x|z)p_{\text{data}}(z)}{p_t(x)} \mathrm dz

7.4. Summary of Key Concepts

Conditional Prob. Path, Vector Field, and Score Function:

Description	Notation	Key property	Gaussian example
Conditional Probability Path	$p_t(\cdot \mid z)$	Interpolates $p_{\text{init}}$ and a data point $z$	$\mathcal{N}(\alpha_t z, \beta_t^2 I_d)$
Conditional Vector Field	$u_t^{\text{target}}(x \mid z)$	ODE follows conditional path	$\left( \dot{\alpha}_t - \frac{\dot{\beta}_t}{\beta_t} \alpha_t \right) z + \frac{\dot{\beta}_t}{\beta_t} x$
Conditional Score Function	$\nabla \log p_t(x \mid z)$	Gradient of log-likelihood	$-\frac{x - \alpha_t z}{\beta_t^2}$

Marginal Prob. Path, Vector Field, Score Function:

Description	Notation	Key property	Formula
Marginal Probability Path	$p_t$	Interpolates $p_{\text{init}}$ and $p_{\text{data}}$	$\int p_t(x \mid z) p_{\text{data}}(z) \, \mathrm dz$
Marginal Vector Field	$u_t^{\text{target}}(x)$	ODE follows marginal path	$\int u_t^{\text{target}}(x \mid z) \frac{p_t(x \mid z) p_{\text{data}}(z)}{p_t(x)} \, \mathrm dz$
Marginal Score Function	$\nabla \log p_t(x)$	Can be used to convert ODE target to SDE	$\int \nabla \log p_t(x \mid z) \frac{p_t(x \mid z) p_{\text{data}}(z)}{p_t(x)} \, \mathrm dz$

This video demonstrates the evolution of distribution from original noise distribution to the target distribution.

7.5. From ODEs to SDEs

For any $\sigma_t\geq 0$ , we can convert the ODEs to SDEs by adding a noise term:

X_0\sim p_{\text{init}}, \mathrm dX_t = [u_t^{\text{target}}(X_t) + \frac{\sigma_t^2}{2} \nabla \log p_t(X_t)] \mathrm dt + \sigma_t \mathrm dW_t

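where $W_t$ is a Wiener process (Brownian motion).

The extra drift term $\frac{\sigma_t^2}{2} \nabla \log p_t(X_t)$ compensates for the injected noise, so that $X_t\sim p_t$ still holds for every $t$ . This can be verified with the Fokker-Planck equation, which describes how the probability density of the solution of an SDE evolves over time.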

Fokker-Planck Equation:
$\frac{\mathrm d p_t(x)}{\mathrm d t} = -\text{div}(p_t u_t)(x) + \frac{\sigma_t^2}{2} \Delta p_t(x)$
The equation indicates that the change in probability density at point $x$ over time is equal to the inflow of probability mass from $u$ plus the heat dispersion term $\frac{\sigma_t^2}{2} \Delta p_t(x)$ .
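To make the connection explicit, here is a short worked check (a sketch that uses only the identity $\nabla p_t = p_t\nabla\log p_t$ ) that the SDE of this section follows the same probability path as the ODE. Inserting the drift $u_t^{\text{target}} + \frac{\sigma_t^2}{2}\nabla\log p_t$ into the right-hand side of the Fokker-Planck equation gives:

\begin{aligned} &-\text{div}\left(p_t\left[u_t^{\text{target}} + \frac{\sigma_t^2}{2}\nabla\log p_t\right]\right)(x) + \frac{\sigma_t^2}{2}\Delta p_t(x) \\ &= -\text{div}(p_t u_t^{\text{target}})(x) - \frac{\sigma_t^2}{2}\text{div}(p_t\nabla\log p_t)(x) + \frac{\sigma_t^2}{2}\Delta p_t(x) \\ &= -\text{div}(p_t u_t^{\text{target}})(x) - \frac{\sigma_t^2}{2}\text{div}(\nabla p_t)(x) + \frac{\sigma_t^2}{2}\Delta p_t(x) \\ &= -\text{div}(p_t u_t^{\text{target}})(x) = \frac{\mathrm d}{\mathrm dt}p_t(x) \end{aligned}

The last line uses $\text{div}(\nabla p_t) = \Delta p_t$ and the continuity equation, so $p_t$ satisfies the Fokker-Planck equation of the SDE for every $\sigma_t\geq 0$ .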

8. Train Flow Matching and Diffusion Models

8.1. Flow Matching

Model: $u_t^\theta(x)$ ( $\theta$ is the parameter of the neural network)

Goal: $u_t^\theta(x) \approx u_t^{\text{target}}(x)$

Flow Matching Loss:

\mathcal{L}_{\text{fm}}(\theta) = \mathbb{E}_{t\sim\text{Unif}(0, T), z\sim p_{\text{data}}, x\sim p_t(\cdot|z)}[\|u_t^\theta(x) - u_t^{\text{target}}(x)\|^2]

This loss is not tractable in practice: at every step we would need to evaluate the marginal target vector field $u_t^{\text{target}}(x)$ , which requires integrating over the whole data distribution, i.e. looping over all data points $z$ . For large datasets, this is computationally too expensive.

Conditional Flow Matching Loss:

\mathcal{L}_{\text{cfm}}(\theta) = \mathbb{E}_{t\sim\text{Unif}(0, T), z\sim p_{\text{data}}, x\sim p_t(\cdot|z)}[\|u_t^\theta(x) - u_t^{\text{target}}(x|z)\|^2]

Minimizing the conditional flow matching loss is equivalent to minimizing the flow matching loss, because:

\begin{aligned} \mathcal{L}_{\text{cfm}}(\theta) &= \mathcal{L}_{\text{fm}}(\theta) + C \\ \end{aligned}

where $C$ is a constant that does not depend on $\theta$ .

Proof: The proof works by expanding the mean-squared error into three parts using the following identity:

|| a - b||^2 = ||a||^2 - 2 a^T b + ||b||^2

\begin{aligned} \mathcal{L}_{\text{fm}}(\theta) &= \mathbb{E}_{t \sim \text{Unif}, x \sim p_t} \left[ \| u_t^{\theta}(x) - u_t^{\text{target}}(x) \|^2 \right]\\ &= \mathbb{E}_{t \sim \text{Unif}, x \sim p_t} \left[ \| u_t^{\theta}(x) \|^2 - 2u_t^{\theta}(x)^T u_t^{\text{target}}(x) + \| u_t^{\text{target}}(x) \|^2 \right] \\ &= \mathbb{E}_{t \sim \text{Unif}, x \sim p_t} \left[ \| u_t^{\theta}(x) \|^2 \right] - 2 \mathbb{E}_{t \sim \text{Unif}, x \sim p_t} \left[ u_t^{\theta}(x)^T u_t^{\text{target}}(x) \right] \\ &\quad + \underbrace{\mathbb{E}_{t \sim \text{Unif}, x \sim p_t} \left[ \| u_t^{\text{target}}(x) \|^2 \right]}_{=: C_1}\\ &= \mathbb{E}_{t \sim \text{Unif}, z \sim p_{\text{data}}, x \sim p_t(\cdot | z)} \left[ \| u_t^{\theta}(x) \|^2 \right] - 2 \mathbb{E}_{t \sim \text{Unif}, x \sim p_t} \left[ u_t^{\theta}(x)^T u_t^{\text{target}}(x) \right] + C_1 \end{aligned}

\begin{aligned} \mathcal{L}_{\text{cfm}}(\theta) &= \mathbb{E}_{t \sim \text{Unif}, z \sim p_{\text{data}}, x \sim p_t(\cdot | z)} \left[ \| u_t^{\theta}(x) - u_t^{\text{target}}(x|z) \|^2 \right] \\ &= \mathbb{E}_{t \sim \text{Unif}, z \sim p_{\text{data}}, x \sim p_t(\cdot | z)} \left[ \| u_t^{\theta}(x) \|^2 - 2 u_t^{\theta}(x)^T u_t^{\text{target}}(x|z) + \| u_t^{\text{target}}(x|z) \|^2 \right] \\ &= \mathbb{E}_{t \sim \text{Unif}, z \sim p_{\text{data}}, x \sim p_t(\cdot | z)} \left[ \| u_t^{\theta}(x) \|^2 \right] - 2 \mathbb{E}_{t \sim \text{Unif}, z \sim p_{\text{data}}, x \sim p_t(\cdot | z)} \left[ u_t^{\theta}(x)^T u_t^{\text{target}}(x|z) \right] \\ &\quad + \underbrace{\mathbb{E}_{t \sim \text{Unif}, z \sim p_{\text{data}}, x \sim p_t(\cdot | z)} \left[ \| u_t^{\text{target}}(x|z) \|^2 \right]}_{=: C_2} \\ &= \mathbb{E}_{t \sim \text{Unif}, z \sim p_{\text{data}}, x \sim p_t(\cdot | z)} \left[ \| u_t^{\theta}(x) \|^2 \right] - 2 \mathbb{E}_{t \sim \text{Unif}, x \sim p_t} \left[ u_t^{\theta}(x)^T u_t^{\text{target}}(x) \right] + C_2 \end{aligned}

The last equality in the expansion of $\mathcal{L}_{\text{cfm}}$ uses the fact that $\mathbb{E}_{t \sim \text{Unif}, x \sim p_t} \left[ u_t^{\theta}(x)^T u_t^{\text{target}}(x) \right]$ and $\mathbb{E}_{t \sim \text{Unif}, z \sim p_{\text{data}}, x \sim p_t(\cdot | z)} \left[ u_t^{\theta}(x)^T u_t^{\text{target}}(x|z) \right]$ are equal, which remains to be proven:

\begin{aligned} \mathbb{E}_{t\sim\text{Unif}, x \sim p_t}\left[u^{\theta}_t(x)^T u^{\text{target}}_t(x)\right] &= \int_0^1 \int p_t(x) u^{\theta}_t(x)^T u^{\text{target}}_t(x) \, dx \, dt \\ &= \int_0^1 \int p_t(x) u^{\theta}_t(x)^T \left[\int u^{\text{target}}_t(x|z) \frac{p_t(x|z) p_{\text{data}}(z)}{p_t(x)} \, dz \right] \, dx \, dt \\ &= \int_0^1 \int \int u^{\theta}_t(x)^T u^{\text{target}}_t(x|z) p_t(x|z) p_{\text{data}}(z) \, dz \, dx \, dt \\ &= \mathbb{E}_{t\sim\text{Unif}, z\sim p_{\text{data}}, x \sim p_t(\cdot|z)}\left[u^{\theta}_t(x)^T u^{\text{target}}_t(x|z)\right] \end{aligned}

So the two losses differ only by $C = C_2 - C_1$ , a constant that does not depend on $\theta$ . Their gradients with respect to $\theta$ therefore coincide, which means that minimizing the conditional flow matching loss is equivalent to minimizing the flow matching loss.

Once $u_t^{\theta}(x)$ is learned, we can generate samples $X_1$ that approximately follow the data distribution by drawing $X_0\sim p_{\text{init}}$ and simulating the flow model forward in time, e.g. with the Euler sampling algorithm from Section 4:

\mathrm dX_t = u_t^{\theta}(X_t) \mathrm dt

Example: Flow Matching for Gaussian Conditional Probability Path
Let us return to the example of Gaussian probability paths $p_t(\cdot|z) = \mathcal{N}(\alpha_t z, \beta_t^2 I_d)$ . We can sample from the conditional path via:
$\epsilon \sim \mathcal{N}(0, I_d)\Rightarrow x_t = \alpha_t z + \beta_t \epsilon \sim \mathcal{N}(\alpha_t z, \beta_t^2 I_d)$
The conditional vector field is:
$u_t^{\text{target}}(x|z) = \left(\dot\alpha_t - \frac{\dot\beta_t}{\beta_t}\alpha_t\right) z + \frac{\dot\beta_t}{\beta_t} x$
Plugging the conditional vector field into the conditional flow matching loss and substituting $x = \alpha_t z + \beta_t \epsilon$ gives:
$\mathcal{L}_{\text{cfm}}(\theta) = \mathbb{E}_{t\sim\text{Unif}(0, T), z\sim p_{\text{data}}, \epsilon\sim\mathcal{N}(0, I_d)}\left[\left\|u_t^\theta(\alpha_t z + \beta_t \epsilon) - (\dot\alpha_tz+\dot\beta_t\epsilon)\right\|^2\right]$
The simplicity of $\mathcal{L}_{\text{cfm}}(\theta)$ : We sample a data point $z$ , sample some noise $\epsilon$ , and then we take a mean squared error.
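The regression target in this loss follows from a direct substitution, spelled out here as a worked step. Evaluating the conditional vector field at $x = \alpha_t z + \beta_t\epsilon$ gives:

\begin{aligned} u_t^{\text{target}}(\alpha_t z + \beta_t\epsilon|z) &= \left(\dot\alpha_t - \frac{\dot\beta_t}{\beta_t}\alpha_t\right) z + \frac{\dot\beta_t}{\beta_t}(\alpha_t z + \beta_t\epsilon) \\ &= \dot\alpha_t z - \frac{\dot\beta_t}{\beta_t}\alpha_t z + \frac{\dot\beta_t}{\beta_t}\alpha_t z + \dot\beta_t\epsilon \\ &= \dot\alpha_t z + \dot\beta_t\epsilon \end{aligned}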

Consider the specific case $\alpha_t = t$ and $\beta_t = 1-t$ . The corresponding probability path $p_t(x|z)=\mathcal{N}(tz, (1-t)^2 I_d)$ is called the CondOT probability path. Since $\dot\alpha_t = 1$ and $\dot\beta_t = -1$ , the conditional flow matching loss becomes:
$\mathcal{L}_{\text{cfm}}(\theta) = \mathbb{E}_{t\sim\text{Unif}(0, T), z\sim p_{\text{data}}, \epsilon\sim\mathcal{N}(0, I_d)}\left[\left\|u_t^\theta(tz + (1-t)\epsilon) - (z-\epsilon)\right\|^2\right]$

\begin{algorithm}
\caption{Flow Matching Training Procedure (here for Gaussian CondOT path $p_t(x|z) = \mathcal{N}(tz, (1-t)^2 I_d)$)}
\begin{algorithmic}
\Require A dataset of samples $z \sim p_{\text{data}}$, neural network $u_t^\theta$
\For {each mini-batch of data}
    \State Sample a data example $z$ from the dataset.
    \State Sample a random time $t \sim \text{Unif}[0,1]$.
    \State Sample noise $\epsilon \sim \mathcal{N}(0, I_d)$ 
    \Comment{(General case: $x \sim p_t(\cdot|z)$)}
    \State Set $x = tz + (1-t)\epsilon$
    \State Compute loss $\mathcal{L}(\theta) = \|u_t^\theta(x) - (z - \epsilon)\|^2$ 
    \Comment{General case: $\mathcal{L}(\theta) = \|u_t^\theta(x) - u_t^{\text{target}}(x|z)\|^2$}
\State Update the model parameters $\theta$ via gradient descent on $\mathcal{L}(\theta)$.

\EndFor

\end{algorithmic}

\end{algorithm}
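The training loop can be sketched in plain Python on a one-dimensional toy problem. Everything specific here is an assumption made for illustration: the toy data distribution $\mathcal{N}(2, 0.5^2)$ , the tiny linear model standing in for the neural network $u_t^\theta$ , the learning rate, and the use of a single sample per step instead of a mini-batch.

```python
import random

rng = random.Random(0)           # fixed seed, an arbitrary choice

def sample_data():
    # Toy stand-in for p_data: z ~ N(2, 0.5^2). Purely illustrative.
    return rng.gauss(2.0, 0.5)

def model(w, x, t):
    # Hypothetical tiny "network": u_theta(x, t) = w0 + w1 * x + w2 * t
    return w[0] + w[1] * x + w[2] * t

def cfm_sample():
    """Draw (x, t, target) for the CondOT path: x = t z + (1 - t) eps, target = z - eps."""
    z = sample_data()
    t = rng.random()             # t ~ Unif[0, 1)
    eps = rng.gauss(0.0, 1.0)    # eps ~ N(0, 1)
    return t * z + (1.0 - t) * eps, t, z - eps

def estimate_loss(w, n=5000):
    # Monte Carlo estimate of the conditional flow matching loss
    total = 0.0
    for _ in range(n):
        x, t, target = cfm_sample()
        total += (model(w, x, t) - target) ** 2
    return total / n

w = [0.0, 0.0, 0.0]
loss_before = estimate_loss(w)
lr = 0.01                        # illustrative learning rate
for step in range(5000):
    x, t, target = cfm_sample()
    err = model(w, x, t) - target
    # Gradient of err^2 with respect to (w0, w1, w2) is 2 * err * (1, x, t).
    grad = (2 * err, 2 * err * x, 2 * err * t)
    w = [wi - lr * gi for wi, gi in zip(w, grad)]
loss_after = estimate_loss(w)
print(loss_before, loss_after)
```

The loss does not go to zero even for a perfect model, because the conditional loss contains the constant offset $C$ relative to the flow matching loss.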

8.2. Score Matching

Extending the approach from ODEs to SDEs leads to score matching. Recall the SDE from Section 7.5:

\mathrm dX_t = [u_t^{\text{target}}(X_t) + \frac{\sigma_t^2}{2}\nabla \log p_t(X_t)] \mathrm dt + \sigma_t \mathrm dW_t

where $u_t^{\text{target}}$ is the marginal vector field, and $\nabla \log p_t(X_t)$ is the marginal score function, represented by the formula:

\nabla \log p_t(x) = \int \nabla \log p_t(x|z) \frac{p_t(x|z)p_{\text{data}}(z)}{p_t(x)} \mathrm dz

We can define the score matching loss as:

\mathcal{L}_{\text{sm}}(\theta) = \mathbb{E}_{t\sim\text{Unif}, z\sim p_{\text{data}}, x\sim p_t(\cdot|z)}\left[\left\|s_t^\theta(x) - \nabla \log p_t(x)\right\|^2\right]

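As with the flow matching loss, the marginal score is intractable, so we replace it in the loss with the conditional score function. The resulting conditional score matching loss differs from the score matching loss only by a constant that does not depend on $\theta$ :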

\mathcal{L}_{\text{csm}}(\theta) = \mathbb{E}_{t\sim\text{Unif}, z\sim p_{\text{data}}, x\sim p_t(\cdot|z)}\left[\left\|s_t^\theta(x) - \nabla \log p_t(x|z)\right\|^2\right]

Proof: It is similar to the flow matching loss, where we can simply replace the target vector field $u_t^{\text{target}}(x)$ with the target score function $\nabla \log p_t(x)$ .

After training the score function $s_t^\theta(x)$ , we can use it to generate samples from the data distribution using arbitrary diffusion coefficients $\sigma_t\geq 0$ , by simulating the SDE:

X_0\sim p_{\text{init}}, \mathrm dX_t = [u_t^\theta(X_t) + \frac{\sigma_t^2}{2}s_t^\theta(X_t)] \mathrm dt + \sigma_t \mathrm dW_t

In theory, every $\sigma_t$ should give samples $X_1\sim p_{\text{data}}$ at perfect training, but in practice, we encounter two issues:

numerical errors from simulating the SDE
training errors (the model $u_t^\theta$ is not exactly equal to $u_t^\text{target}$ )
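The claim that every $\sigma_t$ leads to the same final distribution can be checked numerically on a toy case where the marginal vector field and score are known in closed form, so no training is involved. The sketch below assumes one-dimensional data $p_{\text{data}} = \mathcal{N}(m, s^2)$ , $p_{\text{init}} = \mathcal{N}(0, 1)$ , and the schedule $\alpha_t = t$ , $\beta_t = 1 - t$ , for which $p_t = \mathcal{N}(tm, t^2s^2 + (1-t)^2)$ . All names and parameter values are illustrative.

```python
import math
import random

m, s = 2.0, 0.5      # toy data distribution p_data = N(m, s^2); illustrative values

def mean_var(t):
    # Marginal path for alpha_t = t, beta_t = 1 - t: p_t = N(t m, t^2 s^2 + (1 - t)^2)
    return t * m, (t * s) ** 2 + (1.0 - t) ** 2

def u_target(t, x):
    # Marginal vector field of this Gaussian path (closed form for this toy case)
    mu, v = mean_var(t)
    return m + (t * s ** 2 - (1.0 - t)) / v * (x - mu)

def score(t, x):
    # Marginal score of N(mu, v): gradient of the log-density with respect to x
    mu, v = mean_var(t)
    return -(x - mu) / v

def sample(sigma, n, rng):
    """Euler-Maruyama on dX = [u + sigma^2 / 2 * score] dt + sigma dW, X_0 ~ N(0, 1)."""
    h = 1.0 / n
    t, x = 0.0, rng.gauss(0.0, 1.0)
    for _ in range(n):
        drift = u_target(t, x) + 0.5 * sigma ** 2 * score(t, x)
        x = x + h * drift + sigma * math.sqrt(h) * rng.gauss(0.0, 1.0)
        t += h
    return x

def moments(xs):
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return mean, var

rng = random.Random(0)           # fixed seed, an arbitrary choice
results = {}
for sigma in (0.0, 0.5, 1.0):
    xs = [sample(sigma, 100, rng) for _ in range(2000)]
    results[sigma] = moments(xs)
    print(sigma, results[sigma])  # mean should be near m, variance near s^2
```

For each $\sigma$ the empirical mean and variance of $X_1$ should be close to $m$ and $s^2$ , up to Monte Carlo and discretization error.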

At first sight, it seems that we have to learn both $u_t^\theta$ and $s_t^\theta$ . However, we can often regress $u_t^\theta$ and $s_t^\theta$ directly with a single network that has two outputs. Furthermore, for the special case of the Gaussian probability path, $s_t^\theta$ and $u_t^\theta$ can be converted into one another, so we do not have to train them separately.
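For the Gaussian path, the conversion can be sketched as follows. This short derivation uses only the conditional formulas from Section 7 and assumes $\alpha_t \neq 0$ . Solving the conditional score for $z$ and substituting it into the conditional vector field gives:

\begin{aligned} \nabla\log p_t(x|z) = -\frac{x-\alpha_t z}{\beta_t^2} &\Rightarrow z = \frac{x + \beta_t^2\nabla\log p_t(x|z)}{\alpha_t} \\ u_t^{\text{target}}(x|z) &= \left(\dot\alpha_t - \frac{\dot\beta_t}{\beta_t}\alpha_t\right)\frac{x + \beta_t^2\nabla\log p_t(x|z)}{\alpha_t} + \frac{\dot\beta_t}{\beta_t}x \\ &= \left(\beta_t^2\frac{\dot\alpha_t}{\alpha_t} - \dot\beta_t\beta_t\right)\nabla\log p_t(x|z) + \frac{\dot\alpha_t}{\alpha_t}x \end{aligned}

Both marginal quantities are obtained by averaging their conditional counterparts against the same weights $\frac{p_t(x|z)p_{\text{data}}(z)}{p_t(x)}$ , so the same linear relation holds between $u_t^{\text{target}}(x)$ and $\nabla\log p_t(x)$ .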

Example: Denoising Diffusion Models: Score Matching for Gaussian Probability Paths
The conditional score $\nabla \log p_t(x|z)$ can be expressed as follows:
$\nabla \log p_t(x|z) = -\frac{x-\alpha_t z}{\beta_t^2}$
Plugging this formula into the conditional score matching loss and substituting $x = \alpha_t z + \beta_t \epsilon$ , we get:
$\begin{aligned} \mathcal{L}_{\text{csm}}(\theta) &= \mathbb{E}_{t\sim\text{Unif}, z\sim p_{\text{data}}, x\sim p_t(\cdot|z)}\left[\left\|s_t^\theta(x) + \frac{x-\alpha_t z}{\beta_t^2}\right\|^2\right]\\ &= \mathbb{E}_{t\sim\text{Unif}, z\sim p_{\text{data}}, \epsilon\sim\mathcal{N}(0, I_d)}\left[\left\|s_t^\theta(\alpha_t z + \beta_t \epsilon) + \frac{\beta_t \epsilon}{\beta_t^2}\right\|^2\right]\\ &= \mathbb{E}_{t\sim\text{Unif}, z\sim p_{\text{data}}, \epsilon\sim\mathcal{N}(0, I_d)}\left[\frac{1}{\beta_t^2}\left\|\beta_t s_t^\theta(\alpha_t z + \beta_t \epsilon) + \epsilon\right\|^2\right]\\ \end{aligned}$
The network $s_t^\theta$ essentially learns to predict the (rescaled) noise $\epsilon$ that was used to corrupt the data point $z$ . This is why the procedure is called denoising score matching.

It was soon realized that the above loss is numerically unstable when $\beta_t\to 0$ , because of the factor $\frac{1}{\beta_t^2}$ . Some of the first works on denoising diffusion models (like DDPM [2]) therefore proposed dropping this factor and reparameterizing the score network as a noise predictor:
$-\beta_ts_t^\theta(x) = \epsilon_t^\theta(x) \Rightarrow \mathcal{L}_{\text{DDPM}}(\theta) = \mathbb{E}_{t\sim\text{Unif}, z\sim p_{\text{data}}, \epsilon\sim\mathcal{N}(0, I_d)}\left[\left\|\epsilon_t^\theta(\alpha_t z + \beta_t \epsilon) - \epsilon\right\|^2\right]$
The network $\epsilon_t^\theta$ learns to predict the noise $\epsilon$ that was used to corrupt the data point $z$ . The score is recovered from it as $s_t^\theta(x) = -\epsilon_t^\theta(x)/\beta_t$ .

\begin{algorithm}
\caption{Score Matching Training Procedure for Gaussian probability path}
\begin{algorithmic}
\Require A dataset of samples $z \sim p_{\text{data}}$, score network $s_t^\theta$ or noise predictor $\epsilon_t^\theta$
\For{each mini-batch of data}
    \State Sample a data example $z$ from the dataset.
    \State Sample a random time $t \sim \text{Unif}[0,1]$.
    \State Sample noise $\epsilon \sim \mathcal{N}(0, I_d)$
    \State Set $x_t = \alpha_t z + \beta_t \epsilon$ 
    \Comment{(General case: $x_t \sim p_t(\cdot | z)$)}
    \State Compute loss $\mathcal{L}(\theta) = \left\lVert s_t^\theta(x_t) + \frac{\epsilon}{\beta_t} \right\rVert^2$
    \Comment{General case: $\mathcal{L}(\theta) = \left\lVert s_t^\theta(x_t) - \nabla \log p_t(x_t|z)\right\rVert^2$; alternatively: $\mathcal{L}(\theta) = \left\lVert \epsilon_t^\theta(x_t) - \epsilon \right\rVert^2$}
    \State Update the model parameters $\theta$ via gradient descent on $\mathcal{L}(\theta)$.
\EndFor
\end{algorithmic}
\end{algorithm}
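The relation between the two per-sample losses in the algorithm can be checked numerically. The sketch below uses arbitrary stand-in values for the schedule, the data point, and the network output; the helper names `csm_loss` and `ddpm_loss` are hypothetical. It verifies that the conditional score at $x_t$ equals $-\epsilon/\beta_t$ and that, under $\epsilon_t^\theta = -\beta_t s_t^\theta$ , the noise-prediction loss equals $\beta_t^2$ times the score matching loss.

```python
import random

def csm_loss(s_pred, eps, beta):
    """Per-sample conditional score matching loss || s + eps / beta ||^2."""
    return sum((s + e / beta) ** 2 for s, e in zip(s_pred, eps))

def ddpm_loss(eps_pred, eps):
    """Per-sample noise-prediction loss || eps_pred - eps ||^2."""
    return sum((p - e) ** 2 for p, e in zip(eps_pred, eps))

rng = random.Random(0)                            # fixed seed, an arbitrary choice
d = 4
alpha, beta = 0.7, 0.3                            # arbitrary schedule values at some time t
z = [rng.gauss(0.0, 1.0) for _ in range(d)]       # stand-in data point
eps = [rng.gauss(0.0, 1.0) for _ in range(d)]     # noise
x = [alpha * zi + beta * ei for zi, ei in zip(z, eps)]

# Conditional Gaussian score at x; analytically this equals -eps / beta.
cond_score = [-(xi - alpha * zi) / beta ** 2 for xi, zi in zip(x, z)]

s_pred = [rng.gauss(0.0, 1.0) for _ in range(d)]  # arbitrary stand-in network output
eps_pred = [-beta * si for si in s_pred]          # reparameterization eps_theta = -beta * s_theta

print(csm_loss(s_pred, eps, beta), ddpm_loss(eps_pred, eps) / beta ** 2)
```

The two printed numbers agree up to floating-point error, which shows that the two losses differ only by the time-dependent weight $\beta_t^2$ .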

References

1. An introduction to flow matching and diffusion models, Peter Holderrieth, Ezra Erives, 2025.

2. Denoising diffusion probabilistic models, Jonathan Ho, Ajay Jain, Pieter Abbeel, 2020.

MicDZ's Blog