The final goal of flow matching models and diffusion models is to generate some new items, such as images, text, or audio. In this context, “generation” refers to the process of sampling from a distribution.
We use a probabilistic model to represent the distribution of the data, denoted as p(z), where z is the data point. However, we do not have direct access to this distribution. What we have is a set of samples from this distribution, denoted as z1,z2,…,zN, which we call the training set.
So, the generation task includes two main steps:
Learning the Distribution: We need to learn the distribution p(z) from the training set. This is often done by estimating the parameters of a probabilistic model, such as a neural network.
Sampling from the Distribution: Once we have learned the distribution, we can sample from it to generate new data points. This is often done by using a sampling algorithm, such as Markov Chain Monte Carlo (MCMC) or Langevin dynamics.
Let’s take some specific examples to illustrate the goal of generation.
Example 1: Generating Images
Images: z∈RH×W×C, where H is the height, W is the width, and C is the number of channels (e.g., 3 for RGB images).
Example 2: Generating Text
Text: z∈RL, where L is the length of the text sequence.
Example 3: Generating Videos
Videos: z∈RT×H×W×C, where T is the number of time steps, H is the height, W is the width, and C is the number of channels (e.g., 3 for RGB videos).
Application of Flow Matching and Diffusion Models
The most famous application of flow matching and diffusion models is DDPM (Denoising Diffusion Probabilistic Models)[2]. It is a generative model that can generate high-quality images by iteratively refining a noisy image.
Ordinary Differential Equations (ODEs)
First, we need to understand the mathematical framework behind flow matching and diffusion models. The key idea is to model the evolution of a data point over time using differential equations.
Here are some necessary mathematical concepts:
Trajectory: A trajectory is a path traced by a moving point in space. In the context of flow matching and diffusion models, it refers to the path taken by the data point as it evolves over time, denoted as X.
X:[0,T]→Rd
Vector Field: A vector field is a function that assigns a vector to each point in space. In the context of flow matching and diffusion models, it refers to the function that describes the evolution of the data point over time, denoted as u.
u:Rd×[0,T]→Rd
Ordinary differential equations (ODEs) describe the evolution of the data point over time.
An ODE is typically written in the form:
⎩⎪⎨⎪⎧dtdX(t)X(0)=ut(X(t))=x0
where X(t) is the data point at time t, ut(X(t)) is the vector field that describes the evolution of the data point, and x0 is the initial condition.
When ut is continuously differentiable, we can use the Picard-Lindelöf theorem to guarantee the existence and uniqueness of the solution to the ODE. In the practice of machine learning, we implicitly assume that the vector field ut is continuously differentiable.
Flow: A set of solution trajectories of the ODE. The flow is denoted as ψ, which represents the data point at time t given the initial condition x0.
\begin{algorithm}
\caption{Euler Method}
\begin{algorithmic}
\REQUIRE Vector field $u_t$, initial condition $x_0$, number of steps $n$
\STATE Set $t = 0$
\STATE Set step size $h = \frac{1}{n}$
\STATE Set $X_0 = x_0$
\FOR{$i = 1, \ldots, n-1$}
\STATE $X_{t+h} = X_t + h u_t(X_t)$ \COMMENT{Small step forward}
\STATE Update $t \gets t + h$
\ENDFOR
\RETURN $X_0, X_h, X_{2h}, \ldots, X_1$ \COMMENT{Return the trajectory}
\end{algorithmic}
\end{algorithm}
Flow Model
Neural Network of Flow Model:
utθ(x):Rd×[0,T]→Rd
where θ are the parameters of the neural network.
Random init:
X0∼pinit
ODE:
dtdXt=utθ(Xt)
Goal:
XT∼pdata
With the utθ trained, we can use the Euler method to sample from the flow model. The Euler method is a simple numerical method for solving ordinary differential equations (ODEs) by approximating the solution at discrete time steps.
\begin{algorithm}
\caption{Sampling from a Flow Model with Euler method}
\begin{algorithmic}
\Require Neural network vector field $u_{t}^{\theta}$, number of steps $n$
\State Set $t = 0$
\State Set step size $h = \frac{1}{n}$
\State Draw a sample $X_0 \sim p_{\text{init}}$ \Comment{Random initialization!}
\For{$i = 1, \ldots, n-1$}
\State $X_{t+h} = X_t + h u_{t}^{\theta}(X_t)$
\State Update $t \gets t + h$
\EndFor
\State \Return $X_1$ \Comment{Return final point}
\end{algorithmic}
\end{algorithm}
Stochastic Differential Equations (SDEs)
In stochastic process, we talk about random variables that evolve over time. A stochastic process is a collection of random variables indexed by time.
Vector Field:
u:Rd×[0,T]→Rd
Diffusion Coefficient:
σ:[0,T]→Rd
SDE:
{dXtX0=ut(Xt)dt+σtdWt=x0
where Wt is the standard Wiener process (Brownian motion).
Brownian Motion
Stochastic process Wt is called Brownian motion if it satisfies the following properties:
W0=0.
Wt has independent increments: for any 0≤t1<t2<…<tn≤T, the increments Wt2−Wt1,Wt3−Wt2,…,Wtn−Wtn−1 are independent random variables.
Gaussian increments: for any 0≤t1<t2≤T, Wt2−Wt1∼N(0,t2−t1).
Reason of doing this equivalence is that the random variable Wt is not differentiable, so we can not use the derivative to describe the evolution of the random variable. Instead, we use the increment ΔWt=Wt+h−Wt to describe the evolution of the random variable.
Same as ODEs, SDEs have the Picard-Lindelöf theorem to guarantee the existence and uniqueness of the solution to the SDE. In the practice of machine learning, we implicitly assume that the vector field ut and diffusion coefficient σt are continuously differentiable.
\begin{algorithm}
\caption{Sampling from a SDE (Euler-Maruyama method)}
\begin{algorithmic}
\Require Vector field $u_t$, number of steps $n$, diffusion coefficient $\sigma_t$
\State Set $t = 0$
\State Set step size $h = \frac{1}{n}$
\State Set $X_0 = x_0$
\For{$i = 1, \ldots, n - 1$}
\State Draw a sample $\epsilon \sim \mathcal{N}(0, I_d)$
\State $X_{t+h} = X_t + h u_t(X_t) + \sigma_t \sqrt{h} \epsilon$ \COMMENT{Add additional noise with var=h scaled by diffusion coefficient $\sigma_t$}
\State Update $t \leftarrow t + h$
\EndFor
\State \textbf{return} $X_0, X_h, X_{2h}, X_{3h}, \ldots, X_1$
\end{algorithmic}
\end{algorithm}
Compared to flow models, diffusion models use SDE to model the evolution of the data point over time. The key difference is that diffusion models add noise to the data point at each time step, which allows them to generate more diverse samples.
\begin{algorithm}
\caption{Sampling from a Diffusion Model (Euler-Maruyama method)}
\begin{algorithmic}
\Require Neural network $u_t^\theta$, number of steps $n$, diffusion coefficient $\sigma_t$
\State Set $t = 0$
\State Set step size $h = \frac{1}{n}$
\State Draw a sample $X_0 \sim p_{\text{init}}$
\For{$i = 1, \ldots, n - 1$}
\State Draw a sample $\epsilon \sim \mathcal{N}(0, I_d)$
\State $X_{t+h} = X_t + h u_t^\theta(X_t) + \sigma_t \sqrt{h} \epsilon$
\State Update $t \leftarrow t + h$
\EndFor
\State \Return $X_1$
\end{algorithmic}
\end{algorithm}
Training Target of Flow Matching and Diffusion Models
Conditional and Marginal Probability Paths
Key terminology:
“Conditional”: Per single data point
“Marginal”: Across distribution of data points
Dirac Distribution:
z∈Rd→Rd
x∼δz⇔x=z
Conditional Probability Path: pt(⋅∣z)
Example: Conditional Gaussian Probability Path
pt(⋅∣z) is a distribution over Rd for each t∈[0,T] and z∈Rd.
p0(⋅∣z)=pinit, pt(⋅∣z)=δz
pt(⋅∣z)=N(αtz,βt2Id)
where αt,βt are called the noise schedule. They have the following properties:
α0=0,β0=1
αT=1,βT=0
Time:
Marginal Probability Path: z∼data,x∼t(⋅∣z)⇒x∼t(⋅)
⎩⎪⎪⎨⎪⎪⎧pt(x)p0=∫pt(x∣z)pdata(z)dz=pinit
Conditional and Marginal Vector Fields
Conditional Vector Field: uttarget(x∣z)
X0∼pinit,dtdXt=uttarget(Xt∣z)⇒XT∼pt(⋅∣z)
Example: Conditional Gaussian Vector Field
uttarget(x∣z)=(α˙t−βtβ˙t)z+βtβ˙tx
where α˙t,β˙t are the time derivatives of αt,βt.
This equation can be interpreted as a Fokker-Planck equation that describes the evolution of the probability density function pt(x) over time.
Fokker-Planck Equation:
dtdpt(x)=−div(ptut)(x)+2σt2Δpt(x)
The equation indicates that the change in probability density at point x over time is equal to the inflow of probability mass from u plus the heat dispersion term 2σt2Δpt(x).
Train Flow Matching and Diffusion Models
Flow Matching
Model: utθ(x) (θ is the parameter of the neural network)
But this is not tractable, because for each step, we need to calculate the target vector field uttarget(x), which requires looping over all data points z∼pdata. For large datasets, this is computationally expensive.
So the two losses only has a C=C1−C2 gap, which is a constant that does not depend on θ, which means minimizing the conditional flow matching loss is equivalent to minimizing the flow matching loss.
Once utθ(x) is learned, we can use it to generate samples X1 from the data distribution by reversing the flow, via e.g., algorithm 1.
dXt=−utθ(Xt)dt+σtdWt
Example: Flow Matching for Gaussian Conditional Probability Path
Let us return to the example of Gaussian probability paths pt(⋅∣z)=N(αtz,βt2Id), we may sample from the conditional path via:
ϵ∼N(0,Id)⇒xt=αtz+βtϵ∼N(αtz,βt2Id)
The conditional vector field is:
uttarget(x∣z)=(α˙t−βtβ˙tαt)z+βtβ˙tx
Plug in the conditional vector field into the flow matching loss:
The simplicity of Lcfm(θ): We sample a data point z, sample some noise ϵ, and then we take a mean squared error.
Considering a specific case where αt=t and βt=1−t, the corresponding probability path pt(x∣z)=N(tz,(1−t)2Id) is called CondOT Probability Path. So that conditional flow matching loss becomes:
Proof: It is similar to the flow matching loss, where we can simply replace the target vector field uttarget(x) with the target score function ∇logpt(x).
After training the score function stθ(x), we can use it to generate samples from the data distribution using arbitrary diffusion coefficients σt≥0, by simulating the SDE:
In theory, every σt should give samples X1∼pdata at perfect training, but in practice, we encounter two issues:
numerical erros by simulating the SDE
training errors (the model utθ is not exactly equal to uttarget)
At first sight, it seems that we have to learn both utθ and stθ, however, we can often directly regress utθ and stθ in a single network with two outputs. Further as we will see now for the special case of the Gaussian probability path, stθ and utθ may be converted into one another so that we don’t have to train them separately.
Example: Denoising Diffusion Models: Score Matching for Gaussian Probability Paths
The conditional score ∇logpt(x∣z) can be expressed as follows:
∇logpt(x∣z)=−βt2x−αtz
Pluging this formula into the conditional score matching loss, we get:
The network stθ essentially learns to predict the noise ϵ that used to corrupt the data point z. So this process is called denoising score matching.
It was soon realize that the above loss is numerically unstable when βt→0. In some of the first works on denoising diffusion models(like DDPM[2]), it was therefore proposed to use the following loss instead:
The network ϵtθ learns to predict the noise ϵ that used to corrupt the data point z, but it is scaled by −βt.
\begin{algorithm}
\caption{Score Matching Training Procedure for Gaussian probability path}
\begin{algorithmic}
\Require A dataset of samples $z \sim p_{\text{data}}$, score network $s_t^\theta$ or noise predictor $\epsilon_t^\theta$
\For{each mini-batch of data}
\State Sample a data example $z$ from the dataset.
\State Sample a random time $t \sim \text{Unif}[0,1]$.
\State Sample noise $\epsilon \sim \mathcal{N}(0, I_d)$
\State Set $x_t = \alpha_t z + \beta_t \epsilon$
\Comment{(General case: $x_t \sim p_t(\cdot | z)$)}
\State Compute loss $\mathcal{L}(\theta) = \left\lVert s_t^\theta(x_t) + \frac{\epsilon}{\beta_t} \right\rVert^2$ (Generally: = $\left\lVert s_t^\theta(x_t) - \nabla \log p_t(x_t|z)\right\rVert^2$ Alternatively: $\mathcal{L}(\theta) = \left\lVert \epsilon_t^\theta(x_t) - \epsilon \right\rVert^2$)
\State Update the model parameters $\theta$ via gradient descent on $\mathcal{L}(\theta)$.
\EndFor
\end{algorithmic}
\end{algorithm}