The framework of a diffusion model generally goes as follows: we first fix a sequence of times $0 = t_0 < t_1 < \cdots < t_n = T$, together with $g(t)$, $\mu$ and $\sigma$. Given a point $x_0 \in \mathbb{R}^d$ sampled from the data distribution $p$, we add Gaussian noise to the data iteratively by sampling $x_i := X_{t_i}$, where $X_t$ follows the SDE (1) starting at $x_0$. Since the transition probability is Gaussian, we can sample $(x_i)$ efficiently.
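For concreteness, here is a minimal sketch of this forward noising step in Python. Since (1) is pinned down above only through $g(t)$, $\mu$ and $\sigma$, the sketch assumes (1) is an Ornstein–Uhlenbeck process $\mathrm{d}X_t = \theta(\mu - X_t)\,\mathrm{d}t + \sqrt{2\theta}\,\sigma\,\mathrm{d}W_t$, whose stationary law is $N(\mu, \sigma^2 I)$; the names `forward_sample` and `theta` are illustrative, not from the original.

```python
import numpy as np

def forward_sample(x0, ts, mu=0.0, sigma=1.0, theta=1.0, rng=None):
    """Sample x_i := X_{t_i} along the forward process, assuming (1) is
    the OU process dX_t = theta*(mu - X_t) dt + sqrt(2*theta)*sigma dW_t.
    Its transition over a step of length dt is Gaussian:
        X_{t+dt} | X_t = x  ~  N(mu + (x - mu)*exp(-theta*dt),
                                 sigma^2 * (1 - exp(-2*theta*dt)) * I),
    so each x_i can be drawn exactly, with no discretization error.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    xs, t_prev = [], 0.0
    for t in ts:                      # ts = [t_1, ..., t_n]
        decay = np.exp(-theta * (t - t_prev))
        mean = mu + (x - mu) * decay
        std = sigma * np.sqrt(1.0 - decay**2)
        x = mean + std * rng.standard_normal(x.shape)
        xs.append(x)
        t_prev = t
    return xs
```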
Assume $T$ is large enough that $x_n$ is approximately distributed according to the stationary Gaussian distribution. The objective is then to learn the information of $p_t$ so that we can approximate the time-reversal process (3) with initial data $X_0 \sim N(\mu, \sigma^2)$.
Since $g(t)$, $\mu$ and $\sigma$ are known, we only need an estimate of $\nabla_x \log p_t$ to approximate the time-reversal process. The gradient of the log density, $\nabla \log p : \mathbb{R}^d \to \mathbb{R}^d$, is known as the score. The idea of learning the score of a data distribution rather than the distribution itself is usually referred to as score matching.
One major advantage of this idea is that we do not need to normalize our estimate: if we used a neural network $f(\theta; x)$ to approximate a density $p(x)$ directly, we would need to make sure that $\int f(\theta; x)\,\mathrm{d}x = 1$, and computing such an integral can be challenging or even intractable. The score, by contrast, is insensitive to the normalizing constant, since $\nabla_x \log (c\, p(x)) = \nabla_x \log p(x)$ for any constant $c > 0$.
For diffusion models, our training objective will be given by
$$\sum_{i=1}^n \lambda_i\, \mathbb{E}\big[\,\| f(\theta; x_i, t_i) - \nabla_x \log p_{t_i}(x_i) \|^2\,\big], \tag{5}$$
where $(\lambda_i)$ are positive weights summing to 1, determined by the choices of $\sigma$ and $g(t)$.
The expectation on the right-hand side of (5) is particularly nice: we know that $x_i$ conditioned on $x_0$ is normally distributed, with mean and variance given explicitly by (2). Therefore, we can set our loss function to be
$$L(\theta) = \sum_{i=1}^n \lambda_i\, \mathbb{E}\big[\,\| f(\theta; x_i, t_i) - \nabla_{x_i} \log p_{t_i}(x_i \mid x_0) \|^2\,\big],$$
which agrees with (5) up to an additive constant independent of $\theta$.
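Since the conditional score is available in closed form, this loss can be estimated by Monte Carlo. Below is a minimal PyTorch sketch under the same OU assumption as before; `f` is the score network (assumed to take a noisy sample and a time), and `lambdas` are the weights $(\lambda_i)$.

```python
import math
import torch

def dsm_loss(f, x0, ts, lambdas, mu=0.0, sigma=1.0, theta=1.0):
    """Denoising score matching loss (a sketch).

    Conditioned on x0, x_i is Gaussian with explicit mean and standard
    deviation, so grad_x log p_{t_i}(x_i | x0) = -(x_i - mean)/std**2,
    which equals -eps/std when x_i is written as mean + std*eps.
    """
    loss = 0.0
    for t, lam in zip(ts, lambdas):
        decay = math.exp(-theta * t)
        mean = mu + (x0 - mu) * decay
        std = sigma * math.sqrt(1.0 - decay**2)
        eps = torch.randn_like(x0)
        x_t = mean + std * eps
        target = -eps / std                       # conditional score at x_t
        loss = loss + lam * ((f(x_t, t) - target) ** 2).sum(dim=-1).mean()
    return loss
```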
In many practical applications of generative models, instead of simply sampling from the data distribution $p$, we want to generate data with certain properties. For example, in image generation, we want an image that fits a given text description.
Mathematically, this means that we want to learn the conditional distribution $p(\cdot \mid \ell)$, where $\ell$ is an additional input. In this case, we can still use our noise injection and denoising scheme, but the score $\nabla_x \log p_t(x)$ needs to be replaced with the conditional score $\nabla_x \log p_t(x \mid \ell)$.
Bayes' rule implies that
$$\nabla_x \log p_t(x \mid \ell) = \nabla_x \log p_t(x) + \nabla_x \log p_t(\ell \mid x),$$
since $p_t(x \mid \ell) = p_t(\ell \mid x)\, p_t(x) / p(\ell)$ and the factor $p(\ell)$ does not depend on $x$.
Hence, if we already have a good estimate of the score, it suffices to learn the information of $\nabla_x \log p_t(\ell \mid x)$. We refer to this method of learning the conditional score as gradient guidance.
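As a sketch, the second term can be obtained by automatic differentiation through any model of $p_t(\ell \mid x)$. Here `classifier(x, t)` is a hypothetical network returning log-probabilities over labels, `ell` is a batch of label indices, and `scale` is an optional guidance strength (with `scale=1` recovering the exact Bayes' rule identity above).

```python
import torch

def guided_score(score_net, classifier, x, t, ell, scale=1.0):
    """grad_x log p_t(x | ell) = grad_x log p_t(x) + grad_x log p_t(ell | x),
    with the second term computed by differentiating the classifier's
    log-probability of the label ell with respect to the input x."""
    x = x.detach().requires_grad_(True)
    log_probs = classifier(x, t)                          # (batch, n_labels)
    selected = log_probs[torch.arange(x.shape[0]), ell].sum()
    (grad_log_p_ell,) = torch.autograd.grad(selected, x)
    return score_net(x.detach(), t) + scale * grad_log_p_ell
```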
Once we have trained the network $f(\theta; x, t) \approx \nabla_x \log p_t(x)$, we can generate new data by sampling $x_n$ from $N(\mu, \sigma^2)$ and then simulating either the SDE (3) or the probability-flow ODE
$$\mathrm{d}X_t = \Big( b(X_t, t) - \tfrac{1}{2}\, g(t)^2\, \nabla_x \log p_t(X_t) \Big)\,\mathrm{d}t,$$
where $b$ denotes the drift of (1).
The two processes give the same marginal distributions for $X_t$. Empirically, the SDE tends to give samples of better quality, while the ODE has the advantage of faster convergence.
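To close the loop, here is a minimal Euler–Maruyama sketch of simulating the time-reversed SDE, again under the illustrative OU assumption for (1) used above; `score(x, t)` stands for the trained estimate $f(\theta; x, t)$, and `ts` is the full time grid $[t_0, \ldots, t_n]$.

```python
import numpy as np

def reverse_sde_sample(score, ts, mu=0.0, sigma=1.0, theta=1.0, shape=(2,), rng=None):
    """Euler-Maruyama discretization of the time-reversed SDE (a sketch).

    Assuming the forward process (1) is
        dX_t = theta*(mu - X_t) dt + sqrt(2*theta)*sigma dW_t,
    the reverse-time dynamics are
        dX_t = [theta*(mu - X_t) - 2*theta*sigma**2 * score(X_t, t)] dt
               + sqrt(2*theta)*sigma dW_t,
    run from t_n = T down to t_0 = 0.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = mu + sigma * rng.standard_normal(shape)      # x_n ~ N(mu, sigma^2 I)
    g2 = 2.0 * theta * sigma**2                      # g(t)^2, constant here
    for i in range(len(ts) - 1, 0, -1):
        t, dt = ts[i], ts[i] - ts[i - 1]
        drift = theta * (mu - x) - g2 * score(x, t)
        x = x - drift * dt + np.sqrt(g2 * dt) * rng.standard_normal(shape)
    return x
```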