
“Understanding Diffusion Models: A Unified Perspective” + “Variational Diffusion” from Kingma

The easiest way to think of a Variational Diffusion Model (VDM) is simply as a Markovian Hierarchical Variational Autoencoder with three key restrictions:
 The latent dimension is exactly equal to the data dimension
 This means that $q(z_{t+1} \mid z_t) = q(x_{t+1} \mid x_t)$. Everything stays in $x$ space.
 The structure of the latent encoder at each timestep is not learned
 It is predefined as a linear Gaussian model. In other words, it is a Gaussian distribution centered around the output of the previous timestep
 The Gaussian parameters of the latent encoders vary over time in such a way that the distribution of the latent at final timestep $T$ is a standard Gaussian
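These restrictions can be sketched concretely. The following toy forward process uses the standard DDPM linear-Gaussian parameterization; the specific schedule values are illustrative assumptions, not prescribed by the notes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical variance schedule; the notes only require that the marginal
# at the final timestep T approaches a standard Gaussian.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def encode_step(x_prev, t):
    """One predefined (not learned) linear-Gaussian encoder step, using the
    standard parameterization q(x_t | x_{t-1}) = N(sqrt(alpha_t) x_{t-1}, beta_t I)."""
    return np.sqrt(alphas[t]) * x_prev + np.sqrt(betas[t]) * rng.standard_normal(x_prev.shape)

x = rng.standard_normal(16)      # every latent has the same dimension as the data
for t in range(T):
    x = encode_step(x, t)

# Var(x_T) = alpha_bar_T * Var(x_0) + (1 - alpha_bar_T) -> 1 as alpha_bar_T -> 0,
# so x_T is approximately standard Gaussian.
```

The `sqrt(alpha_t)` shrinkage is what makes $\bar{\alpha}_T \to 0$, enforcing the third restriction.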

Diagram
ELBO derivation
The variational lower bound loss is derived from $E_{q(x_{1:T} \mid x_0)}\left[\log \frac{p(x_{0:T})}{q(x_{1:T} \mid x_0)}\right]$, where we take advantage of the fact that both $p(x_{0:T})$ and $q(x_{1:T} \mid x_0)$ are Markovian, and thus the log of the product decomposes easily into incremental steps.
 We obtain $\log p(x) \geq E_{q(x_1 \mid x_0)}[\log p_\theta(x_0 \mid x_1)] - \sum_{t=2}^{T} E_{q(x_t \mid x_0)}[D_{KL}(q(x_{t-1} \mid x_t, x_0) \,\|\, p_\theta(x_{t-1} \mid x_t))] - D_{KL}(q(x_T \mid x_0) \,\|\, p(x_T))$
 Reconstruction term
 where the first term is the reconstruction term, like its analogue in the ELBO of the vanilla VAE.
 This term can be approximated and optimized using a Monte Carlo estimate
 Denoising matching terms
 the $T-1$ terms (for $t = 2, \ldots, T$) are denoising matching terms.
 We learn the desired denoising transition step $p_\theta(x_{t-1} \mid x_t)$ as an approximation to the tractable, ground-truth denoising transition step $q(x_{t-1} \mid x_t, x_0)$ (whose closed form is derived in The diffusion process)
 Detail: when originally deriving the ELBO, you get “consistency terms”
 i.e. $E_{q(x_{t-1}, x_{t+1} \mid x_0)}[D_{KL}(q(x_t \mid x_{t-1}) \,\|\, p_\theta(x_t \mid x_{t+1}))]$, where a denoising step from a noisier image should match the corresponding noising step from a cleaner image.
 However, actually optimizing the ELBO using the terms we just derived might be suboptimal; because the consistency term is computed as an expectation over two random variables ${x_{t−1},x_{t+1}}$ for every timestep, the variance of its Monte Carlo estimate could potentially be higher than a term that is estimated using only one random variable per timestep.
 The trick to obtain denoising matching terms is to rewrite the encoder transition as $q(x_t \mid x_{t-1}) = q(x_t \mid x_{t-1}, x_0)$, where the extra conditioning is superfluous due to the Markov property, and then use Bayes rule to rewrite each transition as $q(x_t \mid x_{t-1}, x_0) = \frac{q(x_{t-1} \mid x_t, x_0)\, q(x_t \mid x_0)}{q(x_{t-1} \mid x_0)}$
 Prior matching term
 The last term enforces an isotropic Gaussian at the end of the diffusion; it has no trainable parameters, so it is never optimized.
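To see why the Bayes-rule rewrite removes the extra random variable, note that the marginal ratios it introduces telescope across timesteps. A sketch of the key step:

```latex
% Substituting q(x_t | x_{t-1}) = q(x_{t-1} | x_t, x_0) q(x_t | x_0) / q(x_{t-1} | x_0)
% into the ELBO, the marginal ratios telescope:
\prod_{t=2}^{T} \frac{q(x_t \mid x_0)}{q(x_{t-1} \mid x_0)}
  = \frac{q(x_T \mid x_0)}{q(x_1 \mid x_0)}
% The two boundary factors are absorbed into the reconstruction and prior
% matching terms, leaving one KL per timestep whose outer expectation is over
% a single x_t ~ q(x_t | x_0) -- hence the lower-variance Monte Carlo estimate.
```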
Optimizing in practice
Maximizing ELBO, equivalent to denoising (when variance schedule is fixed)

Remember that using Bayes theorem, one can calculate the posterior $q(x_{t-1} \mid x_t, x_0)$ in terms of $\tilde{\beta}_t$ and $\tilde{\mu}(x_t, x_0)$, which are defined as follows:
 $\tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \beta_t$ (posterior variance schedule)
 $\tilde{\mu}(x_t, x_0) = \frac{\sqrt{\bar{\alpha}_{t-1}}\, \beta_t}{1 - \bar{\alpha}_t} x_0 + \frac{\sqrt{\alpha_t}\, (1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} x_t$
 $q(x_{t-1} \mid x_t, x_0) = \mathcal{N}(x_{t-1};\, \tilde{\mu}(x_t, x_0),\, \tilde{\beta}_t I)$
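These closed forms can be sanity-checked numerically: viewing $q(x_t \mid x_{t-1})$ as a Gaussian likelihood in $x_{t-1}$ and combining it with the prior $q(x_{t-1} \mid x_0)$ via the standard precision-weighted product rule should reproduce $\tilde{\beta}_t$ and $\tilde{\mu}$. A minimal scalar check, with an assumed schedule and arbitrary values:

```python
import numpy as np

# Hypothetical linear schedule; the check holds for any valid schedule.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

t = 500                          # arbitrary interior timestep
a_t, b_t = alphas[t], betas[t]
abar_t, abar_tm1 = alpha_bars[t], alpha_bars[t - 1]
x0, xt = 0.7, -0.3               # arbitrary scalar data point / noisy sample

# Closed forms from the notes
beta_tilde = (1 - abar_tm1) / (1 - abar_t) * b_t
mu_tilde = (np.sqrt(abar_tm1) * b_t * x0
            + np.sqrt(a_t) * (1 - abar_tm1) * xt) / (1 - abar_t)

# Same posterior via the Gaussian precision-weighted combination:
#   prior      q(x_{t-1}|x_0) = N(sqrt(abar_{t-1}) x_0, 1 - abar_{t-1})
#   likelihood q(x_t|x_{t-1}), as a function of x_{t-1},
#              is proportional to N(x_t / sqrt(a_t), b_t / a_t)
prec = a_t / b_t + 1.0 / (1 - abar_tm1)
mean = (np.sqrt(a_t) / b_t * xt
        + np.sqrt(abar_tm1) / (1 - abar_tm1) * x0) / prec

assert np.isclose(1.0 / prec, beta_tilde)
assert np.isclose(mean, mu_tilde)
```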

The KL divergence between two Gaussian distributions is composed of an MSE between the two means (divided by some variance term) plus some terms involving only the variances
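For reference, the standard closed form for two $d$-dimensional isotropic Gaussians makes this explicit:

```latex
D_{KL}\big(\mathcal{N}(\mu_q, \sigma_q^2 I) \,\|\, \mathcal{N}(\mu_p, \sigma_p^2 I)\big)
= \frac{1}{2}\left[
    d \log\frac{\sigma_p^2}{\sigma_q^2}
    + d\,\frac{\sigma_q^2}{\sigma_p^2}
    - d
    + \frac{\lVert \mu_p - \mu_q \rVert_2^2}{\sigma_p^2}
\right]
```

When $\sigma_q = \sigma_p$, everything but the squared-mean-difference term vanishes.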

In the case where we fix the variance schedule $\beta_t$,
 we can match the two distributions' variances exactly, and thus minimizing the KL divergence is exactly equivalent to minimizing the MSE between the two means ⇒ denoising objective: $\arg\min_\theta D_{KL}(q(x_{t-1} \mid x_t, x_0) \,\|\, p_\theta(x_{t-1} \mid x_t)) = \arg\min_\theta \frac{1}{2\tilde{\beta}_t} \|\mu_\theta(x_t, t) - \tilde{\mu}(x_t, x_0)\|_2^2$
 By setting $\mu_\theta(x_t, t) = \frac{\sqrt{\bar{\alpha}_{t-1}}\, \beta_t}{1 - \bar{\alpha}_t} \hat{x}_\theta(x_t, t) + \frac{\sqrt{\alpha_t}\, (1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} x_t$ (the same parameterization as $\tilde{\mu}(x_t, x_0)$, with $\hat{x}_\theta$ in place of $x_0$), we recover: $\arg\min_\theta \frac{1}{2\tilde{\beta}_t} \frac{\bar{\alpha}_{t-1}(1 - \alpha_t)^2}{(1 - \bar{\alpha}_t)^2} \|\hat{x}_\theta(x_t, t) - x_0\|_2^2$
 Optimizing a VDM boils down to learning a neural network to predict the original ground-truth image from an arbitrarily noisified version of it.
 Just do $\arg\min_\theta E_{t \sim U\{2, T\}}[\text{denoising term at time } t]$
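A minimal Monte Carlo sketch of this objective, with a toy linear model standing in for the network $\hat{x}_\theta$; the schedule and all concrete values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical fixed variance schedule
T = 100
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)

def denoising_loss(theta, x0):
    """One Monte Carlo estimate of E_{t ~ U{2,T}}[denoising term at time t]."""
    t = int(rng.integers(1, T))                      # uniform interior timestep
    abar_t, abar_tm1, beta_t = alpha_bars[t], alpha_bars[t - 1], betas[t]
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(abar_t) * x0 + np.sqrt(1 - abar_t) * eps   # sample q(x_t | x_0)
    x0_hat = theta * x_t                             # toy stand-in for x_hat_theta(x_t, t)
    # weight = (1 / 2*beta_tilde_t) * abar_{t-1} (1 - alpha_t)^2 / (1 - abar_t)^2
    beta_tilde = (1 - abar_tm1) / (1 - abar_t) * beta_t
    weight = abar_tm1 * beta_t**2 / (2 * beta_tilde * (1 - abar_t) ** 2)
    return weight * np.sum((x0_hat - x0) ** 2)

loss = denoising_loss(0.5, np.ones(8))
```

In practice $\hat{x}_\theta$ is a deep network taking both $x_t$ and $t$; the structure of the estimate is otherwise the same.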
Learning Diffusion Noise Parameters (variance schedule is not fixed)
 Above, we wrote our per-timestep objective as $\frac{1}{2\tilde{\beta}_t} \frac{\bar{\alpha}_{t-1}(1 - \alpha_t)^2}{(1 - \bar{\alpha}_t)^2} \|\hat{x}_\theta(x_t, t) - x_0\|_2^2$
 By plugging $\tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \beta_t = \frac{(1 - \bar{\alpha}_{t-1})(1 - \alpha_t)}{1 - \bar{\alpha}_t}$ into the above equation, we get $\frac{1}{2}\left(\frac{\bar{\alpha}_{t-1}}{1 - \bar{\alpha}_{t-1}} - \frac{\bar{\alpha}_t}{1 - \bar{\alpha}_t}\right) \|\hat{x}_\theta(x_t, t) - x_0\|_2^2 = \frac{1}{2}(\mathrm{SNR}(t-1) - \mathrm{SNR}(t)) \|\hat{x}_\theta(x_t, t) - x_0\|_2^2$
 Recall that $q(x_t \mid x_0) = \mathcal{N}(x_t;\, \sqrt{\bar{\alpha}_t}\, x_0,\, (1 - \bar{\alpha}_t) I)$.
 Then following the definition of the signal-to-noise ratio (SNR) as $\mathrm{SNR} = \frac{\mu^2}{\sigma^2}$, we can write $\mathrm{SNR}(t) = \frac{\bar{\alpha}_t}{1 - \bar{\alpha}_t}$. In a diffusion model, we require the SNR to monotonically decrease as timestep $t$ increases.
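Both the SNR formula and the monotonicity requirement are easy to check numerically for a typical schedule (the linear schedule here is an assumption for illustration):

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)        # hypothetical schedule
alpha_bars = np.cumprod(1.0 - betas)
snr = alpha_bars / (1 - alpha_bars)          # SNR(t) = abar_t / (1 - abar_t)

# abar_t strictly decreases, so the SNR strictly decreases with t
assert np.all(np.diff(snr) < 0)

# consequently the per-timestep weights (SNR(t-1) - SNR(t)) / 2 are positive
weights = 0.5 * (snr[:-1] - snr[1:])
assert np.all(weights > 0)
```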
Parametrizing the SNR
 We can directly parameterize the SNR at each timestep using a neural network, and learn it jointly along with the diffusion model.
 As the SNR must monotonically decrease over time, we can represent it as:
 $SNR_{ν}(t)=exp(−w_{ν}(t))$
 where $w_{ν}(t)$ is modeled as a monotonically increasing neural network with parameters $ν$.
 We can write elegant forms for the $\alpha$'s:
 $\frac{\bar{\alpha}_t}{1 - \bar{\alpha}_t} = \exp(-w_\nu(t))$
 $\bar{\alpha}_t = \mathrm{sigmoid}(-w_\nu(t))$
 $1 - \bar{\alpha}_t = \mathrm{sigmoid}(w_\nu(t))$
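One common way to build a monotonically increasing network is to constrain the weights to be positive and compose with monotone activations. The tiny architecture below (and its weight scaling) is an illustrative assumption, not the paper's exact model:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Raw parameters nu; exp() maps them to strictly positive weights,
# which together with monotone activations makes w_nu increasing in t.
W1, b1 = rng.standard_normal((8, 1)), rng.standard_normal(8)
W2, b2 = 0.1 * rng.standard_normal((1, 8)), rng.standard_normal(1)

def w_nu(t):
    """Monotonically increasing scalar network; t can be a vector in [0, 1]."""
    h = np.tanh(np.exp(W1) * t + b1[:, None])     # positive input weights
    return (np.exp(W2) @ h + b2[:, None]).ravel() # positive output weights

t = np.linspace(0.0, 1.0, 200)
w = w_nu(t)
assert np.all(np.diff(w) >= 0)                    # monotonically increasing

alpha_bar = sigmoid(-w)                           # abar_t = sigmoid(-w_nu(t))
snr = np.exp(-w)                                  # SNR_nu(t) = exp(-w_nu(t))
assert np.allclose(snr, alpha_bar / (1 - alpha_bar))
```

The last assertion checks the identity $\mathrm{sigmoid}(-w)/\mathrm{sigmoid}(w) = \exp(-w)$ that ties the two forms together.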
Optimizing everything
 We now optimize $\arg\min_{\theta, \nu} E_{t \sim U\{2, T\}}\left[E_{q(x_t \mid x_0)}[D_{KL}(q(x_{t-1} \mid x_t, x_0) \,\|\, p_\theta(x_{t-1} \mid x_t))]\right]$
 where we showed above that $D_{KL}(q(x_{t-1} \mid x_t, x_0) \,\|\, p_\theta(x_{t-1} \mid x_t)) = \frac{1}{2}(\mathrm{SNR}_\nu(t-1) - \mathrm{SNR}_\nu(t)) \|\hat{x}_\theta(x_t, t) - x_0\|_2^2$
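Putting the pieces together, a toy Monte Carlo sample of this joint objective might look like the following; every concrete choice (the scalar "denoiser", the linear schedule network, the values) is an assumption for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def joint_loss(theta, w_nu, x0, T=100):
    """One Monte Carlo sample of the joint objective over (theta, nu).
    theta is a toy scalar denoiser and w_nu a callable increasing schedule
    network; both stand in for the learned models."""
    t = int(rng.integers(1, T))                         # t ~ U{2, ..., T}
    snr = lambda s: np.exp(-w_nu(s / T))                # SNR_nu = exp(-w_nu)
    abar = lambda s: 1.0 / (1.0 + np.exp(w_nu(s / T)))  # sigmoid(-w_nu)
    eps = rng.standard_normal(np.shape(x0))
    x_t = np.sqrt(abar(t)) * x0 + np.sqrt(1 - abar(t)) * eps  # q(x_t | x_0)
    x0_hat = theta * x_t
    # the weight (SNR(t-1) - SNR(t)) / 2 is positive since w_nu is increasing
    return 0.5 * (snr(t - 1) - snr(t)) * np.sum((x0_hat - x0) ** 2)

loss = joint_loss(0.9, lambda u: 10.0 * u - 5.0, np.ones(4))
```

In a real implementation, gradients of this sample flow to both the denoiser parameters $\theta$ (through $\hat{x}_\theta$) and the schedule parameters $\nu$ (through the SNR weight and the noising of $x_0$).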