• “Understanding Diffusion Models: A Unified Perspective” + “Variational Diffusion Models” from Kingma et al.

  • The easiest way to think of a Variational Diffusion Model (VDM) is simply as a Markovian Hierarchical Variational Autoencoder with three key restrictions (written out formally right after this list):

    1. The latent dimension is exactly equal to the data dimension
      • This means that each latent $x_t$ has the same dimensionality as the data: $x_t \in \mathbb{R}^d$ for every timestep $t$. Everything stays in data space $\mathbb{R}^d$.
    2. The structure of the latent encoder at each timestep is not learned
      • It is pre-defined as a linear Gaussian model. In other words, it is a Gaussian distribution centered around the output of the previous timestep
    3. The Gaussian parameters of the latent encoders vary over time in such a way that the distribution of the latent at the final timestep is a standard Gaussian
  • Diagram
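Written out compactly (a summary of the three restrictions in the notation of “Understanding Diffusion Models: A Unified Perspective”, not a verbatim quote):

$$
q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{\alpha_t}\, x_{t-1},\ (1 - \alpha_t) I\big), \qquad p(x_T) = \mathcal{N}(x_T;\ 0,\ I),
$$

with every latent $x_t \in \mathbb{R}^d$, the same dimension $d$ as the data $x_0$.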

ELBO derivation

The variational lower bound loss is derived from $\log p(x) \geq \mathbb{E}_{q(x_{1:T} \mid x_0)}\left[\log \frac{p(x_{0:T})}{q(x_{1:T} \mid x_0)}\right]$, where we take advantage of the fact that both $q(x_{1:T} \mid x_0)$ and $p(x_{0:T})$ are Markovian, and thus the log of each product of transitions can be easily decomposed into incremental steps.
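Concretely, the two Markovian factorizations being exploited are (standard notation, reconstructed rather than quoted):

$$
q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1}), \qquad p(x_{0:T}) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t),
$$

so the log of their ratio splits into a sum of per-transition terms.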

  • We obtain
    $$\log p(x) \;\geq\; \underbrace{\mathbb{E}_{q(x_1 \mid x_0)}\big[\log p_\theta(x_0 \mid x_1)\big]}_{\text{reconstruction}} \;-\; \underbrace{D_{\mathrm{KL}}\big(q(x_T \mid x_0)\,\|\,p(x_T)\big)}_{\text{prior matching}} \;-\; \sum_{t=2}^{T} \underbrace{\mathbb{E}_{q(x_t \mid x_0)}\big[D_{\mathrm{KL}}\big(q(x_{t-1} \mid x_t, x_0)\,\|\,p_\theta(x_{t-1} \mid x_t)\big)\big]}_{\text{denoising matching}}$$
  • Reconstruction term
    • where the first term, $\mathbb{E}_{q(x_1 \mid x_0)}\big[\log p_\theta(x_0 \mid x_1)\big]$, is the reconstruction term, like its analogue in the ELBO of a vanilla VAE.
      • This term can be approximated and optimized using a Monte Carlo estimate
  • Denoising matching terms
    • the terms $\mathbb{E}_{q(x_t \mid x_0)}\big[D_{\mathrm{KL}}\big(q(x_{t-1} \mid x_t, x_0)\,\|\,p_\theta(x_{t-1} \mid x_t)\big)\big]$, for $t = 2, \dots, T$, are denoising matching terms.
      • We learn the desired denoising transition step $p_\theta(x_{t-1} \mid x_t)$ as an approximation to the tractable, ground-truth denoising transition step $q(x_{t-1} \mid x_t, x_0)$ (whose closed form is derived in The diffusion process)
      • Detail: when originally deriving the ELBO, you get “consistency terms”
        • i.e. $\mathbb{E}_{q(x_{t-1}, x_{t+1} \mid x_0)}\big[D_{\mathrm{KL}}\big(q(x_t \mid x_{t-1})\,\|\,p_\theta(x_t \mid x_{t+1})\big)\big]$, where a denoising step from a noisier image should match the corresponding noising step from a cleaner image.
        • However, actually optimizing the ELBO using the terms we just derived might be suboptimal; because the consistency term is computed as an expectation over two random variables $\{x_{t-1}, x_{t+1}\}$ for every timestep, the variance of its Monte Carlo estimate could potentially be higher than a term that is estimated using only one random variable per timestep.
        • The trick to obtain denoising matching terms is to rewrite the encoder transition as $q(x_t \mid x_{t-1}) = q(x_t \mid x_{t-1}, x_0)$, where the extra conditioning on $x_0$ is superfluous due to the Markov property, and then use Bayes rule to rewrite each transition as $q(x_t \mid x_{t-1}, x_0) = \frac{q(x_{t-1} \mid x_t, x_0)\, q(x_t \mid x_0)}{q(x_{t-1} \mid x_0)}$ (sketched right after this list)
  • Prior matching term
    • The last term, $D_{\mathrm{KL}}\big(q(x_T \mid x_0)\,\|\,p(x_T)\big)$, enforces an isotropic Gaussian at the end of the diffusion; it has no trainable parameters (and is effectively zero under our assumptions), so it is never optimized.
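A compressed sketch of how that rewrite turns the consistency terms into denoising matching terms (a standard manipulation written in the notation above, not a quote from either paper):

$$
\prod_{t=1}^{T} q(x_t \mid x_{t-1}) = q(x_1 \mid x_0) \prod_{t=2}^{T} \frac{q(x_{t-1} \mid x_t, x_0)\, q(x_t \mid x_0)}{q(x_{t-1} \mid x_0)} = q(x_T \mid x_0) \prod_{t=2}^{T} q(x_{t-1} \mid x_t, x_0),
$$

because the ratio $\prod_{t=2}^{T} q(x_t \mid x_0) / q(x_{t-1} \mid x_0)$ telescopes to $q(x_T \mid x_0) / q(x_1 \mid x_0)$. Substituting this back into the ELBO and regrouping yields exactly the reconstruction, prior matching, and denoising matching terms above, each an expectation over a single random variable per timestep.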

Optimizing in practice

Maximizing the ELBO is equivalent to denoising (when the variance schedule is fixed)

  • Remember that, using Bayes theorem, one can calculate the posterior $q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\big(x_{t-1};\ \mu_q(x_t, x_0),\ \sigma_q^2(t)\, I\big)$ in terms of $\mu_q(x_t, x_0)$ and $\sigma_q^2(t)$, which are defined as follows:

    • $\mu_q(x_t, x_0) = \frac{\sqrt{\alpha_t}(1 - \bar{\alpha}_{t-1})\, x_t + \sqrt{\bar{\alpha}_{t-1}}(1 - \alpha_t)\, x_0}{1 - \bar{\alpha}_t}$ (posterior mean) and $\sigma_q^2(t) = \frac{(1 - \alpha_t)(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t}$ (posterior variance schedule)
  • The KL-divergence between two Gaussian distributions, $D_{\mathrm{KL}}\big(\mathcal{N}(\mu_x, \Sigma_x)\,\|\,\mathcal{N}(\mu_y, \Sigma_y)\big) = \frac{1}{2}\Big[\log\frac{|\Sigma_y|}{|\Sigma_x|} - d + \operatorname{tr}(\Sigma_y^{-1}\Sigma_x) + (\mu_y - \mu_x)^\top \Sigma_y^{-1} (\mu_y - \mu_x)\Big]$, is composed of an MSE between the two means (scaled by a variance term) plus terms involving only the variances

  • In the case where we fix the variance schedule $\sigma_q^2(t)$ (it is not learned),

    • we can exactly match the two distributions' variances by setting $p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1};\ \mu_\theta(x_t, t),\ \sigma_q^2(t)\, I\big)$, and thus minimizing the KL-divergence is exactly equivalent to minimizing the MSE between the two means, i.e. the denoising objective: $\arg\min_\theta D_{\mathrm{KL}}\big(q(x_{t-1} \mid x_t, x_0)\,\|\,p_\theta(x_{t-1} \mid x_t)\big) = \arg\min_\theta \frac{1}{2\sigma_q^2(t)} \big\|\mu_\theta(x_t, t) - \mu_q(x_t, x_0)\big\|_2^2$
    • By setting $\mu_\theta(x_t, t) = \frac{\sqrt{\alpha_t}(1 - \bar{\alpha}_{t-1})\, x_t + \sqrt{\bar{\alpha}_{t-1}}(1 - \alpha_t)\, \hat{x}_\theta(x_t, t)}{1 - \bar{\alpha}_t}$ (the same parametrization as $\mu_q(x_t, x_0)$, with a neural-network prediction $\hat{x}_\theta(x_t, t)$ in place of $x_0$), we recover: $\arg\min_\theta \frac{1}{2\sigma_q^2(t)} \frac{\bar{\alpha}_{t-1}(1 - \alpha_t)^2}{(1 - \bar{\alpha}_t)^2} \big\|\hat{x}_\theta(x_t, t) - x_0\big\|_2^2$
    • Optimizing a VDM boils down to learning a neural network to predict the original ground truth image from an arbitrarily noisified version of it
      • Just do $\arg\min_\theta \mathbb{E}_{t \sim U\{2, T\}}\Big[\mathbb{E}_{q(x_t \mid x_0)}\big[D_{\mathrm{KL}}\big(q(x_{t-1} \mid x_t, x_0)\,\|\,p_\theta(x_{t-1} \mid x_t)\big)\big]\Big]$, i.e. minimize the expected per-timestep objective with stochastic samples of $t$ (see the code sketch right after this list)
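A minimal PyTorch-style sketch of this training objective, assuming a fixed variance schedule. The names x0_predictor (the network $\hat{x}_\theta$) and alpha_bar (a length-$T$ tensor of $\bar{\alpha}_t$ values) are placeholders introduced here, not taken from the notes or the papers:

```python
import torch

def vdm_x0_loss(x0, x0_predictor, alpha_bar, T):
    """One Monte Carlo estimate of the denoising matching objective:
    predict the clean x0 from a noised x_t at a uniformly sampled timestep t."""
    B = x0.shape[0]
    shape = (B,) + (1,) * (x0.dim() - 1)
    t = torch.randint(2, T + 1, (B,), device=x0.device)         # t ~ U{2, ..., T}
    a_bar_t = alpha_bar[t - 1].view(shape)                      # \bar{alpha}_t
    a_bar_tm1 = alpha_bar[t - 2].view(shape)                    # \bar{alpha}_{t-1}
    alpha_t = a_bar_t / a_bar_tm1                               # \alpha_t

    noise = torch.randn_like(x0)
    x_t = a_bar_t.sqrt() * x0 + (1 - a_bar_t).sqrt() * noise    # sample q(x_t | x_0)

    x0_hat = x0_predictor(x_t, t)                               # \hat{x}_theta(x_t, t)

    sigma2_q = (1 - alpha_t) * (1 - a_bar_tm1) / (1 - a_bar_t)  # posterior variance sigma_q^2(t)
    weight = a_bar_tm1 * (1 - alpha_t) ** 2 / (2 * sigma2_q * (1 - a_bar_t) ** 2)

    return (weight * (x0_hat - x0) ** 2).flatten(1).sum(dim=1).mean()
```

The per-timestep weight is the $\frac{1}{2\sigma_q^2(t)} \frac{\bar{\alpha}_{t-1}(1 - \alpha_t)^2}{(1 - \bar{\alpha}_t)^2}$ factor from above; the next section shows it simplifies to a difference of SNRs.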

Learning Diffusion Noise Parameters (variance schedule is not fixed)

  • Above, we wrote our per-timestep objective as $\frac{1}{2\sigma_q^2(t)} \frac{\bar{\alpha}_{t-1}(1 - \alpha_t)^2}{(1 - \bar{\alpha}_t)^2} \big\|\hat{x}_\theta(x_t, t) - x_0\big\|_2^2$
  • By plugging $\sigma_q^2(t) = \frac{(1 - \alpha_t)(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t}$ into the above equation, we get $\frac{1}{2}\Big(\frac{\bar{\alpha}_{t-1}}{1 - \bar{\alpha}_{t-1}} - \frac{\bar{\alpha}_t}{1 - \bar{\alpha}_t}\Big) \big\|\hat{x}_\theta(x_t, t) - x_0\big\|_2^2$ (the algebra is spelled out right after this list)
    • Recall that $q(x_t \mid x_0) = \mathcal{N}\big(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1 - \bar{\alpha}_t) I\big)$.
    • Then, following the definition of the signal-to-noise ratio (SNR) as $\mathrm{SNR} = \frac{\mu^2}{\sigma^2}$, we can write $\mathrm{SNR}(t) = \frac{\bar{\alpha}_t}{1 - \bar{\alpha}_t}$
  • The per-timestep objective therefore becomes $\frac{1}{2}\big(\mathrm{SNR}(t-1) - \mathrm{SNR}(t)\big) \big\|\hat{x}_\theta(x_t, t) - x_0\big\|_2^2$. In a diffusion model, we require the SNR to monotonically decrease as the timestep $t$ increases.
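The simplification used above, written out step by step (it only uses $\bar{\alpha}_t = \alpha_t \bar{\alpha}_{t-1}$ and the definition of $\sigma_q^2(t)$):

$$
\frac{1}{2\sigma_q^2(t)} \frac{\bar{\alpha}_{t-1}(1 - \alpha_t)^2}{(1 - \bar{\alpha}_t)^2}
= \frac{1 - \bar{\alpha}_t}{2(1 - \alpha_t)(1 - \bar{\alpha}_{t-1})} \cdot \frac{\bar{\alpha}_{t-1}(1 - \alpha_t)^2}{(1 - \bar{\alpha}_t)^2}
= \frac{\bar{\alpha}_{t-1} - \bar{\alpha}_t}{2(1 - \bar{\alpha}_{t-1})(1 - \bar{\alpha}_t)}
= \frac{1}{2}\left(\frac{\bar{\alpha}_{t-1}}{1 - \bar{\alpha}_{t-1}} - \frac{\bar{\alpha}_t}{1 - \bar{\alpha}_t}\right)
= \frac{1}{2}\big(\mathrm{SNR}(t-1) - \mathrm{SNR}(t)\big).
$$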

Parametrizing the SNR

  • We can directly parameterize the SNR at each timestep using a neural network, and learn it jointly along with the diffusion model.
  • As the SNR must monotonically decrease over time, we can represent it as $\mathrm{SNR}(t) = \exp\big(-\omega_\eta(t)\big)$,
    • where $\omega_\eta(t)$ is modeled as a monotonically increasing neural network with parameters $\eta$ (a toy sketch of one such network follows this list).
    • We can then write elegant forms for the $\bar{\alpha}_t$'s: $\bar{\alpha}_t = \mathrm{sigmoid}\big(-\omega_\eta(t)\big)$ and $1 - \bar{\alpha}_t = \mathrm{sigmoid}\big(\omega_\eta(t)\big)$.
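One possible toy sketch of such a monotone parametrization in PyTorch. This is my own illustrative construction (monotonicity enforced by squaring the linear-layer weights), not the parametrization used in the papers:

```python
import torch
import torch.nn as nn

class MonotonicOmega(nn.Module):
    """Sketch of a monotonically non-decreasing network omega_eta(t).
    Non-negative (squared) weights composed with increasing activations
    give a map that is non-decreasing in t."""

    def __init__(self, hidden: int = 64):
        super().__init__()
        self.l1 = nn.Linear(1, hidden)
        self.l2 = nn.Linear(hidden, 1)

    def forward(self, t: torch.Tensor) -> torch.Tensor:
        t = t.view(-1, 1).float()
        h = torch.sigmoid(t @ self.l1.weight.square().T + self.l1.bias)  # non-decreasing in t
        return h @ self.l2.weight.square().T + self.l2.bias              # omega_eta(t)

def alpha_bar_from_omega(omega: MonotonicOmega, t: torch.Tensor) -> torch.Tensor:
    """bar{alpha}_t = sigmoid(-omega_eta(t)), so SNR(t) = exp(-omega_eta(t)) decreases with t."""
    return torch.sigmoid(-omega(t)).squeeze(-1)
```

Squaring the weights is just one way to guarantee positivity; softplus or exp transforms of the weights would work equally well.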

Optimizing everything

  • We now optimize the denoising matching terms jointly over the denoising-network parameters $\theta$ and the noise-schedule parameters $\eta$: $\arg\min_{\theta, \eta}\ \mathbb{E}_{t \sim U\{2, T\}}\Big[\mathbb{E}_{q(x_t \mid x_0)}\big[D_{\mathrm{KL}}\big(q(x_{t-1} \mid x_t, x_0)\,\|\,p_\theta(x_{t-1} \mid x_t)\big)\big]\Big]$
  • where we showed above that, with the chosen parametrization, $D_{\mathrm{KL}}\big(q(x_{t-1} \mid x_t, x_0)\,\|\,p_\theta(x_{t-1} \mid x_t)\big) = \frac{1}{2}\big(\mathrm{SNR}(t-1) - \mathrm{SNR}(t)\big)\big\|\hat{x}_\theta(x_t, t) - x_0\big\|_2^2$ with $\mathrm{SNR}(t) = \exp\big(-\omega_\eta(t)\big)$ (a toy joint-training sketch follows)
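Tying the two sketches together, a hypothetical joint training step. It reuses the placeholder names vdm_x0_loss, MonotonicOmega, alpha_bar_from_omega, and x0_predictor from the earlier sketches; dataloader, T, and the optimizer settings are likewise assumptions, and only the denoising matching terms are shown (not the reconstruction term):

```python
import torch

# Assumed to exist from the earlier sketches: x0_predictor, vdm_x0_loss,
# MonotonicOmega, alpha_bar_from_omega, plus a dataloader of clean images and T.
omega = MonotonicOmega()
optimizer = torch.optim.Adam(
    list(x0_predictor.parameters()) + list(omega.parameters()), lr=1e-4
)

for x0 in dataloader:
    t_grid = torch.arange(1, T + 1)                     # timesteps 1..T
    alpha_bar = alpha_bar_from_omega(omega, t_grid)     # learned \bar{alpha}_t = sigmoid(-omega_eta(t))
    loss = vdm_x0_loss(x0, x0_predictor, alpha_bar, T)  # denoising matching terms only
    optimizer.zero_grad()
    loss.backward()                                     # gradients flow to both theta and eta
    optimizer.step()
```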