• “Understanding Diffusion Models: A Unified Perspective” + “Variational Diffusion Models” from Kingma et al.

  • The easiest way to think of a Variational Diffusion Model (VDM) is simply as a Markovian Hierarchical Variational Autoencoder with three key restrictions (written out formally right after this list):

    1. The latent dimension is exactly equal to the data dimension
      • This means that each latent $x_t$ has the same dimensionality as the data: $x_t \in \mathbb{R}^d$ for every timestep $t$. Everything stays in data space $\mathbb{R}^d$.
    2. The structure of the latent encoder at each timestep is not learned
      • It is pre-defined as a linear Gaussian model. In other words, it is a Gaussian distribution centered around the output of the previous timestep
    3. The Gaussian parameters of the latent encoders vary over time in such a way that the distribution of the latent at the final timestep is a standard Gaussian
  • Diagram
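Written out compactly (a summary of the three restrictions in the notation of “Understanding Diffusion Models: A Unified Perspective”, not a verbatim quote):

$$
q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{\alpha_t}\, x_{t-1},\ (1 - \alpha_t) I\big), \qquad p(x_T) = \mathcal{N}(x_T;\ 0,\ I),
$$

with every latent $x_t \in \mathbb{R}^d$, the same dimension $d$ as the data $x_0$.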

ELBO derivation

The variational lower bound loss is derived from $\log p(x) \geq \mathbb{E}_{q(x_{1:T} \mid x_0)}\left[\log \frac{p(x_{0:T})}{q(x_{1:T} \mid x_0)}\right]$, where we take advantage of the fact that both $q(x_{1:T} \mid x_0)$ and $p(x_{0:T})$ are Markovian, and thus the log of each product of transitions can be easily decomposed into incremental steps.
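Concretely, the two Markovian factorizations being exploited are (standard notation, reconstructed rather than quoted):

$$
q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1}), \qquad p(x_{0:T}) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t),
$$

so the log of their ratio splits into a sum of per-transition terms.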

  • We obtain
    $$\log p(x) \;\geq\; \underbrace{\mathbb{E}_{q(x_1 \mid x_0)}\big[\log p_\theta(x_0 \mid x_1)\big]}_{\text{reconstruction}} \;-\; \underbrace{D_{\mathrm{KL}}\big(q(x_T \mid x_0)\,\|\,p(x_T)\big)}_{\text{prior matching}} \;-\; \sum_{t=2}^{T} \underbrace{\mathbb{E}_{q(x_t \mid x_0)}\big[D_{\mathrm{KL}}\big(q(x_{t-1} \mid x_t, x_0)\,\|\,p_\theta(x_{t-1} \mid x_t)\big)\big]}_{\text{denoising matching}}$$
  • Reconstruction term
    • where the first term, $\mathbb{E}_{q(x_1 \mid x_0)}\big[\log p_\theta(x_0 \mid x_1)\big]$, is the reconstruction term, like its analogue in the ELBO of a vanilla VAE.
      • This term can be approximated and optimized using a Monte Carlo estimate
  • Denoising matching terms
    • the terms $\mathbb{E}_{q(x_t \mid x_0)}\big[D_{\mathrm{KL}}\big(q(x_{t-1} \mid x_t, x_0)\,\|\,p_\theta(x_{t-1} \mid x_t)\big)\big]$, for $t = 2, \dots, T$, are denoising matching terms.
      • We learn the desired denoising transition step $p_\theta(x_{t-1} \mid x_t)$ as an approximation to the tractable, ground-truth denoising transition step $q(x_{t-1} \mid x_t, x_0)$ (whose closed form is derived in The diffusion process)
      • Detail: when originally deriving the ELBO, you get “consistency terms”
        • i.e. $\mathbb{E}_{q(x_{t-1}, x_{t+1} \mid x_0)}\big[D_{\mathrm{KL}}\big(q(x_t \mid x_{t-1})\,\|\,p_\theta(x_t \mid x_{t+1})\big)\big]$, where a denoising step from a noisier image should match the corresponding noising step from a cleaner image.
        • However, actually optimizing the ELBO using the terms we just derived might be suboptimal; because the consistency term is computed as an expectation over two random variables $\{x_{t-1}, x_{t+1}\}$ for every timestep, the variance of its Monte Carlo estimate could potentially be higher than a term that is estimated using only one random variable per timestep.
        • The trick to obtain denoising matching terms is to rewrite the encoder transition as $q(x_t \mid x_{t-1}) = q(x_t \mid x_{t-1}, x_0)$, where the extra conditioning on $x_0$ is superfluous due to the Markov property, and then use Bayes rule to rewrite each transition as $q(x_t \mid x_{t-1}, x_0) = \frac{q(x_{t-1} \mid x_t, x_0)\, q(x_t \mid x_0)}{q(x_{t-1} \mid x_0)}$ (sketched right after this list)
  • Prior matching term
    • The last term, $D_{\mathrm{KL}}\big(q(x_T \mid x_0)\,\|\,p(x_T)\big)$, enforces an isotropic Gaussian at the end of the diffusion; it has no trainable parameters (and is effectively zero under our assumptions), so it is never optimized.
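A compressed sketch of how that rewrite turns the consistency terms into denoising matching terms (a standard manipulation written in the notation above, not a quote from either paper):

$$
\prod_{t=1}^{T} q(x_t \mid x_{t-1}) = q(x_1 \mid x_0) \prod_{t=2}^{T} \frac{q(x_{t-1} \mid x_t, x_0)\, q(x_t \mid x_0)}{q(x_{t-1} \mid x_0)} = q(x_T \mid x_0) \prod_{t=2}^{T} q(x_{t-1} \mid x_t, x_0),
$$

because the ratio $\prod_{t=2}^{T} q(x_t \mid x_0) / q(x_{t-1} \mid x_0)$ telescopes to $q(x_T \mid x_0) / q(x_1 \mid x_0)$. Substituting this back into the ELBO and regrouping yields exactly the reconstruction, prior matching, and denoising matching terms above, each an expectation over a single random variable per timestep.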

Optimizing in practice

Maximizing the ELBO is equivalent to denoising (when the variance schedule is fixed)

  • Remember that, using Bayes theorem, one can calculate the posterior $q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\big(x_{t-1};\ \mu_q(x_t, x_0),\ \sigma_q^2(t)\, I\big)$ in terms of $\mu_q(x_t, x_0)$ and $\sigma_q^2(t)$, which are defined as follows:

    • $\mu_q(x_t, x_0) = \frac{\sqrt{\alpha_t}(1 - \bar{\alpha}_{t-1})\, x_t + \sqrt{\bar{\alpha}_{t-1}}(1 - \alpha_t)\, x_0}{1 - \bar{\alpha}_t}$ (posterior mean) and $\sigma_q^2(t) = \frac{(1 - \alpha_t)(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t}$ (posterior variance schedule)
  • The KL-divergence between two Gaussian distributions, $D_{\mathrm{KL}}\big(\mathcal{N}(\mu_x, \Sigma_x)\,\|\,\mathcal{N}(\mu_y, \Sigma_y)\big) = \frac{1}{2}\Big[\log\frac{|\Sigma_y|}{|\Sigma_x|} - d + \operatorname{tr}(\Sigma_y^{-1}\Sigma_x) + (\mu_y - \mu_x)^\top \Sigma_y^{-1} (\mu_y - \mu_x)\Big]$, is composed of an MSE between the two means (scaled by a variance term) plus terms involving only the variances

  • In the case where we fix the variance schedule $\sigma_q^2(t)$ (it is not learned),

    • we can exactly match the two distributions' variances by setting $p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1};\ \mu_\theta(x_t, t),\ \sigma_q^2(t)\, I\big)$, and thus minimizing the KL-divergence is exactly equivalent to minimizing the MSE between the two means, i.e. the denoising objective: $\arg\min_\theta D_{\mathrm{KL}}\big(q(x_{t-1} \mid x_t, x_0)\,\|\,p_\theta(x_{t-1} \mid x_t)\big) = \arg\min_\theta \frac{1}{2\sigma_q^2(t)} \big\|\mu_\theta(x_t, t) - \mu_q(x_t, x_0)\big\|_2^2$
    • By setting $\mu_\theta(x_t, t) = \frac{\sqrt{\alpha_t}(1 - \bar{\alpha}_{t-1})\, x_t + \sqrt{\bar{\alpha}_{t-1}}(1 - \alpha_t)\, \hat{x}_\theta(x_t, t)}{1 - \bar{\alpha}_t}$ (the same parametrization as $\mu_q(x_t, x_0)$, with a neural-network prediction $\hat{x}_\theta(x_t, t)$ in place of $x_0$), we recover: $\arg\min_\theta \frac{1}{2\sigma_q^2(t)} \frac{\bar{\alpha}_{t-1}(1 - \alpha_t)^2}{(1 - \bar{\alpha}_t)^2} \big\|\hat{x}_\theta(x_t, t) - x_0\big\|_2^2$
    • Optimizing a VDM boils down to learning a neural network to predict the original ground truth image from an arbitrarily noisified version of it
      • Just do $\arg\min_\theta \mathbb{E}_{t \sim U\{2, T\}}\Big[\mathbb{E}_{q(x_t \mid x_0)}\big[D_{\mathrm{KL}}\big(q(x_{t-1} \mid x_t, x_0)\,\|\,p_\theta(x_{t-1} \mid x_t)\big)\big]\Big]$, i.e. minimize the expected per-timestep objective with stochastic samples of $t$ (see the code sketch right after this list)
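A minimal PyTorch-style sketch of this training objective, assuming a fixed variance schedule. The names x0_predictor (the network $\hat{x}_\theta$) and alpha_bar (a length-$T$ tensor of $\bar{\alpha}_t$ values) are placeholders introduced here, not taken from the notes or the papers:

```python
import torch

def vdm_x0_loss(x0, x0_predictor, alpha_bar, T):
    """One Monte Carlo estimate of the denoising matching objective:
    predict the clean x0 from a noised x_t at a uniformly sampled timestep t."""
    B = x0.shape[0]
    shape = (B,) + (1,) * (x0.dim() - 1)
    t = torch.randint(2, T + 1, (B,), device=x0.device)         # t ~ U{2, ..., T}
    a_bar_t = alpha_bar[t - 1].view(shape)                      # \bar{alpha}_t
    a_bar_tm1 = alpha_bar[t - 2].view(shape)                    # \bar{alpha}_{t-1}
    alpha_t = a_bar_t / a_bar_tm1                               # \alpha_t

    noise = torch.randn_like(x0)
    x_t = a_bar_t.sqrt() * x0 + (1 - a_bar_t).sqrt() * noise    # sample q(x_t | x_0)

    x0_hat = x0_predictor(x_t, t)                               # \hat{x}_theta(x_t, t)

    sigma2_q = (1 - alpha_t) * (1 - a_bar_tm1) / (1 - a_bar_t)  # posterior variance sigma_q^2(t)
    weight = a_bar_tm1 * (1 - alpha_t) ** 2 / (2 * sigma2_q * (1 - a_bar_t) ** 2)

    return (weight * (x0_hat - x0) ** 2).flatten(1).sum(dim=1).mean()
```

The per-timestep weight is the $\frac{1}{2\sigma_q^2(t)} \frac{\bar{\alpha}_{t-1}(1 - \alpha_t)^2}{(1 - \bar{\alpha}_t)^2}$ factor from above; the next section shows it simplifies to a difference of SNRs.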

Learning Diffusion Noise Parameters (variance schedule is not fixed)

  • Above, we wrote our per-timestep objective as $\frac{1}{2\sigma_q^2(t)} \frac{\bar{\alpha}_{t-1}(1 - \alpha_t)^2}{(1 - \bar{\alpha}_t)^2} \big\|\hat{x}_\theta(x_t, t) - x_0\big\|_2^2$
  • By plugging $\sigma_q^2(t) = \frac{(1 - \alpha_t)(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t}$ into the above equation, we get $\frac{1}{2}\Big(\frac{\bar{\alpha}_{t-1}}{1 - \bar{\alpha}_{t-1}} - \frac{\bar{\alpha}_t}{1 - \bar{\alpha}_t}\Big) \big\|\hat{x}_\theta(x_t, t) - x_0\big\|_2^2$ (the algebra is spelled out right after this list)
    • Recall that $q(x_t \mid x_0) = \mathcal{N}\big(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1 - \bar{\alpha}_t) I\big)$.
    • Then, following the definition of the signal-to-noise ratio (SNR) as $\mathrm{SNR} = \frac{\mu^2}{\sigma^2}$, we can write $\mathrm{SNR}(t) = \frac{\bar{\alpha}_t}{1 - \bar{\alpha}_t}$
  • The per-timestep objective therefore becomes $\frac{1}{2}\big(\mathrm{SNR}(t-1) - \mathrm{SNR}(t)\big) \big\|\hat{x}_\theta(x_t, t) - x_0\big\|_2^2$. In a diffusion model, we require the SNR to monotonically decrease as the timestep $t$ increases.
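The simplification used above, written out step by step (it only uses $\bar{\alpha}_t = \alpha_t \bar{\alpha}_{t-1}$ and the definition of $\sigma_q^2(t)$):

$$
\frac{1}{2\sigma_q^2(t)} \frac{\bar{\alpha}_{t-1}(1 - \alpha_t)^2}{(1 - \bar{\alpha}_t)^2}
= \frac{1 - \bar{\alpha}_t}{2(1 - \alpha_t)(1 - \bar{\alpha}_{t-1})} \cdot \frac{\bar{\alpha}_{t-1}(1 - \alpha_t)^2}{(1 - \bar{\alpha}_t)^2}
= \frac{\bar{\alpha}_{t-1} - \bar{\alpha}_t}{2(1 - \bar{\alpha}_{t-1})(1 - \bar{\alpha}_t)}
= \frac{1}{2}\left(\frac{\bar{\alpha}_{t-1}}{1 - \bar{\alpha}_{t-1}} - \frac{\bar{\alpha}_t}{1 - \bar{\alpha}_t}\right)
= \frac{1}{2}\big(\mathrm{SNR}(t-1) - \mathrm{SNR}(t)\big).
$$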

Parametrizing the SNR

  • We can directly parameterize the SNR at each timestep using a neural network, and learn it jointly along with the diffusion model.
  • As the SNR must monotonically decrease over time, we can represent it as $\mathrm{SNR}(t) = \exp\big(-\omega_\eta(t)\big)$,
    • where $\omega_\eta(t)$ is modeled as a monotonically increasing neural network with parameters $\eta$ (a toy sketch of one such network follows this list).
    • We can then write elegant forms for the $\bar{\alpha}_t$'s: $\bar{\alpha}_t = \mathrm{sigmoid}\big(-\omega_\eta(t)\big)$ and $1 - \bar{\alpha}_t = \mathrm{sigmoid}\big(\omega_\eta(t)\big)$.
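One possible toy sketch of such a monotone parametrization in PyTorch. This is my own illustrative construction (monotonicity enforced by squaring the linear-layer weights), not the parametrization used in the papers:

```python
import torch
import torch.nn as nn

class MonotonicOmega(nn.Module):
    """Sketch of a monotonically non-decreasing network omega_eta(t).
    Non-negative (squared) weights composed with increasing activations
    give a map that is non-decreasing in t."""

    def __init__(self, hidden: int = 64):
        super().__init__()
        self.l1 = nn.Linear(1, hidden)
        self.l2 = nn.Linear(hidden, 1)

    def forward(self, t: torch.Tensor) -> torch.Tensor:
        t = t.view(-1, 1).float()
        h = torch.sigmoid(t @ self.l1.weight.square().T + self.l1.bias)  # non-decreasing in t
        return h @ self.l2.weight.square().T + self.l2.bias              # omega_eta(t)

def alpha_bar_from_omega(omega: MonotonicOmega, t: torch.Tensor) -> torch.Tensor:
    """bar{alpha}_t = sigmoid(-omega_eta(t)), so SNR(t) = exp(-omega_eta(t)) decreases with t."""
    return torch.sigmoid(-omega(t)).squeeze(-1)
```

Squaring the weights is just one way to guarantee positivity; softplus or exp transforms of the weights would work equally well.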

Optimizing everything

  • We now optimize the denoising matching terms jointly over the denoising-network parameters $\theta$ and the noise-schedule parameters $\eta$: $\arg\min_{\theta, \eta}\ \mathbb{E}_{t \sim U\{2, T\}}\Big[\mathbb{E}_{q(x_t \mid x_0)}\big[D_{\mathrm{KL}}\big(q(x_{t-1} \mid x_t, x_0)\,\|\,p_\theta(x_{t-1} \mid x_t)\big)\big]\Big]$
  • where we showed above that, with the chosen parametrization, $D_{\mathrm{KL}}\big(q(x_{t-1} \mid x_t, x_0)\,\|\,p_\theta(x_{t-1} \mid x_t)\big) = \frac{1}{2}\big(\mathrm{SNR}(t-1) - \mathrm{SNR}(t)\big)\big\|\hat{x}_\theta(x_t, t) - x_0\big\|_2^2$ with $\mathrm{SNR}(t) = \exp\big(-\omega_\eta(t)\big)$ (a toy joint-training sketch follows)
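Tying the two sketches together, a hypothetical joint training step. It reuses the placeholder names vdm_x0_loss, MonotonicOmega, alpha_bar_from_omega, and x0_predictor from the earlier sketches; dataloader, T, and the optimizer settings are likewise assumptions, and only the denoising matching terms are shown (not the reconstruction term):

```python
import torch

# Assumed to exist from the earlier sketches: x0_predictor, vdm_x0_loss,
# MonotonicOmega, alpha_bar_from_omega, plus a dataloader of clean images and T.
omega = MonotonicOmega()
optimizer = torch.optim.Adam(
    list(x0_predictor.parameters()) + list(omega.parameters()), lr=1e-4
)

for x0 in dataloader:
    t_grid = torch.arange(1, T + 1)                     # timesteps 1..T
    alpha_bar = alpha_bar_from_omega(omega, t_grid)     # learned \bar{alpha}_t = sigmoid(-omega_eta(t))
    loss = vdm_x0_loss(x0, x0_predictor, alpha_bar, T)  # denoising matching terms only
    optimizer.zero_grad()
    loss.backward()                                     # gradients flow to both theta and eta
    optimizer.step()
```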