• For many modalities, we can think of the data we observe as represented or generated by an associated unseen latent variable, which we can denote by random variable $z$.

  • SOURCE: Understanding Diffusion Models: A Unified Perspective

Evidence Lower Bound

  • Mathematically, we can imagine the latent variables $z$ and the data we observe $x$ as modeled by a joint distribution $p(x, z)$.
  • One approach to generative modeling, termed “likelihood-based”, is to learn a model that maximizes the likelihood $p(x)$ of all observed $x$. There are two ways to manipulate this joint distribution to recover the likelihood $p(x)$:
    • Equation 1: marginalization: $p(x) = \int p(x, z)\, dz$
    • Equation 2: chain rule: $p(x) = \frac{p(x, z)}{p(z \mid x)}$
    • Computing either directly is intractable: the marginalization requires integrating over all latents $z$, and the chain rule requires access to the true posterior $p(z \mid x)$, which we don’t have
  • However, using the two equations, we can derive a lower bound on the log-likelihood, which gives us a good enough proxy objective to maximize: the Evidence Lower Bound (ELBO), $\mathbb{E}_{q_\phi(z \mid x)}\!\left[\log \frac{p(x, z)}{q_\phi(z \mid x)}\right] \le \log p(x)$
    • $q_\phi(z \mid x)$ is an approximate variational distribution of the true posterior $p(z \mid x)$, with parameters $\phi$ to optimize
    • The ELBO can be derived by writing $\log p(x)$ in two ways:
      • applying Eq. 1, multiplying by $\frac{q_\phi(z \mid x)}{q_\phi(z \mid x)}$, and applying Jensen’s inequality (not very informative, since it says nothing about how tight the bound is)
      • multiplying by $1 = \int q_\phi(z \mid x)\, dz$, applying Eq. 2, multiplying by $\frac{q_\phi(z \mid x)}{q_\phi(z \mid x)}$, splitting the expectation using the log, and arriving at $\log p(x) = \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log \frac{p(x, z)}{q_\phi(z \mid x)}\right] + D_{\mathrm{KL}}\!\left(q_\phi(z \mid x) \,\|\, p(z \mid x)\right)$
        • the KL divergence is always $\ge 0$, so the first term (the ELBO) lower-bounds $\log p(x)$
  • Conclusion: the gap between $\log p(x)$ and the ELBO is exactly the KL divergence between the approximate posterior $q_\phi(z \mid x)$ and the true posterior $p(z \mid x)$, so the ELBO is a proxy that is as good as that KL divergence is small
  • Since $\log p(x)$ does not depend on $\phi$, maximizing the ELBO with respect to $\phi$ simultaneously minimizes this KL divergence, i.e. it pushes the approximate posterior toward the true posterior (verified numerically in the sketch below)
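
To make the decomposition concrete, here is a minimal numerical sketch (not from the source) using a 1D linear-Gaussian toy model, where the marginal likelihood, the true posterior, and the KL divergence are all available in closed form; the model, the observation x = 1.3, and the deliberately imperfect q(z|x) are illustrative assumptions.

```python
# Minimal sketch (toy model, not from the source): numerically verify
#   log p(x) = ELBO + KL( q(z|x) || p(z|x) )
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = 1.3        # a single observed datapoint (arbitrary choice)
sigma = 1.0    # likelihood std: p(x|z) = N(x; z, sigma^2), prior p(z) = N(0, 1)

# Exact quantities available in this toy model
log_px = norm.logpdf(x, loc=0.0, scale=np.sqrt(1.0 + sigma**2))  # marginal likelihood
post_mu, post_var = x / 2.0, 0.5                                 # true posterior p(z|x)

# An arbitrary (imperfect) approximate posterior q(z|x) = N(mu_q, s_q^2)
mu_q, s_q = 0.4, 0.9

# Monte Carlo estimate of the ELBO = E_q[ log p(x, z) - log q(z|x) ]
z = rng.normal(mu_q, s_q, size=200_000)
log_joint = norm.logpdf(z, 0.0, 1.0) + norm.logpdf(x, z, sigma)  # log p(z) + log p(x|z)
elbo = np.mean(log_joint - norm.logpdf(z, mu_q, s_q))

# Analytic KL between two 1D Gaussians: KL( q || p(z|x) )
kl = np.log(np.sqrt(post_var) / s_q) + (s_q**2 + (mu_q - post_mu)**2) / (2 * post_var) - 0.5

print(f"log p(x)  = {log_px:.4f}")
print(f"ELBO      = {elbo:.4f}   (always <= log p(x))")
print(f"ELBO + KL = {elbo + kl:.4f}  (matches log p(x) up to MC error)")
```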

Variational Autoencoders

  • In the default formulation of the VAE, we directly maximize the ELBO.
    • It’s called variational because we optimize for the best $q_\phi(z \mid x)$ amongst a family of potential posterior distributions parameterized by $\phi$
    • It’s called autoencoder because it follows the usual autoencoder architecture: the input $x$ is mapped to a latent and then reconstructed from it
  • In practice, there are thus two sets of parameters being optimized: $\phi$ for the encoder and $\theta$ for the decoder.
  • Let’s see how maximizing the ELBO makes sense in this context. It can be rewritten as $\mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right] - D_{\mathrm{KL}}\!\left(q_\phi(z \mid x) \,\|\, p(z)\right)$, a reconstruction term minus a prior-matching term:
    • $p_\theta(x \mid z)$ is learned as a deterministic function (decoder) that converts a given latent vector $z$ into an observation $x$. This explicitly assumes a somewhat deterministic mapping between $x$ and $z$.
    • $q_\phi(z \mid x)$ can be seen as an intermediate bottlenecking distribution (encoder)

How to train

  • A defining feature of the VAE is how the ELBO is optimized jointly over the encoder parameters $\phi$ and the decoder parameters $\theta$.
  • The encoder of the VAE is commonly chosen to model a multivariate Gaussian with diagonal covariance, $q_\phi(z \mid x) = \mathcal{N}\!\left(z;\, \mu_\phi(x),\, \sigma_\phi^2(x)\,\mathbf{I}\right)$, and the prior is often selected to be a standard multivariate Gaussian, $p(z) = \mathcal{N}(z;\, \mathbf{0},\, \mathbf{I})$
    • We learn the bottleneck mean $\mu_\phi(x)$ and diagonal covariance $\sigma_\phi^2(x)\,\mathbf{I}$ (sketched below)
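
A minimal PyTorch-style sketch of such an encoder, assuming flattened inputs and hypothetical layer sizes (none of this is prescribed by the source): the network outputs the mean and log-variance of the diagonal Gaussian q_phi(z|x), while the prior p(z) stays fixed as a standard normal.

```python
from torch import nn

class GaussianEncoder(nn.Module):
    """Maps x to the mean and log-variance of q_phi(z|x) = N(mu, diag(sigma^2))."""

    def __init__(self, x_dim: int, hidden_dim: int, z_dim: int):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(x_dim, hidden_dim), nn.ReLU())
        self.mu_head = nn.Linear(hidden_dim, z_dim)       # bottleneck mean
        self.logvar_head = nn.Linear(hidden_dim, z_dim)   # log of the diagonal covariance

    def forward(self, x):
        h = self.body(x)
        return self.mu_head(h), self.logvar_head(h)

# Hypothetical usage (sizes are arbitrary): the prior p(z) is simply N(0, I) over z_dim dims.
#   enc = GaussianEncoder(x_dim=784, hidden_dim=400, z_dim=32)
#   mu, logvar = enc(x_batch)
```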

Maximizing ELBO

  • The prior-matching KL divergence term $D_{\mathrm{KL}}\!\left(q_\phi(z \mid x) \,\|\, p(z)\right)$ can be computed analytically, since both distributions are Gaussian
  • The reconstruction term $\mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right]$ is approximated through Monte Carlo sampling
  • Objective: $\arg\max_{\phi, \theta}\; \sum_{l=1}^{L} \log p_\theta\!\left(x \mid z^{(l)}\right) - D_{\mathrm{KL}}\!\left(q_\phi(z \mid x) \,\|\, p(z)\right)$
  • IMPORTANT: the latents $\{z^{(l)}\}_{l=1}^{L}$ are sampled from $q_\phi(z \mid x)$ for every sample $x$ in the dataset, i.e. we are making sure that going $x \to z \to x$ gives a good reconstruction
  • the KL divergence term ensures that the learned posterior maps to a well-behaved distribution close to the prior $p(z)$, ensuring that we’ll be able to sample $z \sim p(z)$ to create new samples later on (obviously it is also part of maximizing the ELBO); a loss-function sketch follows below
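
A sketch of this objective as a training loss (negated, since optimizers minimize), assuming a unit-variance Gaussian decoder so the reconstruction log-likelihood reduces to a squared error up to constants, and a single Monte Carlo sample (L = 1) per datapoint; these are common choices, not requirements from the source.

```python
import torch

def vae_loss(x, x_hat, mu, logvar):
    """Negative ELBO for one batch (lower is better).

    Assumes x_hat = decoder(z) with z drawn once from q_phi(z|x) (L = 1).
    Reconstruction term: Monte Carlo estimate of E_q[log p_theta(x|z)] under a
    unit-variance Gaussian decoder (squared error up to additive constants).
    Prior-matching term: analytic KL( N(mu, diag(exp(logvar))) || N(0, I) ).
    """
    recon = 0.5 * ((x_hat - x) ** 2).sum(dim=-1)                          # -log p(x|z) + const
    kl = 0.5 * (mu ** 2 + torch.exp(logvar) - logvar - 1.0).sum(dim=-1)   # closed-form KL
    return (recon + kl).mean()
```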

Reparameterization trick

  • The latents $z$ that are obtained from going through the encoder and then sampling need to be computed in a way that is differentiable with respect to $\phi$, because they are passed on to the decoder for the reconstruction term; naively sampling $z \sim q_\phi(z \mid x)$ is a stochastic, non-differentiable operation
  • To ensure this, each $z$ is computed as a deterministic function of the encoder outputs and auxiliary noise $\epsilon$ (see the code sketch below)
    • (in theory) $z \sim q_\phi(z \mid x) = \mathcal{N}\!\left(z;\, \mu_\phi(x),\, \sigma_\phi^2(x)\,\mathbf{I}\right)$
    • (in practice) $z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon$ with $\epsilon \sim \mathcal{N}(\epsilon;\, \mathbf{0},\, \mathbf{I})$
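
In code, the trick is a few lines; `mu` and `logvar` here are assumed to be the encoder outputs from the earlier sketch.

```python
import torch

def reparameterize(mu, logvar):
    """z = mu + sigma * eps with eps ~ N(0, I); differentiable w.r.t. mu and logvar."""
    std = torch.exp(0.5 * logvar)   # sigma = exp(logvar / 2)
    eps = torch.randn_like(std)     # auxiliary noise, independent of the parameters
    return mu + std * eps
```

At generation time, by contrast, the encoder is skipped entirely: we sample $z \sim p(z)$ directly from the prior and decode it.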

Encoder

  • We can use $z$ as a meaningful and useful representation of $x$ (for latent diffusion!!)

Posterior collapse

What is it

  • Posterior collapse occurs when the approximate posterior $q_\phi(z \mid x)$ collapses to the prior $p(z)$ irrespective of the input $x$.
  • This means that the latent variable $z$ carries very little information about the input $x$
    • effectively rendering the latent space meaningless.
  • In this scenario, the encoder outputs a distribution that is very close to the prior $p(z)$
    • making the KL divergence term very small,
    • but at the cost of losing meaningful representations of the input data (a diagnostic sketch follows below).
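
One simple diagnostic (an illustration of this note's framing, not something prescribed by the source) is to monitor the analytic KL term per latent dimension on held-out data: dimensions whose KL stays near zero carry essentially no information about $x$.

```python
import torch

def kl_per_dimension(mu, logvar):
    """Average KL( q_phi(z|x) || N(0, I) ) per latent dimension over a batch.

    Dimensions whose value stays near zero have (nearly) collapsed to the prior.
    """
    kl = 0.5 * (mu ** 2 + torch.exp(logvar) - logvar - 1.0)   # shape: (batch, z_dim)
    return kl.mean(dim=0)                                     # shape: (z_dim,)
```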

Causes

  • Overpowering Decoder: If the decoder $p_\theta(x \mid z)$ is expressive enough to model the data well while largely ignoring the latent (e.g. a strong autoregressive decoder), the latent variable $z$ becomes redundant.

  • High KL Weight: During training, the KL divergence term can dominate the loss function, pushing the posterior $q_\phi(z \mid x)$ to align closely with the prior $p(z)$.

How to address it

  • Adjusting the KL Weight: Introducing an annealing schedule where the weight of the KL divergence term is gradually increased during training can help prevent early posterior collapse (see the warm-up sketch after this list).
  • VAE Variants: Using alternative VAE architectures such as β-VAE, which introduces an adjustable weight on the KL divergence term, or hierarchical VAEs, which have more complex latent structures, can help avoid posterior collapse.
  • Structured Priors: Using more complex priors than a simple Gaussian can help in maintaining a meaningful latent space.
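
A minimal sketch of such a schedule, with a linear warm-up over a hypothetical `warmup_steps`; the shape and length of the schedule are illustrative assumptions, not prescriptions from the source.

```python
def kl_weight(step: int, warmup_steps: int = 10_000) -> float:
    """Linearly anneal the KL weight (beta) from 0 to 1 over `warmup_steps` training steps."""
    return min(1.0, step / warmup_steps)

# Inside the training loop, the per-batch loss would then be weighted as:
#   loss = recon.mean() + kl_weight(step) * kl.mean()
# so the model first learns to reconstruct before the prior-matching term is fully enforced.
```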