
For many modalities, we can think of the data we observe as represented or generated by an associated unseen latent variable, which we can denote by random variable $z$.

SOURCE: Understanding Diffusion Models: A Unified Perspective
Evidence Lower Bound
 Mathematically, we can imagine the latent variables and the data we observe as modeled by a joint distribution $p(x,z)$.
 One approach to generative modeling, termed “likelihood-based”, is to learn a model that maximizes the likelihood $p(x)$ of all observed $x$. There are two ways to manipulate this joint distribution to recover the likelihood:
 Equation 1: marginalization: $p(x)=∫p(x,z)dz$
 Equation 2: chain rule: $p(x)=\frac{p(x,z)}{p(z∣x)}$
 Computing either of these directly is intractable in general: the integral over $z$ has no closed form, and the true posterior $p(z∣x)$ is unknown
 However, using the two equations, we can lower-bound the likelihood, which gives us a good-enough proxy objective to maximize ⇒ ELBO
 $log p(x)≥E_{q_{ϕ}(z∣x)}\left[log \frac{p(x,z)}{q_{ϕ}(z∣x)}\right]$
 $q_{ϕ}(z∣x)$ is an approximate variational distribution of $p(z∣x)$ with parameters $ϕ$ to optimize
 ELBO can be derived from writing $log p(x)$ and
 applying Eq. 1, multiply by $1=\frac{q_{ϕ}(z∣x)}{q_{ϕ}(z∣x)}$, apply Jensen’s inequality (not very informative)
 multiply by $1=∫q_{ϕ}(z∣x)dz$, apply Eq. 2, multiply by $1=\frac{q_{ϕ}(z∣x)}{q_{ϕ}(z∣x)}$, split the expectation using the log, arriving at
 $log p(x)=E_{q_{ϕ}(z∣x)}\left[log \frac{p(x,z)}{q_{ϕ}(z∣x)}\right]+D_{KL}(q_{ϕ}(z∣x)∣∣p(z∣x))$
 KL divergence always $≥0$
 Conclusion: the ELBO is a proxy whose gap from $log p(x)$ is exactly the KL divergence between the approximate posterior $q_{ϕ}(z∣x)$ and the true posterior $p(z∣x)$, so the tighter the approximation, the better the proxy
 ⇒ Maximizing the ELBO will minimize the KL divergence
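The second derivation sketched above can be written out step by step:

```latex
\begin{align}
\log p(x)
  &= \log p(x) \int q_{\phi}(z \mid x)\, dz
   && \text{multiply by } 1 = \textstyle\int q_{\phi}(z \mid x)\, dz \\
  &= \mathbb{E}_{q_{\phi}(z \mid x)}\!\left[ \log p(x) \right] \\
  &= \mathbb{E}_{q_{\phi}(z \mid x)}\!\left[ \log \frac{p(x,z)}{p(z \mid x)} \right]
   && \text{apply Eq.\ 2} \\
  &= \mathbb{E}_{q_{\phi}(z \mid x)}\!\left[ \log \frac{p(x,z)\, q_{\phi}(z \mid x)}{p(z \mid x)\, q_{\phi}(z \mid x)} \right]
   && \text{multiply by } 1 = \tfrac{q_{\phi}(z \mid x)}{q_{\phi}(z \mid x)} \\
  &= \mathbb{E}_{q_{\phi}(z \mid x)}\!\left[ \log \frac{p(x,z)}{q_{\phi}(z \mid x)} \right]
   + \mathbb{E}_{q_{\phi}(z \mid x)}\!\left[ \log \frac{q_{\phi}(z \mid x)}{p(z \mid x)} \right]
   && \text{split via the log} \\
  &= \underbrace{\mathbb{E}_{q_{\phi}(z \mid x)}\!\left[ \log \frac{p(x,z)}{q_{\phi}(z \mid x)} \right]}_{\text{ELBO}}
   + D_{\mathrm{KL}}\!\left( q_{\phi}(z \mid x) \,\|\, p(z \mid x) \right)
\end{align}
```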
Variational Autoencoders
 In the default formulation of the VAE, we directly maximize the ELBO.
 It’s called variational because we optimize for the best $q_{ϕ}(z∣x)$ amongst a family of potential posterior distributions parameterized by $ϕ$
 It’s called autoencoder because it follows the usual autoencoder architecture
 In practice, there’s thus two sets of parameters optimized $ϕ$ for the encoder and $θ$ for the decoder.
 Let’s see how maximizing the ELBO makes sense in this context:
 $p_{θ}(x∣z)$ is the decoder, converting a given latent vector $z$ into an observation $x$. Modeling it this way explicitly assumes a somewhat deterministic mapping between $x$ and $z$.
 $q_{ϕ}(z∣x)$ can be seen as an intermediate bottlenecking distribution (encoder)
How to train
 A defining feature of the VAE is how the ELBO is optimized jointly over parameters $ϕ$ and $θ$.
 The encoder of the VAE is commonly chosen to model a multivariate Gaussian with diagonal covariance, and the prior is often selected to be a standard multivariate Gaussian
 $q_{ϕ}(z∣x)=N(z;μ_{ϕ}(x),σ_{ϕ}^{2}(x)I)$
 $p(z)=N(z;0,I)$
 We learn the bottleneck mean and covariance
Maximizing ELBO
 KL divergence can be computed analytically
 Reconstruction term through Monte Carlo
 Objective: $\underset{ϕ,θ}{argmax} \sum_{l=1}^{L} log p_{θ}(x∣z_{(l)})−D_{KL}(q_{ϕ}(z∣x)∣∣p(z))$
 IMPORTANT: the latents $z_{(l)}$ are sampled from $q_{ϕ}(z∣x)$ for every sample $x$ in the dataset, i.e. making sure that going x → z → x gives a good reconstruction
 the KL divergence ensures that the posterior maps to a well-behaved distribution, ensuring that we’ll be able to sample from $p(z)$ to create samples later on (it obviously also contributes to maximizing the ELBO)
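As a sketch (function and variable names like `gaussian_kl`, `mu`, `log_var` are illustrative, not from the source), the per-sample objective combines a Monte Carlo reconstruction term with the analytic Gaussian KL, $D_{KL}=\frac{1}{2}\sum_{j}(μ_{j}^{2}+σ_{j}^{2}−log σ_{j}^{2}−1)$. Here a unit-variance Gaussian decoder is assumed, so $log p_{θ}(x∣z)$ reduces to a squared error up to a constant:

```python
import numpy as np

def gaussian_kl(mu, log_var):
    # Analytic KL( N(mu, diag(sigma^2)) || N(0, I) ),
    # = 0.5 * sum_j (mu_j^2 + sigma_j^2 - log sigma_j^2 - 1).
    return 0.5 * np.sum(mu**2 + np.exp(log_var) - log_var - 1.0)

def elbo_estimate(x, mu, log_var, decode, num_samples=1, rng=np.random):
    # Monte Carlo estimate of the reconstruction term: average
    # log p_theta(x | z_(l)) over samples z_(l) ~ q_phi(z | x).
    recon = 0.0
    for _ in range(num_samples):
        eps = rng.standard_normal(mu.shape)        # auxiliary noise
        z = mu + np.exp(0.5 * log_var) * eps       # reparameterized sample
        # Unit-variance Gaussian decoder log-likelihood, up to a constant.
        recon += -0.5 * np.sum((x - decode(z)) ** 2)
    recon /= num_samples
    return recon - gaussian_kl(mu, log_var)        # ELBO = reconstruction - KL
```

In an actual VAE, `decode` would be the learned decoder network and both terms would be differentiated with respect to $ϕ$ and $θ$.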
Reparameterization trick
 The latents $z_{(l)}$ obtained by passing through the encoder and then sampling need to be differentiable with respect to $ϕ$, because they are passed on to the decoder for the reconstruction
 To ensure this, each $z$ is computed as a deterministic function of the input $x$ and auxiliary noise $ϵ$
 $q_{ϕ}(z∣x)=N(z;μ_{ϕ}(x),σ_{ϕ}^{2}(x)I)$ (in theory)
 $z=μ_{ϕ}(x)+σ_{ϕ}(x)⊙ϵ$ (in practice), with $ϵ∼N(0,I)$
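A minimal numerical sketch of the trick (names are illustrative): the randomness is isolated in the parameter-free noise $ϵ$, so $z$ is a deterministic, differentiable function of $μ_{ϕ}(x)$ and $σ_{ϕ}(x)$, and gradients can flow back to the encoder parameters.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    # eps ~ N(0, I) carries all the randomness and has no parameters.
    eps = rng.standard_normal(mu.shape)
    # Parameterize sigma via the log-variance for numerical stability.
    sigma = np.exp(0.5 * log_var)
    # Elementwise: z = mu + sigma (.) eps, differentiable in mu and sigma.
    return mu + sigma * eps
```

Sampling $z$ directly from $N(μ, σ^{2})$ would put the stochasticity inside the sampling operation itself, blocking backpropagation; rewriting it this way moves the stochasticity outside the computation graph.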
Encoder
 We can use $q_{ϕ}(z∣x)$ as a meaningful and useful representation (for latent diffusion!!)
Posterior collapse
What is it
 Posterior collapse occurs when the approximate posterior $q(z∣x)$ collapses to the prior $p(z)$ irrespective of the input $x$.
 This means that the latent variable $z$ carries very little information about the input $x$, effectively rendering the latent space meaningless.
 In this scenario, the encoder outputs a distribution that is very close to the prior, making the KL divergence term very small, but at the cost of losing meaningful representations of the input data.
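One hedged way to diagnose this (an illustrative sketch, not a recipe from the source): compute the analytic KL per latent dimension over a batch of encoder outputs; dimensions whose average KL is near zero are indistinguishable from the prior and have effectively collapsed.

```python
import numpy as np

def kl_per_dimension(mu, log_var):
    # Per-dimension KL( N(mu_j, sigma_j^2) || N(0, 1) ), averaged over the
    # batch axis; mu and log_var have shape (batch, latent_dim).
    kl = 0.5 * (mu**2 + np.exp(log_var) - log_var - 1.0)
    return kl.mean(axis=0)

def collapsed_dims(mu, log_var, threshold=1e-2):
    # Indices of latent dimensions carrying almost no information about x.
    return np.where(kl_per_dimension(mu, log_var) < threshold)[0]
```

The `threshold` is a hypothetical cutoff; in practice one inspects the full per-dimension KL profile rather than a single number.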
Causes
 Overpowering decoder: if the decoder is too powerful, it can learn to reconstruct the input data directly from the prior distribution, making the latent variable $z$ redundant.
 High KL weight: during training, the KL divergence term can dominate the loss function, pushing the posterior $q(z∣x)$ to align closely with the prior $p(z)$.
How to address it
 Adjusting the KL Weight: Introducing an annealing schedule where the weight of the KL divergence term is gradually increased during training can help prevent early posterior collapse.
 VAE Variants: Using alternative VAE architectures such as β-VAE, which introduces an adjustable weight on the KL divergence term, or hierarchical VAEs, which have more complex latent structures, can help avoid posterior collapse.
 Structured Priors: Using more complex priors than a simple Gaussian can help in maintaining a meaningful latent space.
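The KL annealing idea above can be sketched as a simple linear warm-up on the KL coefficient (the schedule shape and constants are illustrative assumptions, not from the source):

```python
def kl_weight(step, warmup_steps=10_000, max_weight=1.0):
    # Linearly anneal the KL coefficient from 0 to max_weight so the model
    # first learns to use z for reconstruction, and is only gradually
    # pulled toward the prior (preventing early posterior collapse).
    return max_weight * min(1.0, step / warmup_steps)
```

During training the loss would then be something like `recon_loss + kl_weight(step) * kl_term`; setting `max_weight` to a value other than 1 recovers a β-VAE-style objective.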