Log-likelihood
Computation
 Unit: bits or nats
 the information content in bits of an event with probability $p$ is given by $-\log_{2}(p)$
 nats = $-\log_{e}(p)$
 Get $p_{\theta}(x_{t-1}\mid x_{t})=\mathcal{N}(x_{t-1};\mu_{\theta}(x_{t},t),\Sigma_{\theta}(x_{t},t))$, but you input $x_t = x_0 +$ very little noise and $t=0$
 basically return the mean and variance defined in The diffusion process
 Given true $x_{0}$,
 Literally compute $p(x_{0}∣μ_{θ}(x_{0},0),Σ_{θ}(x_{0},0))$
 You have to do tricks with the CDF when applying this to images because we're in continuous space: integrate the density over each pixel's discrete bin, i.e. $P(X \le x_0 + 1/255) - P(X \le x_0 - 1/255)$
 Sometimes also defined as the sum of all terms of $L_{\text{vlb}}$ (the variational lower bound)
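The CDF trick above can be sketched in a few lines (a sketch, not a full implementation: `discretized_gaussian_loglik` is an illustrative name, and pixels are assumed scaled to $[-1, 1]$ so one intensity level spans $2/255$, with a bin half-width of $1/255$):

```python
import math

def discretized_gaussian_loglik(x0, mu, sigma, bin_half=1.0 / 255):
    """Log-probability that a Gaussian N(mu, sigma^2) lands in the pixel bin around x0."""
    def cdf(v):
        # Gaussian CDF via the error function
        return 0.5 * (1.0 + math.erf((v - mu) / (sigma * math.sqrt(2.0))))

    # Integrate the density over the bin [x0 - 1/255, x0 + 1/255].
    # (Real implementations, e.g. Ho et al. 2020, use open tails for the
    # edge bins at x0 = -1 and x0 = +1; omitted here for brevity.)
    p = cdf(x0 + bin_half) - cdf(x0 - bin_half)
    return math.log(max(p, 1e-12))  # clamp to avoid log(0)
```

To report the result in bits rather than nats, divide the negative log-likelihood by $\log 2$.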
Quality
Fréchet Inception Distance (FID)
FID is a very popular metric for evaluating the quality of images generated by models like GANs and diffusion models. It measures the distance between the feature vectors of real and generated images.

Calculation: The FID is calculated by first using a feature extractor like the Inception network to transform both the set of real images and the set of generated images into a feature space. Then, it calculates the mean and covariance of these feature vectors for both real and generated images. The FID score is then the Fréchet distance (also known as the Wasserstein-2 distance) between these two Gaussian distributions:
$$\text{FID} = \lVert \mu_x - \mu_g \rVert^2 + \mathrm{Tr}\left(\Sigma_x + \Sigma_g - 2(\Sigma_x \Sigma_g)^{1/2}\right)$$
where $\mu_x, \Sigma_x$ are the mean and covariance of the real data features, and $\mu_g, \Sigma_g$ are those for the generated data.

Purpose: Lower FID scores indicate that the distributions of generated images are closer to the real images, suggesting better quality and diversity.
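Given the feature vectors, the formula above is a few lines of NumPy/SciPy (a minimal sketch; `fid`, `real_feats`, and `gen_feats` are illustrative names, and in practice the features come from a pretrained Inception network):

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(real_feats, gen_feats):
    """Fréchet distance between Gaussians fit to (N, D) feature arrays."""
    mu_x, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    sigma_x = np.cov(real_feats, rowvar=False)
    sigma_g = np.cov(gen_feats, rowvar=False)

    # Matrix square root of the covariance product
    covmean = sqrtm(sigma_x @ sigma_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # drop tiny imaginary parts from numerical error

    return float(np.sum((mu_x - mu_g) ** 2)
                 + np.trace(sigma_x + sigma_g - 2.0 * covmean))
```

Note the FID of a feature set against itself is (up to numerical error) zero, which is a handy sanity check.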
Diversity
Inception Score (IS)
The Inception Score is another metric used primarily for images. It uses the Inception model to calculate the diversity and quality of generated images.

Calculation: IS uses the conditional label distribution $p(y\mid x)$ predicted by the Inception model for each image $x$ generated by the diffusion model. The score is computed as:
$$\text{IS} = \exp\left(\mathbb{E}_{x}\left[D_{KL}\big(p(y\mid x)\,\|\,p(y)\big)\right]\right)$$
where $D_{KL}$ is the Kullback-Leibler divergence between the conditional distribution $p(y\mid x)$ and the marginal distribution $p(y)$, which is obtained by averaging $p(y\mid x)$ over all generated images.

Purpose: A higher Inception Score indicates that the generated images are both meaningful (the model is confident about the labels) and diverse (different images have different predicted labels).