Log-likelihood

Computation

  • Unit: bits or nats
    • the information content in bits of an event with probability $p$ is given by $-\log_2 p$
    • in nats it is $-\ln p$, so nats = bits × $\ln 2$
  • Get but you input x_t=x_0 + very little noise and t=0
  • Given true ,
    • Literally compute
    • You have to do weird tricks with the CDF when applying this to images because we're in continuous space: basically compute P(X ≤ x0 + 1/255) - P(X ≤ x0 - 1/255), i.e. the probability mass of the bin around the pixel value
  • Sometimes also defined as the sum of all part of
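
The CDF trick for discrete pixel values can be sketched as follows. This is a minimal sketch, not the notes' own implementation: it assumes pixel values scaled to [-1, 1] (so adjacent intensity levels are 2/255 apart) and a per-pixel Gaussian with a known mean and standard deviation. The function names, the 0.999 edge thresholds, and the clamping constant are illustrative assumptions.

```python
import math


def gaussian_cdf(x, mean, std):
    """CDF of a univariate Gaussian N(mean, std^2) evaluated at x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def discretized_log_likelihood(x0, mean, std, half_bin=1.0 / 255.0):
    """Log-probability (in nats) of the bin [x0 - half_bin, x0 + half_bin].

    Assumption: the two edge bins are extended to -inf / +inf so that the
    256 bin probabilities sum to one.
    """
    cdf_hi = 1.0 if x0 > 0.999 else gaussian_cdf(x0 + half_bin, mean, std)
    cdf_lo = 0.0 if x0 < -0.999 else gaussian_cdf(x0 - half_bin, mean, std)
    # Clamp to avoid log(0) when the bin lies far in the tail.
    return math.log(max(cdf_hi - cdf_lo, 1e-12))


def nats_to_bits(nats):
    """Convert a quantity measured in nats to bits."""
    return nats / math.log(2.0)


if __name__ == "__main__":
    ll_nats = discretized_log_likelihood(x0=0.2, mean=0.1, std=0.3)
    print("log-likelihood:", ll_nats, "nats =", nats_to_bits(ll_nats), "bits")
```

Summing the per-pixel values and dividing by the number of dimensions gives a per-dimension figure; negating it and converting with `nats_to_bits` gives bits per dimension.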

Quality

Fréchet Inception Distance (FID)

FID is a very popular metric for evaluating the quality of images generated by models like GANs and diffusion models. It measures the distance between the distributions of feature vectors extracted from real and generated images.

  • Calculation: The FID is calculated by first using a feature extractor like the Inception network to transform both the set of real images and the set of generated images into a feature space. Then, it calculates the mean and covariance of these feature vectors for both real and generated images, which amounts to fitting a Gaussian to each set. The FID score is then the Fréchet distance (also known as the Wasserstein-2 distance) between these two Gaussian distributions:

    $\mathrm{FID} = \lVert \mu_x - \mu_g \rVert^2 + \mathrm{Tr}\left(\Sigma_x + \Sigma_g - 2(\Sigma_x \Sigma_g)^{1/2}\right)$

    where $\mu_x, \Sigma_x$ are the mean and covariance of the real data features, and $\mu_g, \Sigma_g$ are those for the generated data.

  • Purpose: Lower FID scores indicate that the distributions of generated images are closer to the real images, suggesting better quality and diversity.
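
The Fréchet distance between the two fitted Gaussians can be sketched in NumPy as follows. This is a minimal sketch that assumes the features have already been extracted into arrays of shape (N, D); the function name is hypothetical. The extractor itself is out of scope here. Instead of an explicit matrix square root, it relies on the fact that the trace of $(\Sigma_x \Sigma_g)^{1/2}$ equals the sum of the square roots of the eigenvalues of $\Sigma_x \Sigma_g$.

```python
import numpy as np


def frechet_distance(feats_real, feats_gen):
    """Frechet distance between Gaussians fitted to two (N, D) feature arrays."""
    feats_real = np.asarray(feats_real, dtype=np.float64)
    feats_gen = np.asarray(feats_gen, dtype=np.float64)

    mu_x = feats_real.mean(axis=0)
    mu_g = feats_gen.mean(axis=0)
    # rowvar=False: each row is one sample, each column one feature.
    sigma_x = np.atleast_2d(np.cov(feats_real, rowvar=False))
    sigma_g = np.atleast_2d(np.cov(feats_gen, rowvar=False))

    # Eigenvalues of a product of two PSD matrices are real and non-negative
    # in exact arithmetic; clip small negative values caused by round-off.
    eigvals = np.linalg.eigvals(sigma_x @ sigma_g)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_x - mu_g
    return float(diff @ diff + np.trace(sigma_x) + np.trace(sigma_g) - 2.0 * tr_sqrt)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    real = rng.normal(size=(500, 4))
    fake = rng.normal(loc=0.5, size=(500, 4))
    print("Frechet distance:", frechet_distance(real, fake))
```

With few samples relative to the feature dimension, the covariance estimates become noisy, so scores are only comparable when computed with the same sample counts and the same extractor.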

Diversity

Inception Score (IS)

The Inception Score is another metric used primarily for images. It uses the Inception model to calculate the diversity and quality of generated images.

  • Calculation: IS uses the conditional label distribution $p(y \mid x)$ predicted by the Inception model for each image generated by the diffusion model. The score is computed as:

    $\mathrm{IS} = \exp\left(\mathbb{E}_x\left[D_{KL}\left(p(y \mid x) \,\|\, p(y)\right)\right]\right)$

    where $D_{KL}$ is the Kullback-Leibler divergence between the conditional distribution $p(y \mid x)$ and the marginal distribution $p(y)$, which is obtained by averaging $p(y \mid x)$ over all generated images.

  • Purpose: A higher Inception Score indicates that the generated images are both meaningful (the model is confident about the labels) and diverse (different images have different predicted labels).
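
The Inception Score formula can be sketched in NumPy as follows. This is a minimal sketch that assumes the classifier's softmax outputs are already available as an array of shape (N, K), one row per generated image. The function name and the `eps` smoothing constant are illustrative assumptions.

```python
import numpy as np


def inception_score(probs, eps=1e-12):
    """exp of the mean KL(p(y|x) || p(y)) over an (N, K) array of class probabilities."""
    probs = np.asarray(probs, dtype=np.float64)
    # Marginal p(y): average of the conditional distributions over all images.
    p_y = probs.mean(axis=0, keepdims=True)
    # Per-image KL divergence; eps guards against log(0).
    kl = (probs * (np.log(probs + eps) - np.log(p_y + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))


if __name__ == "__main__":
    confident_and_diverse = np.eye(4)          # each image gets a different label
    uninformative = np.full((4, 4), 0.25)      # every prediction is uniform
    print(inception_score(confident_and_diverse))
    print(inception_score(uninformative))
```

The two toy inputs bracket the range of the score: predictions that are all uniform give the minimum of 1, while confident predictions spread evenly over K classes give the maximum of K.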