• A type of statistical distance: a measure of how one probability distribution P is different from a second, reference probability distribution Q.

  • A simple interpretation of the KL divergence of P from Q is the expected excess surprise from using Q as a model instead of P when the actual distribution is P.

    • i.e. for every outcome $x$, how far away is the ratio $P(x)/Q(x)$ from $1$? (written out below)
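
For reference, the divergence is the expectation of that log-ratio under $P$; the discrete form is shown here (the continuous case replaces the sum with an integral):

$$
D_{\mathrm{KL}}(P \,\|\, Q) \;=\; \mathbb{E}_{x \sim P}\!\left[\log \frac{P(x)}{Q(x)}\right] \;=\; \sum_x P(x) \log \frac{P(x)}{Q(x)}
$$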

Properties

  • $D_{\mathrm{KL}}(P \,\|\, Q) \ge 0$, with equality if and only if $P = Q$ (almost everywhere); this is Gibbs' inequality.
  • Asymmetric: $D_{\mathrm{KL}}(P \,\|\, Q) \ne D_{\mathrm{KL}}(Q \,\|\, P)$ in general, and the triangle inequality does not hold, so it is a statistical distance but not a metric.
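
A quick numerical illustration of the asymmetry (a sketch using torch.distributions; the parameter values are arbitrary):

import torch as th

p = th.distributions.Normal(th.tensor(0.0), th.tensor(1.0))
q = th.distributions.Normal(th.tensor(1.0), th.tensor(2.0))

# Swapping the arguments changes the value: KL is not symmetric.
print(th.distributions.kl_divergence(p, q))  # tensor(0.4431)
print(th.distributions.kl_divergence(q, p))  # tensor(1.3069)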

Gaussian distributions

Computing the KL

import torch as th

def normal_kl(mean1, logvar1, mean2, logvar2):
    # KL divergence between two diagonal Gaussians, computed elementwise.
    # Variances are passed as log-variances for numerical stability.
    return 0.5 * (
        -1.0
        + logvar2
        - logvar1
        + th.exp(logvar1 - logvar2)  # sigma1^2 / sigma2^2
        + ((mean1 - mean2) ** 2) * th.exp(-logvar2)  # (mu1 - mu2)^2 / sigma2^2
    )
  • Full equation, for multivariate Gaussians $\mathcal{N}(\mu_1, \Sigma_1)$ and $\mathcal{N}(\mu_2, \Sigma_2)$ in $k$ dimensions:

    $$
    D_{\mathrm{KL}}\big(\mathcal{N}(\mu_1, \Sigma_1) \,\|\, \mathcal{N}(\mu_2, \Sigma_2)\big) = \frac{1}{2}\left[\operatorname{tr}\!\left(\Sigma_2^{-1}\Sigma_1\right) + (\mu_2 - \mu_1)^\top \Sigma_2^{-1}(\mu_2 - \mu_1) - k + \ln\frac{\det\Sigma_2}{\det\Sigma_1}\right]
    $$

  • For single-variate:

    $$
    D_{\mathrm{KL}}\big(\mathcal{N}(\mu_1, \sigma_1^2) \,\|\, \mathcal{N}(\mu_2, \sigma_2^2)\big) = \frac{1}{2}\left[\ln\frac{\sigma_2^2}{\sigma_1^2} - 1 + \frac{\sigma_1^2}{\sigma_2^2} + \frac{(\mu_1 - \mu_2)^2}{\sigma_2^2}\right]
    $$

  • The above code makes the simplifying assumption that the covariance matrices $\Sigma_1$ and $\Sigma_2$ are diagonal.
  • Thus, it applies the single-variate formula in parallel to all dimensions (with $\sigma^2 = e^{\text{logvar}}$ per dimension); summing over dimensions gives the total KL. A sanity check follows below.
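
As a sketch (assuming normal_kl from above is in scope; the example tensors are arbitrary), the elementwise result can be checked against torch's analytic Gaussian KL:

import torch as th

mean1, logvar1 = th.tensor([0.0, 1.0]), th.tensor([0.0, -0.5])
mean2, logvar2 = th.tensor([1.0, 0.0]), th.tensor([0.3, 0.0])

# Per-dimension KL from the code above; calling .sum() on it would
# give the KL of the full diagonal Gaussian.
kl = normal_kl(mean1, logvar1, mean2, logvar2)

# Reference value from torch.distributions
# (Normal takes a standard deviation, hence exp(0.5 * logvar)).
p = th.distributions.Normal(mean1, th.exp(0.5 * logvar1))
q = th.distributions.Normal(mean2, th.exp(0.5 * logvar2))
print(th.allclose(kl, th.distributions.kl_divergence(p, q)))  # True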