Score-based Generative Models

We can define arbitrarily flexible distributions using the Boltzmann distribution
- $p_{θ}(x) = \frac{1}{Z_{θ}} e^{-f_{θ}(x)}$
- The normalization constant $Z_{θ}$ is intractable for a complex $f_{θ}$
- The score function removes it: taking the log turns $\frac{1}{Z_{θ}}$ into an additive term $-\log Z_{θ}$, which does not depend on $x$ and so vanishes under the gradient
- $\nabla_{x} \log p_{θ}(x) = -\nabla_{x} f_{θ}(x) \approx s_{θ}(x)$
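
As a quick numerical check of the point above, here is a minimal sketch in one dimension, where $Z_{θ}$ can still be approximated on a grid. The double-well energy `energy` and all function names are hypothetical choices for illustration, not something from these notes. The score computed from the energy alone should match a finite-difference gradient of the fully normalized log-density.

```python
import numpy as np

def energy(x):
    # Hypothetical double-well energy f(x); any differentiable f would do.
    return 0.25 * x**4 - 0.5 * x**2

def score_from_energy(x):
    # -df/dx, derived by hand. The normalization constant never appears.
    return -(x**3 - x)

# Approximate Z with a Riemann sum on a wide grid.
# This is only feasible because the example is one-dimensional.
grid = np.linspace(-6.0, 6.0, 200001)
dx = grid[1] - grid[0]
Z = np.sum(np.exp(-energy(grid))) * dx

def log_p(x):
    # Fully normalized log-density, including the -log Z term.
    return -energy(x) - np.log(Z)

def score_finite_difference(x, h=1e-5):
    # Central difference of log p; the constant -log Z cancels out.
    return (log_p(x + h) - log_p(x - h)) / (2 * h)

xs = np.array([-1.5, -0.3, 0.0, 0.7, 2.0])
max_gap = np.max(np.abs(score_from_energy(xs) - score_finite_difference(xs)))
print("largest disagreement between the two scores:", max_gap)
```

The two scores agree up to finite-difference error, even though only one of them ever touches $Z$.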

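The sampling procedure is Markov chain Monte Carlo, specifically Langevin dynamics: $x_{i + 1} \leftarrow x_{i} + c \nabla \log p (x_{i}) + \sqrt{2 c}\, ϵ$
where $x_{0}$ is randomly sampled from a prior distribution (such as uniform), $c$ is the step size, and $ϵ \sim N (0, I)$ is Gaussian noise that keeps the samples from all collapsing onto a mode, so that they spread around the modes instead.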
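
The update can be sketched in a few lines of NumPy. This is an illustrative setup only: the target is a hypothetical equal-weight 1-D Gaussian mixture whose score is known in closed form (in practice it would be replaced by a learned $s_{θ}$), and the step size, step count, and function names are assumptions. The noise term is scaled by $\sqrt{2c}$, as in standard Langevin dynamics.

```python
import numpy as np

def mixture_score(x, means=(-2.0, 2.0), sigma=0.5):
    # Closed-form score of an equal-weight 1-D Gaussian mixture (hypothetical target).
    means = np.asarray(means)
    diffs = x[:, None] - means[None, :]              # shape (n, k)
    log_w = -0.5 * (diffs / sigma) ** 2              # unnormalized log responsibilities
    log_w -= log_w.max(axis=1, keepdims=True)        # stabilize the softmax
    w = np.exp(log_w)
    w /= w.sum(axis=1, keepdims=True)
    # grad log p(x) = sum_k w_k(x) * (-(x - mu_k) / sigma^2)
    return -(w * diffs).sum(axis=1) / sigma**2

def langevin_sample(score_fn, n_samples=2000, n_steps=500, c=0.01, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.uniform(-4.0, 4.0, size=n_samples)       # x_0 drawn from a uniform prior
    for _ in range(n_steps):
        eps = rng.standard_normal(n_samples)         # fresh Gaussian noise each step
        x = x + c * score_fn(x) + np.sqrt(2 * c) * eps
    return x

samples = langevin_sample(mixture_score)
near_mode = np.minimum(np.abs(samples - 2.0), np.abs(samples + 2.0)) < 1.5
print("fraction of samples near a mode:", near_mode.mean())
```

Setting the noise to zero in this sketch makes every chain slide to the nearest mode and stay there, which is the collapse the noise term is meant to avoid.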

What does the score function represent? For every $x$ , taking the gradient of its log likelihood with respect to $x$ essentially describes what direction in data space to move in order to further increase its likelihood.
Intuitively, then, the score function defines a vector field over the entire space that data $x$ inhabits, pointing towards the modes.

Tweedie’s formula states that the true mean of an exponential-family distribution, given samples drawn from it, can be estimated by the maximum likelihood estimate of the samples (i.e., the empirical mean) plus a correction term involving the score of the estimate. The score serves as a correction for sample bias.
For a Gaussian variable $z \sim N (z; μ_{z}, Σ_{z})$, Tweedie’s formula states that: $E [μ_{z} ∣ z] = z + Σ_{z} \nabla_{z} \log p (z)$
We know that our noisy samples are distributed as $x_{t} \sim N (x_{t}; \sqrt{\bar{α}_{t}}\, x_{0}, (1 - \bar{α}_{t}) I)$
Thus, by Tweedie’s formula, $E [μ_{x_{t}} ∣ x_{t}] = x_{t} + (1 - \bar{α}_{t}) \nabla_{x_{t}} \log p (x_{t})$
Since the true mean is $μ_{x_{t}} = \sqrt{\bar{α}_{t}}\, x_{0}$, solving for $x_{0}$ gives our best estimate: $x_{0} = \frac{x_{t} + (1 - \bar{α}_{t}) \nabla_{x_{t}} \log p (x_{t})}{\sqrt{\bar{α}_{t}}}$
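
The estimate can be sanity-checked in a toy case where everything is available in closed form. The sketch below assumes the standard forward process $x_{t} = \sqrt{\bar{α}_{t}}\, x_{0} + \sqrt{1 - \bar{α}_{t}}\, ϵ_{0}$ and a hypothetical scalar data distribution $x_{0} \sim N (m, s^{2})$. The values of `m`, `s`, and `alpha_bar` are arbitrary. Under these assumptions the marginal of $x_{t}$ is Gaussian, so its score is known exactly and the Tweedie estimate can be compared against the exact posterior mean $E [x_{0} ∣ x_{t}]$.

```python
import numpy as np

rng = np.random.default_rng(0)
m, s = 1.5, 0.8        # hypothetical data distribution x_0 ~ N(m, s^2)
alpha_bar = 0.6        # hypothetical cumulative noise level at some step t

# Forward process: x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * eps_0
x0 = rng.normal(m, s, size=10_000)
eps0 = rng.standard_normal(x0.shape)
xt = np.sqrt(alpha_bar) * x0 + np.sqrt(1 - alpha_bar) * eps0

# Marginal p(x_t) is N(sqrt(alpha_bar) * m, alpha_bar * s^2 + 1 - alpha_bar),
# so its score is available exactly in this toy case.
var_t = alpha_bar * s**2 + (1 - alpha_bar)
score = -(xt - np.sqrt(alpha_bar) * m) / var_t

# Tweedie-based estimate of x_0 from x_t and the score.
x0_tweedie = (xt + (1 - alpha_bar) * score) / np.sqrt(alpha_bar)

# Exact posterior mean E[x_0 | x_t] from Gaussian conditioning, for comparison.
x0_posterior = m + np.sqrt(alpha_bar) * s**2 / var_t * (xt - np.sqrt(alpha_bar) * m)

mse_tweedie = np.mean((x0_tweedie - x0) ** 2)
mse_naive = np.mean((xt / np.sqrt(alpha_bar) - x0) ** 2)   # rescaling only, no score
print("MSE with score correction:", mse_tweedie)
print("MSE without score correction:", mse_naive)
```

The score-corrected estimate coincides with the posterior mean here and has a lower error than simply rescaling $x_{t}$, which illustrates the score acting as a correction term.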
Thus, $s_{θ} (x_{t}, t)$ should be a neural network that learns to predict the score function $\nabla_{x_{t}} \log p (x_{t})$, the gradient of the log-density of $x_{t}$ with respect to $x_{t}$ in data space, for any noise level $t$. The score function measures how to move in data space to increase the log probability.
The source noise $ϵ_{0}$ and the score $\nabla_{x_{t}} \log p (x_{t})$ describe something very similar: they differ only by a negative, time-dependent scale factor, i.e. $\nabla_{x_{t}} \log p (x_{t}) = - \frac{1}{\sqrt{1 - \bar{α}_{t}}}\, ϵ_{0}$
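
A short derivation of this relation, assuming the usual reparameterization $x_{t} = \sqrt{\bar{α}_{t}}\, x_{0} + \sqrt{1 - \bar{α}_{t}}\, ϵ_{0}$ (not written out in these notes): solve it for $x_{0}$ and equate the result with the Tweedie estimate of $x_{0}$.

$$
\begin{aligned}
x_{0} &= \frac{x_{t} - \sqrt{1 - \bar{α}_{t}}\, ϵ_{0}}{\sqrt{\bar{α}_{t}}} = \frac{x_{t} + (1 - \bar{α}_{t}) \nabla_{x_{t}} \log p (x_{t})}{\sqrt{\bar{α}_{t}}} \\
\Rightarrow \quad (1 - \bar{α}_{t}) \nabla_{x_{t}} \log p (x_{t}) &= - \sqrt{1 - \bar{α}_{t}}\, ϵ_{0} \\
\Rightarrow \quad \nabla_{x_{t}} \log p (x_{t}) &= - \frac{1}{\sqrt{1 - \bar{α}_{t}}}\, ϵ_{0}
\end{aligned}
$$

Since the noise was added to the clean sample, moving against it is the direction that increases the log probability, which is why the score points opposite to $ϵ_{0}$.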

🤖 Harold's Notes