Training
- (Ho et al., 2020) Denoising loss: $L_{\text{simple}} = \mathbb{E}_{t, x_0, \epsilon}\big[\lVert \epsilon - \epsilon_\theta(x_t, t) \rVert^2\big]$
- This objective can be seen as a reweighted form of $L_{\text{vlb}}$ (without the terms affecting $\Sigma_\theta$). The authors found that optimizing this reweighted objective resulted in much better sample quality than optimizing $L_{\text{vlb}}$ directly, and explain this by drawing a connection to generative score matching (Song & Ermon, 2019; 2020).
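A minimal sketch of this loss in PyTorch, assuming an image batch `x0` and a user-supplied `model(x_t, t)` that predicts the added noise (all names here are illustrative, not the paper's code):

```python
# Minimal sketch of the DDPM "simple" denoising loss (Ho et al., 2020).
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)               # linear variance schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)  # \bar{alpha}_t

def l_simple(model, x0):
    """E_{t, x0, eps} [ || eps - eps_theta(x_t, t) ||^2 ]"""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,), device=x0.device)        # uniform timesteps
    eps = torch.randn_like(x0)
    a_bar = alphas_cumprod.to(x0.device)[t].view(b, 1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps   # sample from q(x_t | x_0)
    return F.mse_loss(model(x_t, t), eps)                  # noise matching
```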
"Improved denoising diffusion probabilistic models" (TLDR: they learn $\Sigma_\theta(x_t, t)$)
Learning the variance
- We want to fit $p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(\mu_\theta(x_t, t), \Sigma_\theta(x_t, t)\big)$
- The variational lower bound loss $L_{\text{vlb}} = L_0 + L_1 + \dots + L_T$ is derived from the VDM, with $L_{t-1} = D_{\mathrm{KL}}\big(q(x_{t-1} \mid x_t, x_0) \,\|\, p_\theta(x_{t-1} \mid x_t)\big)$ for $1 < t \le T$ and $L_0 = -\log p_\theta(x_0 \mid x_1)$
- To fit $\mu_\theta$, one can simply reparametrize it as a noise prediction network $\epsilon_\theta$ and use $L_{\text{simple}}$
- However, to train $\Sigma_\theta$, one must use the full $L_{\text{vlb}}$
- They optimize the hybrid objective $L_{\text{hybrid}} = L_{\text{simple}} + \lambda L_{\text{vlb}}$ with $\lambda = 0.001$
- $\Sigma_\theta$ is only trained using $L_{\text{vlb}}$, thus they use a stop-grad on the predicted mean when computing $L_{\text{vlb}}$ (see the sketch at the end of this section) (https://github.com/openai/improved-diffusion/blob/1bc7bbbdc414d83d4abf2ad8cc1446dc36c4e4d5/improved_diffusion/gaussian_diffusion.py#L679)
- simply use `mean_pred.detach()` as the mean when computing the VLB.
- For sample quality, the first few steps of the diffusion don't really matter, i.e. they only affect very small details. HOWEVER, for maximizing log-likelihood, the first few steps of the diffusion process matter the most, as they contribute the most to the variational lower bound (Fig. 2 of the paper)
- This is because the likelihood of a training sample with very little noise added must still be well calibrated and high for that sample, which is not really taken into account when using only the noise-matching loss.
- In "Improved denoising diffusion probabilistic models", they parametrize the variance as:
- $\Sigma_\theta(x_t, t) = \exp\big(v \log \beta_t + (1 - v) \log \tilde{\beta}_t\big)$, where $\beta_t$ is the variance schedule and $\tilde{\beta}_t$ is the variance of the posterior $q(x_{t-1} \mid x_t, x_0)$
- and $v$ is a vector (output by the model) containing one component per dimension
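A sketch tying these pieces together, assuming a model whose output has twice the channels of $x$ (the noise prediction and the interpolation vector $v$ stacked along channels); `vlb_kl_term` is a hypothetical helper computing the per-sample $L_{\text{vlb}}$ term (the real implementation lives in the linked improved-diffusion repo):

```python
# Sketch of the Improved-DDPM hybrid loss with a learned variance.
# Assumptions: model(x_t, t) returns [eps_pred, v] stacked along channels;
# betas, alphas_cumprod, posterior_log_variance are precomputed 1-D tensors;
# vlb_kl_term is a HYPOTHETICAL helper returning the per-sample L_vlb term.
import torch
import torch.nn.functional as F

def hybrid_loss(model, x0, betas, alphas_cumprod, posterior_log_variance,
                vlb_kl_term, lam=0.001):
    b = x0.shape[0]
    t = torch.randint(0, len(betas), (b,), device=x0.device)
    eps = torch.randn_like(x0)
    a_bar = alphas_cumprod[t].view(b, 1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps

    eps_pred, v = model(x_t, t).chunk(2, dim=1)

    # Sigma_theta(x_t, t) = exp(v * log(beta_t) + (1 - v) * log(beta_tilde_t)),
    # with the raw output mapped from [-1, 1] to [0, 1] as in improved-diffusion.
    frac = (v + 1) / 2
    log_var = frac * betas[t].log().view(b, 1, 1, 1) \
        + (1 - frac) * posterior_log_variance[t].view(b, 1, 1, 1)

    # L_simple trains the mean (through eps_pred)...
    l_simple = F.mse_loss(eps_pred, eps)

    # ...while L_vlb only trains the variance: stop-grad through the mean.
    l_vlb = vlb_kl_term(x0, x_t, t, eps_pred.detach(), log_var)

    return l_simple + lam * l_vlb.mean()   # L_hybrid, lambda = 0.001
```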
Better noise schedule
- cosine schedule: $\bar{\alpha}_t = \frac{f(t)}{f(0)}$ with $f(t) = \cos\!\left(\frac{t/T + s}{1 + s} \cdot \frac{\pi}{2}\right)^2$ (sketch below)
- they use a small offset $s$ to prevent $\beta_t$ from being too small near $t = 0$
- They selected $s$ such that $\sqrt{\beta_0}$ was slightly smaller than the pixel bin size $1/127.5$.
- $s$ can be smaller for other modalities.
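A short sketch of the cosine schedule with the paper's offset $s = 0.008$; the `max_beta` clipping near $t = T$ mirrors what improved-diffusion does:

```python
# Cosine noise schedule from Nichol & Dhariwal (2021).
import math
import torch

def cosine_betas(T: int, s: float = 0.008, max_beta: float = 0.999) -> torch.Tensor:
    """beta_t from alpha_bar(t) = f(t)/f(0), f(t) = cos((t/T + s)/(1 + s) * pi/2)^2."""
    f = lambda u: math.cos((u / T + s) / (1 + s) * math.pi / 2) ** 2
    # beta_t = 1 - alpha_bar(t) / alpha_bar(t-1); clip to avoid singularities near t = T
    betas = [min(1 - f(t + 1) / f(t), max_beta) for t in range(T)]
    return torch.tensor(betas)
```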
Reducing gradient noise
- $L_{\text{vlb}}$ introduces a lot of gradient noise
- gradient noise = l2 norm of concatenated gradient
- Noting that different terms of $L_{\text{vlb}}$ have greatly different magnitudes (Figure 2 of the paper), they hypothesized that sampling $t$ uniformly causes unnecessary noise in the $L_{\text{vlb}}$ objective
- A simple importance sampling technique reduces this noise: $L_{\text{vlb}} = \mathbb{E}_{t \sim p_t}\!\left[\frac{L_t}{p_t}\right]$ (sketch below)
- where $p_t \propto \sqrt{\mathbb{E}[L_t^2]}$ and $\sum_t p_t = 1$; $\mathbb{E}[L_t^2]$ is estimated online from a history of the 10 most recent losses per timestep
- They found that the importance sampling technique was not helpful when optimizing the less-noisy $L_{\text{hybrid}}$ objective directly.
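A sketch of this loss-aware timestep sampling, with the warm-up, history size, and class/function names chosen here for illustration (the paper's own implementation is in improved-diffusion):

```python
# Loss-aware importance sampling of timesteps (Nichol & Dhariwal, 2021):
# p_t ∝ sqrt(E[L_t^2]) estimated from the 10 most recent values of L_t per timestep,
# each sampled term reweighted by 1 / p_t. Uniform sampling is used until every
# timestep has enough history.
import torch

class LossAwareSampler:
    def __init__(self, T: int, history: int = 10):
        self.T = T
        self.history = history
        self.losses = [[] for _ in range(T)]   # recent L_t values per timestep

    def probs(self) -> torch.Tensor:
        if any(len(h) < self.history for h in self.losses):
            return torch.full((self.T,), 1.0 / self.T)      # warm-up: uniform
        w = torch.tensor([(sum(x * x for x in h) / len(h)) ** 0.5 for h in self.losses])
        return w / w.sum()                                   # p_t ∝ sqrt(E[L_t^2])

    def sample(self, batch: int):
        p = self.probs()
        t = torch.multinomial(p, batch, replacement=True)
        return t, p[t]                                       # timesteps and their probs

    def update(self, t: torch.Tensor, l_t: torch.Tensor):
        for ti, li in zip(t.tolist(), l_t.detach().tolist()):
            self.losses[ti] = (self.losses[ti] + [li])[-self.history:]

# Usage in a training step, with a hypothetical vlb_term_fn(x0, t) -> per-sample L_t:
#   t, p = sampler.sample(x0.shape[0])
#   l_t = vlb_term_fn(x0, t)
#   loss = (l_t / p.to(l_t.device)).mean()    # importance-weighted L_vlb
#   sampler.update(t, l_t)
```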
Würstchen (learned noise gating trick)
- We have $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$
- To predict $\epsilon$, they have the network output two maps $a$ and $b$ and compute $\bar{\epsilon} = \frac{x_t - a}{b}$
- with $b = \mathrm{sigmoid}(\tilde{b}) \cdot (1 - 2e) + e$ for a small $e$ (e.g. $10^{-3}$); $a$ and $b$ have the same dimension as the noise $\epsilon$. The division is element-wise.
- It makes training more stable. They hypothesize this is because the model parameters are initialized to predict 0 at the beginning, which gives a large error at timesteps with a lot of noise (where the target noise is essentially the input). With this reformulation of the objective, the model initially returns (a scaling of) the input, making the loss small for very noised inputs.
- noise prediction: https://github.com/dome272/Wuerstchen/blob/main/modules.py#L307
- diffusion implementation: https://github.com/pabloppp/pytorch-tools/blob/master/torchtools/utils/diffusion.py
- Additionally, they do P2-weighted noise matching (see the sketch below)
- where the per-timestep weight is $w_t = \big(k + \mathrm{SNR}(t)\big)^{-\gamma}$ with $\mathrm{SNR}(t) = \frac{\bar{\alpha}_t}{1 - \bar{\alpha}_t}$ (P2 weighting, Choi et al., 2022)
- making higher noise levels contribute more to the loss
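A sketch of the gating plus the P2 weighting, assuming a `model(x_t, t)` whose output has twice the channels of $x$ (split into $a$ and $b$); this mirrors the idea in the linked modules.py / diffusion.py, not the exact code:

```python
import torch
import torch.nn.functional as F

def gated_eps_prediction(raw, x_t, e=1e-3):
    """eps_pred = (x_t - a) / b, with b squashed into [e, 1 - e] by a sigmoid."""
    a, b = raw.chunk(2, dim=1)
    b = torch.sigmoid(b) * (1 - 2 * e) + e
    # at init (raw ≈ 0): a ≈ 0, b ≈ 0.5, so eps_pred is roughly a scaling of x_t
    return (x_t - a) / b                        # element-wise division

def p2_weight(a_bar, k=1.0, gamma=1.0):
    """P2 weighting: w_t = (k + SNR(t))^-gamma, SNR(t) = a_bar / (1 - a_bar)."""
    return (k + a_bar / (1 - a_bar)) ** -gamma  # bigger weight at high noise (low SNR)

def gated_p2_loss(model, x0, t, alphas_cumprod):
    b = x0.shape[0]
    eps = torch.randn_like(x0)
    a_bar = alphas_cumprod[t].view(b, 1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps
    eps_pred = gated_eps_prediction(model(x_t, t), x_t)
    per_sample = F.mse_loss(eps_pred, eps, reduction="none").mean(dim=(1, 2, 3))
    return (p2_weight(a_bar.view(b)) * per_sample).mean()
```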
Analyzing and Improving the Training Dynamics of Diffusion Models
Noisy training signal
- The training dynamics of diffusion models remain challenging due to the highly stochastic loss function.
- The final image quality is dictated by faint image details predicted throughout the sampling chain
- small mistakes at intermediate steps can have snowball effects in subsequent iterations
- The network must accurately estimate the average clean image across a vast range of noise levels, Gaussian noise realizations, and conditioning inputs.
- Learning to do so is difficult given the chaotic training signal that is randomized over all of these aspects