Training
 (Ho et al., 2020) Denoising loss: $L_{simple}=\|\epsilon_\theta(x_t,t)-\epsilon_t\|^2$
 This objective can be seen as a reweighted form of $L_{VLB}$ (without the terms affecting $\Sigma_\theta$). The authors found that optimizing this reweighted objective resulted in much better sample quality than optimizing $L_{VLB}$ directly, and explain this by drawing a connection to generative score matching (Song & Ermon, 2019; 2020).
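A minimal sketch of one Monte Carlo evaluation of this loss; the linear beta schedule values and the zero-noise dummy predictor below are illustrative assumptions, not the paper's network:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear beta schedule (DDPM-style defaults) and cumulative products.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

def l_simple(eps_pred_fn, x0):
    """One Monte Carlo sample of L_simple = ||eps_theta(x_t, t) - eps||^2."""
    t = rng.integers(T)
    eps = rng.standard_normal(x0.shape)
    # Closed-form forward process: x_t = sqrt(abar_t) x0 + sqrt(1 - abar_t) eps
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return np.mean((eps_pred_fn(x_t, t) - eps) ** 2)

# Usage with a dummy "network" that always predicts zero noise:
x0 = rng.standard_normal(8)
loss = l_simple(lambda x_t, t: np.zeros_like(x_t), x0)
```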
"Improved denoising diffusion probabilistic models" (TLDR: they learn $\Sigma_\theta(x_t,t)$)
Learning the variance

We want to fit $p_\theta(x_{t-1}\mid x_t)=\mathcal{N}(x_{t-1};\mu_\theta(x_t,t),\Sigma_\theta(x_t,t))$

The variational lower bound loss is derived as in VDM: $L_{VLB}(\theta)=-\log p_\theta(x_0\mid x_1)+\sum_{t>1}D_{KL}(q(x_{t-1}\mid x_t,x_0)\,\|\,p_\theta(x_{t-1}\mid x_t))$ (the prior-matching term $D_{KL}(q(x_T\mid x_0)\,\|\,p(x_T))$ is constant and dropped)

To fit $\mu_\theta$, one can simply reparametrize $\mu_\theta$ as a noise-prediction network $\epsilon_\theta$ and use $L_{simple}=\|\epsilon_\theta(x_t,t)-\epsilon_t\|^2$

However, to train $\Sigma_\theta$, one must use the full $L_{VLB}$

They use $L_{hybrid}=L_{simple}+\lambda L_{VLB}$ (with $\lambda=0.001$)

$\mu_\theta$ is only trained using $L_{simple}$, so they use a stop-gradient when computing $L_{VLB}$ (https://github.com/openai/improved-diffusion/blob/1bc7bbbdc414d83d4abf2ad8cc1446dc36c4e4d5/improved_diffusion/gaussian_diffusion.py#L679)
 They simply use mean_pred.detach() as the mean when computing the VLB.
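A toy sketch of what that stop-gradient achieves, with hypothetical stand-ins: `eps_param` for the noise/mean head, `v_param` for the variance head, and a Gaussian-NLL-style surrogate for the VLB term (not the repo's actual implementation):

```python
import torch

# Stand-ins: eps_param plays mu_theta (via eps_theta), v_param plays Sigma_theta.
eps_target = torch.tensor([0.5, -1.0, 0.25, 2.0])
eps_param = torch.zeros(4, requires_grad=True)
v_param = torch.zeros(4, requires_grad=True)

l_simple = ((eps_param - eps_target) ** 2).mean()

# VLB surrogate: depends on mean and variance, but the mean prediction is
# detached (mean_pred.detach() in the repo), so it contributes no gradient here.
l_vlb = (((eps_param.detach() - eps_target) ** 2) * torch.exp(-v_param) + v_param).mean()

lam = 1e-3
(l_simple + lam * l_vlb).backward()

# eps_param's gradient comes only from l_simple; v_param's only from l_vlb.
```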

 For sample quality, the first few steps of the diffusion don't really matter: they only affect very small details. However, for maximizing log-likelihood, the first few steps of the diffusion process matter the most, as they contribute the most to the variational lower bound (Fig. 2 of the paper)
 This is because the likelihood of a training sample $x_0$ with very little noise added must still be well calibrated and high for that sample, which the noise-matching loss alone does not really account for.

In "Improved denoising diffusion probabilistic models", they parameterize the variance as $\Sigma_\theta(x_t,t)=\exp(v\log\beta_t+(1-v)\log\tilde\beta_t)$

where $\beta_t$ is the variance schedule and $\tilde\beta_t$ is the variance of the posterior $q(x_{t-1}\mid x_t,x_0)$

and $v$ is a vector containing one component per dimension
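A sketch of this log-space interpolation for a scalar $v$ (the linear beta schedule is an illustrative assumption; note $\tilde\beta_0=0$, so the expression is only meaningful for $t\ge1$):

```python
import numpy as np

# Illustrative linear schedule; any beta schedule works here.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)
alpha_bar_prev = np.concatenate(([1.0], alpha_bar[:-1]))

# Posterior variance of q(x_{t-1} | x_t, x_0):
# beta_tilde_t = (1 - abar_{t-1}) / (1 - abar_t) * beta_t
beta_tilde = (1.0 - alpha_bar_prev) / (1.0 - alpha_bar) * betas

def sigma_theta(v, t):
    """Interpolate in log space between beta_t and beta_tilde_t;
    v is the per-dimension network output (scalar here for simplicity)."""
    return np.exp(v * np.log(betas[t]) + (1.0 - v) * np.log(beta_tilde[t]))

# v = 1 recovers beta_t, v = 0 recovers beta_tilde_t.
```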
Better noise schedule
 cosine schedule
 they use a small offset $s$ to prevent $\beta_t$ from being too small near $t=0$
 They selected $s=0.008$ such that $\sqrt{\beta_0}$ was slightly smaller than the pixel bin size $1/127.5$.
 Can be smaller for other modalities.
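The schedule can be sketched directly from the paper's formula, $\bar\alpha_t=f(t)/f(0)$ with $f(t)=\cos\!\left(\frac{t/T+s}{1+s}\cdot\frac{\pi}{2}\right)^2$:

```python
import numpy as np

def cosine_schedule(T, s=0.008):
    """Improved-DDPM cosine schedule: alpha_bar(t) = f(t) / f(0),
    f(t) = cos((t/T + s) / (1 + s) * pi/2) ** 2."""
    t = np.arange(T + 1)
    f = np.cos((t / T + s) / (1 + s) * np.pi / 2) ** 2
    alpha_bar = f / f[0]
    betas = 1.0 - alpha_bar[1:] / alpha_bar[:-1]
    return alpha_bar, np.clip(betas, 0.0, 0.999)  # the paper clips betas at 0.999

alpha_bar, betas = cosine_schedule(1000)
```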
Reducing gradient noise
 $L_{VLB}$ introduces a lot of gradient noise
 gradient noise = l2 norm of concatenated gradient
 Noting that different terms of $L_{VLB}$ have greatly different magnitudes (Figure 2 of the paper), they hypothesized that sampling $t$ uniformly causes unnecessary noise in the $L_{VLB}$ objective
 A simple importance sampling technique reduces this noise:
 $L_{VLB}=E_{t\sim p_t}\!\left[\frac{L_t}{p_t}\right]$ where $p_t\propto\sqrt{E[L_t^2]}$ and $\sum_t p_t=1$
 They found that the importance sampling technique was not helpful when optimizing the less noisy $L_{hybrid}$ objective directly.
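A sketch of the history-based sampler: the paper keeps the 10 most recent losses per timestep and samples uniformly until every $t$ has 10 of them (the small $T$ below is just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 100                            # small T for illustration
history = [[] for _ in range(T)]   # 10 most recent L_t values per timestep

def sample_t():
    """Sample a timestep; returns (t, p_t). Uniform during warm-up,
    then p_t proportional to sqrt(E[L_t^2]) estimated from the history."""
    if any(len(h) < 10 for h in history):
        return int(rng.integers(T)), 1.0 / T
    w = np.sqrt([np.mean(np.square(h)) for h in history])
    p = w / w.sum()
    t = int(rng.choice(T, p=p))
    return t, float(p[t])

def record(t, loss):
    history[t].append(loss)
    del history[t][:-10]           # keep only the 10 most recent values

# A sampled term contributes loss / p_t, keeping the estimate of
# sum_t L_t unbiased despite the non-uniform sampling.
```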
Würstchen (learned noise gating trick)
 We have $x_t=\sqrt{\bar\alpha_t}\,x_0+\sqrt{1-\bar\alpha_t}\,\epsilon_t$, where $\epsilon_t\sim\mathcal{N}(0,I)$
 To predict $\hat\epsilon$, they compute $\hat\epsilon=\frac{x_t-A}{|1-B|+10^{-5}}$
 with $A,B=f_\theta(x_t,t)$; $A$ and $B$ have the same dimensions as the noise $\epsilon$. The division is element-wise.
 It makes training more stable. They hypothesize this occurs because the model parameters are initialized to predict 0 at the beginning, which enlarges the error at timesteps with a lot of noise. By reformulating to the $A,B$ objective, the model initially returns its input, making the loss small for very noised inputs.
 noise prediction: https://github.com/dome272/Wuerstchen/blob/main/modules.py#L307
 diffusion implementation: https://github.com/pabloppp/pytorch-tools/blob/master/torchtools/utils/diffusion.py
 Additionally, they use P2-weighted noise matching
 $L=p_2(t)\,\|\epsilon-\hat\epsilon\|^2$ where $p_2(t)=\frac{1-\bar\alpha_t}{1+\bar\alpha_t}$
 making higher noise levels contribute more to the loss
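A sketch of the gating and the weighting as described above (shapes, the zero-initialized outputs, and the exact $p_2$ formula are taken from these notes, not from the Würstchen repo):

```python
import numpy as np

rng = np.random.default_rng(0)

def gated_eps(x_t, a, b):
    """Wurstchen-style noise reconstruction: eps_hat = (x_t - A) / (|1 - B| + 1e-5)."""
    return (x_t - a) / (np.abs(1.0 - b) + 1e-5)

def p2_weight(alpha_bar_t):
    # p2(t) = (1 - abar_t) / (1 + abar_t): larger at high noise (small abar_t).
    return (1.0 - alpha_bar_t) / (1.0 + alpha_bar_t)

# At init the network outputs A = B = 0, so eps_hat ~= x_t; at high noise
# levels x_t is mostly noise already, keeping the initial loss small.
x_t = rng.standard_normal(4)
eps_hat = gated_eps(x_t, np.zeros(4), np.zeros(4))
```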
Analyzing and Improving the Training Dynamics of Diffusion Models
Noisy training signal
 The training dynamics of diffusion models remain challenging due to the highly stochastic loss function.
 The final image quality is dictated by faint image details predicted throughout the sampling chain
 small mistakes at intermediate steps can have snowball effects in subsequent iterations
 The network must accurately estimate the average clean image across a vast range of noise levels, Gaussian noise realizations, and conditioning inputs.
 Learning to do so is difficult given the chaotic training signal that is randomized over all of these aspects