latent diffusion models of images, replacing the commonly-used U-Net backbone with a transformer that operates on latent patches.
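For concreteness, a minimal sketch of what "operates on latent patches" means, assuming a 4×32×32 VAE latent (as for 256×256 images) and a patch size of 2; the class name and sizes are illustrative, not the reference implementation:

```python
import torch
import torch.nn as nn

# Sketch of the patchify step: the VAE latent (e.g. 4x32x32) is split into
# p x p patches, each linearly embedded into a token for the transformer.
class PatchEmbed(nn.Module):
    def __init__(self, latent_size=32, patch_size=2, in_channels=4, hidden_size=1152):
        super().__init__()
        self.num_patches = (latent_size // patch_size) ** 2
        # A strided conv is equivalent to slicing patches + a linear projection.
        self.proj = nn.Conv2d(in_channels, hidden_size,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                    # x: (B, 4, 32, 32)
        x = self.proj(x)                     # (B, hidden, 16, 16)
        return x.flatten(2).transpose(1, 2)  # (B, 256, hidden) token sequence
```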
Prior work on ResNets has found that initializing each residual block as the identity function is beneficial. Diffusion U-Net models use a similar initialization strategy, zero-initializing the final convolutional layer in each block prior to any residual connections. We explore a modification of the adaLN DiT block which does the same: in addition to regressing γ and β, we also regress dimension-wise scaling parameters α that are applied immediately prior to any residual connections within the DiT block. The MLP is initialized to output the zero-vector for all α, which initializes the full DiT block as the identity function.
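A hedged sketch of such an adaLN-Zero block, assuming a standard pre-norm transformer block and a SiLU + linear conditioning MLP (layer sizes and attention internals are illustrative); the key part is the zero-initialized final linear layer, which makes all α start at zero:

```python
import torch
import torch.nn as nn

# Sketch of an adaLN-Zero DiT block. The conditioning vector c (timestep +
# class embedding) is mapped by an MLP to shift/scale parameters (beta, gamma)
# and to gates alpha applied just before each residual connection. The MLP's
# final layer is zero-initialized, so alpha starts at 0 and the whole block
# starts as the identity function.
class DiTBlockAdaLNZero(nn.Module):
    def __init__(self, hidden_size=1152, num_heads=16, mlp_ratio=4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(hidden_size, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(hidden_size, elementwise_affine=False)
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, int(hidden_size * mlp_ratio)),
            nn.GELU(),
            nn.Linear(int(hidden_size * mlp_ratio), hidden_size),
        )
        # Regress 6 vectors from c: (beta1, gamma1, alpha1, beta2, gamma2, alpha2).
        self.adaLN = nn.Sequential(nn.SiLU(), nn.Linear(hidden_size, 6 * hidden_size))
        nn.init.zeros_(self.adaLN[-1].weight)  # zero-init => alpha = 0 at start,
        nn.init.zeros_(self.adaLN[-1].bias)    # so the block is the identity

    def forward(self, x, c):
        b1, g1, a1, b2, g2, a2 = self.adaLN(c).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + g1.unsqueeze(1)) + b1.unsqueeze(1)
        x = x + a1.unsqueeze(1) * self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x) * (1 + g2.unsqueeze(1)) + b2.unsqueeze(1)
        x = x + a2.unsqueeze(1) * self.mlp(h)
        return x
```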
They also zero-init the final layer; note, however, that whether a zero output is a sensible starting point depends on what you predict at the end!
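A corresponding sketch of that final layer, under the assumption that each token is decoded back into a p×p latent patch (e.g. predicted noise plus covariance channels); names and sizes are illustrative:

```python
import torch.nn as nn

# Sketch of the output head: adaLN modulation followed by a linear projection
# to per-patch predictions. Both the modulation MLP and the final linear layer
# are zero-initialized here; as noted above, whether a zero prediction is a
# reasonable starting point depends on the prediction target.
class FinalLayer(nn.Module):
    def __init__(self, hidden_size=1152, patch_size=2, out_channels=8):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_size, elementwise_affine=False)
        self.adaLN = nn.Sequential(nn.SiLU(), nn.Linear(hidden_size, 2 * hidden_size))
        self.linear = nn.Linear(hidden_size, patch_size * patch_size * out_channels)
        for layer in (self.adaLN[-1], self.linear):
            nn.init.zeros_(layer.weight)
            nn.init.zeros_(layer.bias)

    def forward(self, x, c):
        beta, gamma = self.adaLN(c).chunk(2, dim=-1)
        x = self.norm(x) * (1 + gamma.unsqueeze(1)) + beta.unsqueeze(1)
        return self.linear(x)  # per-token patch prediction, un-patchified elsewhere
```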
Training
AdamW with a constant learning rate of 1e-4 and no weight decay; they maintain an exponential moving average (EMA) of the DiT weights over training with a decay of 0.9999.
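A minimal sketch of that optimization setup; `loss_fn` is a placeholder for the diffusion training objective, and the data handling is simplified:

```python
import copy
import torch

# AdamW at a constant lr of 1e-4 with no weight decay, plus an EMA copy of the
# weights updated after every step with decay 0.9999.
def train(model, data_loader, loss_fn, num_steps, device="cuda"):
    model.to(device)
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.0)

    ema = copy.deepcopy(model).eval()
    for p in ema.parameters():
        p.requires_grad_(False)

    for step, batch in zip(range(num_steps), data_loader):
        loss = loss_fn(model, batch)  # placeholder: returns the diffusion loss
        opt.zero_grad()
        loss.backward()
        opt.step()

        # EMA update: ema <- 0.9999 * ema + 0.0001 * model
        with torch.no_grad():
            for p_ema, p in zip(ema.parameters(), model.parameters()):
                p_ema.lerp_(p, 1 - 0.9999)

    return ema  # the EMA weights are typically the ones used for sampling
```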