• DiT trains latent diffusion models of images, replacing the commonly used U-Net backbone with a transformer that operates on latent patches.
  • Prior work on ResNets has found that initializing each residual block as the identity function is beneficial. Diffusion U-Net models use a similar initialization strategy, zero-initializing the final convolutional layer in each block prior to any residual connections. We explore a modification of the adaLN DiT block which does the same. In addition to regressing γ and β, we also regress dimension-wise scaling parameters α that are applied immediately prior to any residual connections within the DiT block. We initialize the MLP to output the zero-vector for all α; this initializes the full DiT block as the identity function.
  • https://github.com/facebookresearch/DiT/blob/main/models.py#L145 (init to zero)
  • They also zero-initialize the final layer, but whether that makes sense depends on what the model predicts at the output!
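
To make the adaLN-Zero idea above concrete, here is a minimal pure-Python sketch of one residual branch for a single token. It is not the code from the linked repository: the helper names (`layer_norm`, `linear`, `random_matrix`, `adaln_zero_branch`), the single linear layer standing in for attention or the MLP, and the `1 + γ` scaling convention are assumptions chosen for illustration. With the modulation layer zero-initialized, α is zero and the branch returns its input unchanged.

```python
import math
import random


def layer_norm(x, eps=1e-6):
    """LayerNorm over one token vector, without learned scale or shift."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear(weight, bias, x):
    """Dense layer: weight is a list of rows (out_dim x in_dim)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def random_matrix(rows, cols, rng, scale=0.1):
    """Small Gaussian weights, used for the stand-in sublayer."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def adaln_zero_branch(x, cond, mod_w, mod_b, sub_w, sub_b):
    """Compute x + alpha * f(LN(x) * (1 + gamma) + beta) for one token.

    gamma, beta and alpha are regressed from the conditioning vector `cond`
    by a single linear layer (mod_w, mod_b) with 3 * d outputs.
    """
    d = len(x)
    params = linear(mod_w, mod_b, cond)
    gamma, beta, alpha = params[:d], params[d:2 * d], params[2 * d:]
    h = [n * (1.0 + g) + b for n, g, b in zip(layer_norm(x), gamma, beta)]
    h = linear(sub_w, sub_b, h)  # hypothetical stand-in for attention / MLP
    # alpha gates the branch right before the residual connection.
    return [xi + a * hi for xi, a, hi in zip(x, alpha, h)]


rng = random.Random(0)
d = 4
x = [rng.gauss(0.0, 1.0) for _ in range(d)]
cond = [rng.gauss(0.0, 1.0) for _ in range(d)]
sub_w, sub_b = random_matrix(d, d, rng), [0.0] * d

# Zero-initialized modulation layer: gamma = beta = alpha = 0.
zero_w, zero_b = [[0.0] * d for _ in range(3 * d)], [0.0] * (3 * d)
out = adaln_zero_branch(x, cond, zero_w, zero_b, sub_w, sub_b)
print(out == x)  # the branch is the identity at initialization
```

Because α multiplies the sublayer output, the identity holds no matter how the stand-in sublayer is initialized; only the modulation layer needs to start at zero.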


  • AdamW with a constant learning rate of 1e-4 and no weight decay; they maintain an exponential moving average (EMA) of the DiT weights over training, with a decay of 0.9999
  • batch size of 256
  • T = 1000 (number of diffusion timesteps)
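
The EMA bookkeeping mentioned above can be sketched as follows. This is a framework-free illustration, assuming parameters stored as a plain dict of floats; `ema_update` and the parameter names are hypothetical, and real training code would apply the same rule to each weight tensor after every optimizer step.

```python
def ema_update(ema_params, model_params, decay=0.9999):
    """Update ema_params in place: ema <- decay * ema + (1 - decay) * param."""
    for name, value in model_params.items():
        ema_params[name] = decay * ema_params[name] + (1.0 - decay) * value


# Toy example with a single scalar "weight" (names are illustrative only).
model = {"w": 0.0}
ema = {"w": 1.0}
for _ in range(3):
    # ... an optimizer step would update `model` here ...
    ema_update(ema, model)
```

With a decay of 0.9999 the average has an effective horizon of roughly 1 / (1 - 0.9999) = 10,000 steps, so the EMA weights trail the raw weights by a wide margin early in training.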


  • 250 DDPM sampling steps

Dimensions for small models:

  • emb_dim = 384, num_heads = 6, n_layers = 12
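
As a quick sanity check on these dimensions, the sketch below wraps them in a small config and derives the per-head width. The class name `DiTSmallConfig`, the `mlp_ratio` of 4, and the rough weight count (attention, MLP, and a 6·d-output adaLN-Zero modulation layer per block, ignoring biases and embeddings) are assumptions for illustration, not values taken from these notes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiTSmallConfig:
    emb_dim: int = 384
    num_heads: int = 6
    n_layers: int = 12
    mlp_ratio: int = 4  # assumption: typical transformer MLP expansion

    def __post_init__(self):
        # Multi-head attention splits emb_dim evenly across heads.
        if self.emb_dim % self.num_heads != 0:
            raise ValueError("emb_dim must be divisible by num_heads")

    @property
    def head_dim(self) -> int:
        return self.emb_dim // self.num_heads

    def approx_block_weights(self) -> int:
        """Rough count of weight-matrix entries in the transformer blocks."""
        d = self.emb_dim
        attention = 4 * d * d             # Q, K, V and output projections
        mlp = 2 * self.mlp_ratio * d * d  # two dense layers
        modulation = 6 * d * d            # assumed adaLN-Zero layer: d -> 6d
        return self.n_layers * (attention + mlp + modulation)


cfg = DiTSmallConfig()
print(cfg.head_dim)
print(cfg.approx_block_weights())
```

The divisibility check is the main point: 384 splits evenly across 6 heads, so each head works on a 64-dimensional slice.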