• Related to VQ-VAE
  • For 224x224 images, 4M produces 14x14 tokens.
  • ViT VQ-VAE: segmentation, CLIP, DinoV2, ImageBind, SAM instances (non-natural images)
  • ViT VQ-VAE with diffusion decoder: RGB, normal, depth, edges (natural images)
  • MLP VQ-VAE: Dinov2, ImageBind global embeddings, 3D human poses (fixed-sized vectors)
  • Subword tokenizers: captions, metadata, bounding boxes

VQ-GAN

  • Keep the quantization-related losses
  • Replaces the reconstruction loss (pixel-wise L2) with a “perceptual loss”, i.e. an L2 distance in the feature space of an ImageNet-pretrained network
  • Introduces an adversarial training procedure with a patch-based discriminator that aims to differentiate between real and reconstructed images.
    • Can use StyleGAN Discriminator
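The loss structure above can be sketched as follows. This is a minimal illustration, not 4M's or VQ-GAN's actual implementation: `disc` and `perceptual_net` are placeholder callables, and the loss weighting coefficients are omitted.

```python
import torch
import torch.nn.functional as F

def vqgan_losses(x, x_rec, z_e, z_q, disc, perceptual_net, beta=0.25):
    """Sketch of the generator-side VQ-GAN objective (names are illustrative).

    Keeps the VQ-VAE quantization losses and swaps pixel-L2 reconstruction
    for a perceptual + adversarial term.
    """
    # Quantization-related losses: codebook loss + commitment loss.
    codebook_loss = F.mse_loss(z_q, z_e.detach())
    commit_loss = beta * F.mse_loss(z_e, z_q.detach())

    # Perceptual loss: L2 in the feature space of an ImageNet-pretrained
    # network, instead of raw pixel L2.
    perc_loss = F.mse_loss(perceptual_net(x_rec), perceptual_net(x))

    # Adversarial loss: the patch discriminator should judge x_rec as real.
    gen_loss = -disc(x_rec).mean()

    return codebook_loss + commit_loss + perc_loss + gen_loss
```

In practice the adversarial term is also weighted adaptively in VQ-GAN; that weighting is left out here for brevity.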

How 4M did it

Common practices

  • They follow guidelines from “Vector-quantized image modeling with improved VQGAN” and “SoundStream: An end-to-end neural audio codec”

Switching from CNNs to ViT

  • Replace the CNN encoder/decoder by a ViT.
    • Given sufficient data (for which unlabeled image data is plentiful), ViT VQ-VAE is less constrained by the inductive priors imposed by convolution

Low codebook usage fixes

  • Vanilla VQ-GANs usually suffer from low codebook usage due to the poor initialization of the codebook.
    • During training a significant portion of codes are rarely used, or dead.
    • Can also lead to joint VQ-VAE and diffusion training collapse.
  • There are three improvements that significantly encourage codebook usage, even with a larger codebook size of 8192:

Factorized codes / reducing the latent space size during lookup

  • Introduce a linear projection from the output of the encoder to a low-dimensional latent variable space for code index lookup (e.g., reduced from a 768-d vector to a 32-d or 8-d vector per code)
    • i.e. reduce the latent dimension when calculating the nearest neighbour in the codebook: the lookup becomes argmin_i ‖W z_e(x) − c_i‖, where W is the linear projection that reduces dimensionality and c_i are the codebook entries.
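The factorized lookup can be sketched as follows; dimensions and names are illustrative, with the codebook living directly in the reduced space.

```python
import torch

def factorized_lookup(z_e, codebook, proj):
    """Nearest-neighbour lookup in a reduced latent space (illustrative).

    z_e:      (n, 768) encoder outputs
    proj:     (768, d) linear projection to a low-dim lookup space
              (e.g. d = 32 or 8)
    codebook: (V, d)   codes live in the low-dimensional space
    """
    z_low = z_e @ proj                    # (n, d) projected latents
    dists = torch.cdist(z_low, codebook)  # (n, V) pairwise L2 distances
    return dists.argmin(dim=-1)           # (n,) index of the nearest code
```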

l2-normalized codes

  • Apply l2 normalization on the encoded latent variables z_e(x) and the codebook latent variables e. The codebook variables are initialized from a normal distribution.
    • This is an additional constraint that keeps the latent volume from expanding: all latents and codes lie on a unit hypersphere, and the l2 distance between them reduces to a cosine similarity.
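A minimal sketch of the normalized lookup (shapes and names are illustrative):

```python
import torch
import torch.nn.functional as F

def l2_normalized_lookup(z_e, codebook):
    """Lookup with l2-normalized latents and codes (illustrative).

    Both live on the unit hypersphere, so the squared distance
    ||z - c||^2 = 2 - 2 <z, c> is minimized by maximizing cosine similarity.
    """
    z = F.normalize(z_e, dim=-1)        # unit-norm encoder outputs
    c = F.normalize(codebook, dim=-1)   # unit-norm codebook entries
    return (z @ c.t()).argmax(dim=-1)   # nearest code = highest cosine sim
```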

Restarting stale codebook entries

  • Count the number of encoded vectors in a batch that map to each codebook entry after every iteration, tracked as an exponential moving average (EMA).
  • Replace (with vectors sampled randomly from the batch) any codes whose EMA count is less than a specified threshold τ.
    • This threshold depends on the total batch size B, the number of tokens per image N (14x14 = 196 for 224x224 images), and the codebook vocabulary size V.
    • Given these, the threshold is τ = c · B·N / V.
    • The coefficient c means that a codebook entry should appear with probability at least c/V, assuming we have mapped B·N encoded vectors.
    • 4M use a fixed coefficient c.
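The restart mechanism above can be sketched as follows. This is an illustrative version, not 4M's code: the coefficient `c`, the EMA decay, and the plain L2 lookup are all assumptions.

```python
import torch

def restart_stale_codes(codebook, z_batch, ema_counts, B, N, V, c=1.0, decay=0.99):
    """Sketch of stale-code restarts via EMA usage counts (illustrative).

    codebook:   (V, d) current codebook entries
    z_batch:    (B*N, d) encoded vectors from the current batch
    ema_counts: (V,) exponential moving average of per-code usage counts
    """
    # Count how many encoded vectors map to each codebook entry this iteration.
    idx = torch.cdist(z_batch, codebook).argmin(dim=-1)
    counts = torch.bincount(idx, minlength=V).float()
    ema_counts = decay * ema_counts + (1 - decay) * counts

    # Threshold: c times the expected count B*N/V under uniform code usage.
    threshold = c * B * N / V
    stale = ema_counts < threshold

    # Replace stale codes with randomly chosen encoded vectors from the batch.
    n_stale = int(stale.sum())
    if n_stale > 0:
        rand = torch.randint(0, z_batch.shape[0], (n_stale,))
        codebook = codebook.clone()
        codebook[stale] = z_batch[rand]
    return codebook, ema_counts
```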

ViT VQ-VAE with diffusion decoder (RGB, normal, depth, edges)

  • Used in 4M, inspired by DiVAE.
  • Keep the quantization-related losses
  • Replace the reconstruction loss with a diffusion loss (noise matching + VLB)
  • The decoder can be a small UNet with four down and up layers
    • For improved training and inference efficiency, they process images reshaped into non-overlapping patches, similar to Patched Diffusion Models.
      • Patched Diffusion Models propose to reshape an image of shape H x W x C into a grid of non-overlapping patches of shape H/p x W/p x (C·p²)
  • The diffusion decoder is conditioned on the 32-dimensional codebook entries of the 14x14 tokens by concatenating them with the noised input.
  • They predict the clean image x₀ instead of the noise ε, to avoid undesirable color shifts.
  • Diagram
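The patch reshaping used for efficiency can be sketched as follows; the patch size `p` and the channel-stacking layout are illustrative, since the exact shapes 4M uses are not spelled out in these notes.

```python
import torch

def patch_reshape(x, p=2):
    """Reshape a (C, H, W) image into a (C*p*p, H/p, W/p) grid of
    non-overlapping patches, as in Patched Diffusion Models (p illustrative).
    """
    C, H, W = x.shape
    x = x.reshape(C, H // p, p, W // p, p)       # split H and W into patch grids
    x = x.permute(0, 2, 4, 1, 3)                 # (C, p, p, H/p, W/p)
    return x.reshape(C * p * p, H // p, W // p)  # stack patch pixels into channels
```

This shrinks the spatial resolution the UNet has to process by p in each dimension while keeping every pixel, trading spatial extent for channels.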

Training details