They follow guidelines from “Vector-quantized image modeling with improved VQGAN” and “SoundStream: An end-to-end neural audio codec”
Switching from CNNs to ViT
Replace the CNN encoder/decoder with a ViT.
Given sufficient data (for which unlabeled image data is plentiful), a ViT VQ-VAE is less constrained by the inductive priors imposed by convolutions.
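To make the swap concrete, here is a minimal PyTorch sketch of a ViT-style encoder/decoder pair for tokenization. The class names and hyperparameters (patch size, dims, depth) are illustrative assumptions, not the paper's configuration; positional embeddings and the training losses are omitted for brevity.

```python
import torch.nn as nn
import torch.nn.functional as F

class ViTEncoder(nn.Module):
    """Non-overlapping patch embedding + transformer blocks replace the CNN downsampling stack."""
    def __init__(self, patch_size=8, dim=768, depth=4, num_heads=12):
        super().__init__()
        self.patchify = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        layer = nn.TransformerEncoderLayer(dim, num_heads, 4 * dim, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)

    def forward(self, x):                     # x: (B, 3, H, W)
        z = self.patchify(x)                  # (B, dim, H/p, W/p)
        z = z.flatten(2).transpose(1, 2)      # (B, N_tokens, dim), one latent per patch
        return self.blocks(z)

class ViTDecoder(nn.Module):
    """Transformer blocks + per-token pixel projection replace the CNN upsampling stack."""
    def __init__(self, patch_size=8, dim=768, depth=4, num_heads=12):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, num_heads, 4 * dim, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)
        self.to_pixels = nn.Linear(dim, 3 * patch_size * patch_size)
        self.patch_size = patch_size

    def forward(self, z_q, hw):               # z_q: (B, N_tokens, dim); hw: output (H, W)
        x = self.to_pixels(self.blocks(z_q))  # (B, N_tokens, 3*p*p)
        # Fold the per-token pixel patches back into an image grid.
        return F.fold(x.transpose(1, 2), output_size=hw,
                      kernel_size=self.patch_size, stride=self.patch_size)
```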
Low codebook usage fixes
Vanilla VQ-GANs usually suffer from low codebook usage due to the poor initialization of the codebook.
During training, a significant portion of the codes are rarely used, or dead.
This can also cause joint VQ-VAE and diffusion training to collapse.
Three improvements can significantly encourage codebook usage, even with a larger codebook size of 8192:
Factorized codes / reducing the latent space size during lookup
Introduce a linear projection from the output of the encoder to a low-dimensional latent variable space for code index lookup (e.g., reduced from a 768-d vector to a 32-d or 8-d vector per code).
i.e., reduce the latent dimension when computing the nearest neighbour in the codebook: $q(z=k \mid x) = \mathbb{1}\big[k = \arg\min_{j} \lVert P(z_{e}(x)) - P(e_{j}) \rVert_{2}\big]$, where $P$ is the linear projection that reduces dimensionality.
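A sketch of the factorized lookup implementing the formula above (PyTorch assumed; `FactorizedQuantizer`, the dims, and keeping the codes in encoder space are illustrative choices; the straight-through gradient estimator is omitted):

```python
import torch
import torch.nn as nn

class FactorizedQuantizer(nn.Module):
    def __init__(self, enc_dim=768, code_dim=32, n_vocab=8192):
        super().__init__()
        self.proj = nn.Linear(enc_dim, code_dim)        # the projector P
        self.codebook = nn.Embedding(n_vocab, enc_dim)  # codes e_j

    def forward(self, z_e):                  # z_e: (B, N_tokens, enc_dim)
        z = self.proj(z_e)                   # P(z_e(x)): (B, N_tokens, code_dim)
        e = self.proj(self.codebook.weight)  # P(e_j):    (n_vocab, code_dim)
        # ||P(z_e(x)) - P(e_j)||_2 against every code j; argmin picks index k.
        d = torch.cdist(z, e.unsqueeze(0))   # (B, N_tokens, n_vocab)
        idx = d.argmin(-1)                   # q(z = k | x)
        return self.codebook(idx), idx       # quantized latents + code indices
```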
$l_{2}$-normalized codes
Apply $l_{2}$ normalization to the encoded latent variables $z_{e}(x)$ and the codebook latent variables $e$. The codebook variables are initialized from a normal distribution.
This maps all latents onto a unit sphere, so the Euclidean lookup becomes a cosine-similarity lookup, and it keeps the volume of the embedding space from expanding.
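A sketch of the normalized lookup (PyTorch assumed; the function name and shapes are illustrative):

```python
import torch
import torch.nn.functional as F

def l2_normalized_lookup(z, codebook):
    # z: (B, N, d) projected encoder outputs; codebook: (V, d) code vectors,
    # initialized from a normal distribution, e.g. torch.randn(V, d).
    z = F.normalize(z, dim=-1)          # map latents onto the unit sphere
    e = F.normalize(codebook, dim=-1)   # map codes onto the unit sphere
    d = torch.cdist(z, e.unsqueeze(0))  # l2 distance, now monotone in cosine similarity
    return d.argmin(-1)                 # (B, N) code indices
```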
Restarting stale codebook entries
Count the number of encoded vectors in the batch that map to each codebook entry at every iteration.
Replace any code whose exponential moving average (EMA) count falls below a threshold $thresh_{replace}$ with an encoded vector sampled randomly from the batch.
This threshold depends on the total batch size $B$, the number of tokens per image $N_{tokens}$, and the codebook vocabulary size $N_{vocab}$:
Given a coefficient $c_{replace}$, then $thresh_{replace} = c_{replace} \, \frac{B \, N_{tokens}}{N_{vocab}}$.
The coefficient $c_{replace}$ means that a codebook entry should be matched at least $c_{replace}$ times on average per $N_{vocab}$ encoded vectors, i.e., selected with probability at least $\frac{c_{replace}}{N_{vocab}}$.
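A sketch of the restart rule under these definitions (PyTorch assumed; the function name, `decay`, and tensor layout are illustrative choices). As a worked example, with $B=256$, $N_{tokens}=1024$, $N_{vocab}=8192$, and $c_{replace}=1$, the threshold is $256 \cdot 1024 / 8192 = 32$, i.e. the EMA count a code must sustain per iteration.

```python
import torch

@torch.no_grad()  # restarts modify the raw codebook weights, not via gradients
def restart_stale_codes(codebook, idx, ema_counts, z_batch,
                        c_replace=1.0, decay=0.99):
    # codebook: (V, d) code vectors; idx: (M,) flat code indices used this batch;
    # ema_counts: (V,) running usage; z_batch: (M, d) encoded vectors, M = B * N_tokens.
    V = codebook.shape[0]
    M = idx.numel()
    counts = torch.bincount(idx, minlength=V).float()   # usage this iteration
    ema_counts.mul_(decay).add_(counts, alpha=1 - decay)
    # thresh_replace = c_replace * B * N_tokens / N_vocab
    thresh = c_replace * M / V
    stale = ema_counts < thresh                         # (V,) bool mask of dead codes
    n_stale = int(stale.sum())
    if n_stale > 0:
        # Re-seed each stale entry with a random encoded vector from the batch.
        rand = torch.randint(0, M, (n_stale,))
        codebook[stale] = z_batch[rand]
    return codebook, ema_counts
```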