VQ-VAE

VQ-VAE is a type of variational autoencoder that uses vector quantisation to obtain a discrete latent representation.
- It differs from VAEs in two key ways:
  - the encoder network outputs discrete, rather than continuous, codes
  - the prior is learnt rather than static
In order to learn a discrete latent representation, ideas from vector quantisation (VQ) are incorporated.
- Using the VQ method allows the model to circumvent posterior collapse, an issue typically observed in the VAE framework where the latents are ignored when they are paired with a powerful autoregressive decoder.

How it works

Define a latent embedding space $e \in \mathbb{R}^{K \times D}$,
- where $K$ is the size of the discrete latent space (i.e. $K$-way categorical, the size of the vocab)
- $D$ is the embedding size, i.e. the dimensionality of each embedding vector $e_{j}$
- Thus, $z$ is a $K$-way categorical variable
First-step Encoder $z_{e}$
- Takes an input $x$ and outputs a continuous embedding $z_{e} (x) \in \mathbb{R}^{D}$, i.e. a vector in the same $D$-dimensional space as each embedding $e_{j}$
Posterior categorical distribution $q (z \mid x)$
- Obtained by a nearest-neighbour lookup in the latent embedding space (the vocab): $q (z = k \mid x) = \mathbb{1} [k = \arg\min_{j} \| z_{e} (x) - e_{j} \|_{2}]$
- Each probability is therefore either 0 or 1, i.e. the posterior is one-hot
- It’s deterministic
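Prior $p (z)$
- If you choose a uniform prior over $z$, the KL divergence between the one-hot posterior and the prior is constant and equal to $\log K$
- Being constant with respect to the encoder parameters, the KL term can be ignored during training, so it cannot pull the posterior towards the prior: no posterior collapse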
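
The $\log K$ value can be checked directly. A short derivation, assuming the one-hot posterior from the nearest-neighbour lookup, a uniform prior $p (z = k) = 1 / K$, and the usual convention $0 \log 0 = 0$ (so only the selected code contributes to the sum):

$$
\mathrm{KL} \big( q (z \mid x) \,\|\, p (z) \big) = \sum_{k = 1}^{K} q (z = k \mid x) \log \frac{q (z = k \mid x)}{p (z = k)} = 1 \cdot \log \frac{1}{1 / K} = \log K
$$

The result does not depend on $x$ or on which code is selected, which is why the term is constant.
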
What is actually fed to the decoder is $z_{q} (x)$
- This is just a lookup in the embedding table, indexed by the discretised representation of $x$
- $z_{q} (x) = e_{k}$ where $k = \arg\min_{j} \| z_{e} (x) - e_{j} \|_{2}$
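
The lookup described above can be sketched in a few lines of NumPy. This is only a minimal illustration for a single encoder output vector: the function names `quantise` and `one_hot_posterior` and the toy codebook values are hypothetical, not taken from any reference implementation.

```python
import numpy as np

def quantise(z_e, codebook):
    """Nearest-neighbour lookup for one encoder output.

    z_e:      encoder output, shape (D,)
    codebook: embedding table e, shape (K, D)
    Returns the selected index k and the quantised vector z_q = e_k.
    """
    # L2 distance from z_e to every embedding e_j, shape (K,)
    distances = np.linalg.norm(codebook - z_e, axis=1)
    k = int(np.argmin(distances))
    return k, codebook[k]

def one_hot_posterior(k, K):
    """Deterministic posterior q(z | x): 1 at the selected index, 0 elsewhere."""
    q = np.zeros(K)
    q[k] = 1.0
    return q

# Toy codebook with K = 3 embeddings of size D = 2 (illustrative values)
codebook = np.array([[0.0, 0.0],
                     [1.0, 1.0],
                     [-1.0, 2.0]])
z_e = np.array([0.9, 1.2])  # pretend encoder output for some input x

k, z_q = quantise(z_e, codebook)
q = one_hot_posterior(k, len(codebook))
```

Here `z_e` is closest to the second embedding, so `k` is 1, `z_q` is `[1.0, 1.0]`, and `q` puts all its mass on that index.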

Because of the argmin in the nearest-neighbour lookup, no real gradient is defined for the quantisation step between the encoder and the decoder.
- The authors approximate the gradient in a way similar to the straight-through estimator: they just copy gradients from the decoder input $z_{q} (x)$ to the encoder output $z_{e} (x)$.
- Why?
  - During the forward computation the nearest embedding $z_{q} (x)$ is passed to the decoder, and during the backward pass the gradient $\nabla_{z} L$ is passed unaltered to the encoder.
  - Since the output representation of the encoder and the input to the decoder share the same $D$-dimensional space, the gradients contain useful information for how the encoder has to change its output to lower the reconstruction loss.
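
The gradient copy can be made concrete with a hand-written forward and backward pass. The sketch below makes several assumptions that are not part of the notes above: a linear decoder `W`, a squared-error reconstruction loss, and toy values throughout. Gradients are computed manually rather than with an autodiff library, and the function name `forward_backward` is hypothetical.

```python
import numpy as np

def forward_backward(z_e, codebook, W, x):
    """One forward/backward pass with a straight-through gradient copy.

    Assumes a linear decoder x_hat = W @ z_q and the loss
    0.5 * ||x_hat - x||^2 (both are illustrative choices).
    """
    # Forward: quantise the encoder output, then decode the quantised vector
    k = int(np.argmin(np.linalg.norm(codebook - z_e, axis=1)))
    z_q = codebook[k]
    x_hat = W @ z_q
    loss = 0.5 * np.sum((x_hat - x) ** 2)

    # Backward: exact gradient of the loss w.r.t. the decoder input z_q
    grad_z_q = W.T @ (x_hat - x)

    # Straight-through step: the argmin has no useful gradient, so the
    # gradient at z_q is copied unchanged to the encoder output z_e
    grad_z_e = grad_z_q.copy()
    return loss, grad_z_q, grad_z_e

codebook = np.array([[0.0, 0.0],
                     [1.0, 1.0],
                     [-1.0, 2.0]])   # K = 3, D = 2, toy values
W = np.array([[2.0, 0.0],
              [0.0, 1.0]])           # toy linear decoder
x = np.array([1.0, 3.0])             # toy reconstruction target
z_e = np.array([0.9, 1.2])           # pretend encoder output

loss, grad_z_q, grad_z_e = forward_backward(z_e, codebook, W, x)
```

In autodiff frameworks the same effect is commonly obtained by writing the decoder input as `z_e + stop_gradient(z_q - z_e)`: the forward value equals `z_q`, while the backward pass treats the expression as the identity in `z_e`.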