Image Tokenization

  • They train a new image tokenizer based on Gafni et al. (2022), which encodes a 512 × 512 image into 1024 discrete tokens from a codebook of size 8192.
  • Each image is split into 16 × 16 patches, yielding (512/16) × (512/16) = 32 × 32 = 1024 discrete tokens.
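
A minimal sketch of this encode path, assuming a hypothetical VQ-style setup (the `encoder` network and `codebook` tensor here are illustrative, not the paper's actual implementation): downsample the image 16× into a 32 × 32 grid of latents, then map each latent to its nearest entry in an 8192-entry codebook.

```python
import torch

def tokenize_image(image, encoder, codebook):
    """Quantize a 512x512 image into 1024 discrete tokens.

    image:    (3, 512, 512) float tensor
    encoder:  hypothetical CNN that downsamples 16x -> (1, d, 32, 32) latents
    codebook: (8192, d) learned embedding table
    """
    latents = encoder(image.unsqueeze(0))               # (1, d, 32, 32)
    d = latents.shape[1]
    flat = latents.permute(0, 2, 3, 1).reshape(-1, d)   # (1024, d), one row per patch
    # Nearest-neighbour lookup: one codebook index per 16x16 patch.
    dists = torch.cdist(flat, codebook)                 # (1024, 8192)
    tokens = dists.argmin(dim=-1)                       # (1024,) ints in [0, 8192)
    return tokens.reshape(32, 32)
```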

Tricks

  • Given the importance of generating human faces, they up-sample images containing faces by a factor of 2 during pre-training (see the sampling sketch after this list).
  • The paper acknowledges a core weakness: the tokenizer struggles to reconstruct images containing large amounts of text, which upper-bounds the models' performance on heavy OCR-related tasks.
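
One way to realize the 2× face up-sampling in a data pipeline is weighted sampling; a minimal sketch assuming a hypothetical per-image `has_face` flag (the paper does not describe the exact mechanism):

```python
from torch.utils.data import WeightedRandomSampler

def face_upsampling_weights(has_face):
    """Per-example sampling weights that draw face images ~2x as often.

    has_face: list[bool], one flag per training image (hypothetical metadata).
    """
    return [2.0 if f else 1.0 for f in has_face]

# Toy usage: face images get double the draw probability of non-face images.
has_face = [True, False, False, True]
weights = face_upsampling_weights(has_face)
sampler = WeightedRandomSampler(weights, num_samples=len(weights), replacement=True)
```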