Masking

  • Following ViT [16], we divide an image into regular non-overlapping patches. Then we sample a subset of patches and mask (i.e., remove) the remaining ones.
  • Reconstruction target: our MAE reconstructs the input by predicting the pixel values for each masked patch.
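
The sampling-and-masking step above can be sketched in a few lines. This is a minimal NumPy sketch, not the paper's implementation; the function name `random_masking` is illustrative, and the 75% mask ratio follows the paper's default setting:

```python
import numpy as np

def random_masking(patches, mask_ratio=0.75, rng=None):
    # patches: (num_patches, dim). Keep a random subset of patches
    # and mask (i.e., remove) the remaining ones.
    rng = np.random.default_rng(rng)
    n = patches.shape[0]
    n_keep = int(n * (1 - mask_ratio))
    perm = rng.permutation(n)           # random shuffle of patch indices
    keep_idx = np.sort(perm[:n_keep])   # indices of the visible patches
    mask = np.ones(n, dtype=bool)       # True = masked (removed)
    mask[keep_idx] = False
    return patches[keep_idx], keep_idx, mask

# Example: a 14x14 grid of patches (196 total) with 75% masked,
# leaving 49 visible patches.
patches = np.random.randn(196, 768)
visible, keep_idx, mask = random_masking(patches, mask_ratio=0.75, rng=0)
```

Returning `keep_idx` and `mask` matters later: the decoder needs them to scatter encoded tokens back to their original positions and to know which patches to predict.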

Encoder

  • Our encoder is a ViT [16] but applied only on visible, unmasked patches.
  • Masked patches are removed; no mask tokens are used.
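
A toy sketch of this design choice, assuming the same 196-patch / 75%-mask setup as above (the function `encode_visible` is a hypothetical stand-in for the ViT encoder, not the paper's code). Because self-attention cost grows with the square of sequence length, encoding only the visible 25% of tokens cuts the attention cost by roughly 16x:

```python
import numpy as np

def encode_visible(patches, keep_idx, pos_embed):
    # Hypothetical stand-in for the ViT encoder: it receives only the
    # visible patches (plus their positional embeddings) and never sees
    # mask tokens. A real encoder would apply transformer blocks here.
    x = patches[keep_idx] + pos_embed[keep_idx]   # (n_keep, dim)
    return x

num_patches, dim = 196, 768
patches = np.random.randn(num_patches, dim)
pos_embed = np.random.randn(num_patches, dim)
keep_idx = np.arange(49)              # e.g. only 25% of patches visible
encoded = encode_visible(patches, keep_idx, pos_embed)

# Self-attention cost scales with sequence length squared:
saving = num_patches**2 / len(keep_idx)**2   # (196/49)^2 = 16.0
```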

Decoder

  • The MAE decoder is only used during pre-training to perform the image reconstruction task (only the encoder is used to produce image representations for recognition).
  • Therefore, the decoder architecture can be designed flexibly and independently of the encoder (e.g., it can be narrower and shallower than the encoder, requiring less compute).

Input

  • The input to the MAE decoder is the full set of tokens, consisting of:

    1. encoded visible patches
    2. mask tokens
  • Each mask token is a shared, learned vector that indicates the presence of a missing patch to be predicted.

  • We add positional embeddings to all tokens in this full set; without this, mask tokens would have no information about their location in the image.
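
Putting the last three bullets together, the decoder input can be sketched as follows (a NumPy sketch with illustrative names; `build_decoder_input` is hypothetical, and the mask token is zero-initialized here, whereas in the actual model it is a learned vector):

```python
import numpy as np

def build_decoder_input(encoded_visible, keep_idx, num_patches,
                        mask_token, pos_embed):
    # Scatter the encoded visible tokens back to their original positions
    # and fill every masked position with the single shared mask token.
    tokens = np.tile(mask_token, (num_patches, 1))  # start all-masked
    tokens[keep_idx] = encoded_visible              # place visible tokens
    # Positional embeddings are added to ALL tokens; without them the
    # (identical) mask tokens would carry no location information.
    return tokens + pos_embed

num_patches, dim = 196, 512
mask_token = np.zeros((1, dim))      # shared; learned in the real model
pos_embed = np.random.randn(num_patches, dim)
enc = np.random.randn(49, dim)       # encoder output for visible patches
keep_idx = np.arange(49)
full = build_decoder_input(enc, keep_idx, num_patches, mask_token, pos_embed)
```

Note that every masked position receives the same shared vector, so after adding positional embeddings the masked tokens differ only by position, which is exactly the information the decoder needs to predict each missing patch.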