Masking
- Following ViT [16], we divide an image into regular non-overlapping patches. Then we sample a subset of patches and mask (i.e., remove) the remaining ones.
- Reconstruction target: the MAE reconstructs the input by predicting the pixel values of each masked patch.
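The masking step above can be sketched as follows. This is a minimal illustrative NumPy version, not the paper's implementation; the function name `random_masking` and its return convention are assumptions for this sketch.

```python
import numpy as np

def random_masking(patches, mask_ratio=0.75, rng=None):
    """Randomly keep a subset of patches and mask (i.e., remove) the rest.

    patches: (num_patches, patch_dim) array of flattened patch pixels.
    Returns the visible patches, their indices, and a binary mask
    (1 = masked/removed, 0 = visible).
    """
    rng = np.random.default_rng(rng)
    num_patches = patches.shape[0]
    num_keep = int(num_patches * (1 - mask_ratio))
    # Shuffle patch indices and keep the first `num_keep` as visible.
    perm = rng.permutation(num_patches)
    keep_idx = np.sort(perm[:num_keep])
    mask = np.ones(num_patches, dtype=np.int64)
    mask[keep_idx] = 0
    return patches[keep_idx], keep_idx, mask
```

With the default 75% mask ratio, a 16-patch image yields 4 visible patches; only these are passed on to the encoder.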
Encoder
- Our encoder is a ViT [16] but applied only on visible, unmasked patches.
- Masked patches are removed; no mask tokens are used.
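A sketch of the encoder-side gather, assuming the masking convention above (indices of visible patches). The Transformer blocks are abstracted behind a stand-in `encoder_fn`; the function name and signature are hypothetical.

```python
import numpy as np

def encode_visible(patch_embeds, keep_idx, pos_embed, encoder_fn):
    """Run the encoder on visible patches only; no mask tokens here.

    patch_embeds: (num_patches, dim) embedded patches.
    keep_idx: indices of the visible (unmasked) patches.
    pos_embed: (num_patches, dim) positional embeddings.
    encoder_fn: stand-in for the ViT encoder blocks (shape-preserving).
    """
    # Gather only the visible patches, then add their positions.
    visible = patch_embeds[keep_idx] + pos_embed[keep_idx]
    return encoder_fn(visible)  # (num_visible, dim)
```

Because the encoder sees only ~25% of the patches, its compute and memory scale with the visible subset rather than the full image.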
Decoder
- The MAE decoder is only used during pre-training to perform the image reconstruction task (only the encoder is used to produce image representations for recognition).
- Therefore, the decoder architecture can be flexibly designed, independent of the encoder design (e.g., it can be narrower and shallower, and thus cheaper to compute).
Input
- The input to the MAE decoder is the full set of tokens, consisting of:
- encoded visible patches
- mask tokens
- Diagram
- Each mask token is a shared, learned vector that indicates the presence of a missing patch to be predicted.
- We add positional embeddings to all tokens in this full set; without this, mask tokens would have no information about their location in the image.