Masking
- Following ViT [16], we divide an image into regular non-overlapping patches. Then we sample a subset of patches and mask (i.e., remove) the remaining ones.
- Reconstruction target: the MAE reconstructs the input by predicting the pixel values of each masked patch.
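The masking step above can be sketched as follows. This is a minimal illustrative NumPy version, not the paper's implementation; the function name `random_masking` and its return convention are assumptions for this sketch.

```python
import numpy as np

def random_masking(patches, mask_ratio=0.75, rng=None):
    """Randomly keep a subset of patches and mask (i.e., remove) the rest.

    patches: (num_patches, patch_dim) array of flattened patch pixels.
    Returns the visible patches, their indices, and a binary mask
    (1 = masked/removed, 0 = visible).
    """
    rng = np.random.default_rng(rng)
    num_patches = patches.shape[0]
    num_keep = int(num_patches * (1 - mask_ratio))
    # Shuffle patch indices and keep the first `num_keep` as visible.
    perm = rng.permutation(num_patches)
    keep_idx = np.sort(perm[:num_keep])
    mask = np.ones(num_patches, dtype=np.int64)
    mask[keep_idx] = 0
    return patches[keep_idx], keep_idx, mask
```

With the default 75% mask ratio, a 16-patch image yields 4 visible patches; only these are passed on to the encoder.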
Encoder
- Our encoder is a ViT [16] but applied only on visible, unmasked patches.
- Masked patches are removed; no mask tokens are used.
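A sketch of the encoder-side gather, assuming the masking convention above (indices of visible patches). The Transformer blocks are abstracted behind a stand-in `encoder_fn`; the function name and signature are hypothetical.

```python
import numpy as np

def encode_visible(patch_embeds, keep_idx, pos_embed, encoder_fn):
    """Run the encoder on visible patches only; no mask tokens here.

    patch_embeds: (num_patches, dim) embedded patches.
    keep_idx: indices of the visible (unmasked) patches.
    pos_embed: (num_patches, dim) positional embeddings.
    encoder_fn: stand-in for the ViT encoder blocks (shape-preserving).
    """
    # Gather only the visible patches, then add their positions.
    visible = patch_embeds[keep_idx] + pos_embed[keep_idx]
    return encoder_fn(visible)  # (num_visible, dim)
```

Because the encoder sees only ~25% of the patches, its compute and memory scale with the visible subset rather than the full image.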
Decoder
- The MAE decoder is only used during pre-training to perform the image reconstruction task (only the encoder is used to produce image representations for recognition).
- Therefore, the decoder architecture can be flexibly designed, independent of the encoder design (e.g., it can be narrower and shallower, and thus cheaper to compute).
Input
- The input to the MAE decoder is the full set of tokens, consisting of:
- encoded visible patches
- mask tokens
- Diagram
- Each mask token is a shared, learned vector that indicates the presence of a missing patch to be predicted.
- We add positional embeddings to all tokens in this full set; without this, mask tokens would have no information about their location in the image.