ViT (Vision Transformers)

A ViT splits an image into fixed-size patches,
linearly embeds each of them,
adds position embeddings,
and feeds the resulting sequence of vectors to a standard Transformer encoder.
To perform classification, it uses the standard approach of adding an extra learnable “classification token” to the sequence.
Diagram

Patchify + getting token sequence

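To do the patchify + linear projection in one step, you can define a convolution whose kernel size and stride both equal the patch size:
patch_embedding = nn.Conv2d(in_channels=config.num_channels, out_channels=self.embed_dim, kernel_size=self.patch_size, stride=self.patch_size, bias=False)
Then
- patch_embeds = patch_embedding(pixel_values).flatten(2).transpose(1, 2)
The convolution output has shape (batch, embed_dim, H/P, W/P). flatten(2) merges the spatial grid into one axis, and transpose(1, 2) moves that axis forward, giving (batch, num_patches, embed_dim). The sequence length (H/P)·(W/P) therefore depends on the input resolution.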
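A strided convolution with kernel size equal to stride is equivalent to cutting non-overlapping patches and applying one shared linear layer to each. The NumPy sketch below illustrates that view and shows how the token count changes with resolution. The function names, the random weight matrix, and the sizes are assumptions made for illustration, not code from any particular library.

```python
import numpy as np

def patchify(image, patch_size):
    """Cut a (C, H, W) image into non-overlapping patches of shape (N, C*P*P)."""
    c, h, w = image.shape
    p = patch_size
    assert h % p == 0 and w % p == 0, "H and W must be multiples of patch_size"
    gh, gw = h // p, w // p
    # (C, gh, P, gw, P) -> (gh, gw, C, P, P) -> (N, C*P*P), patches in row-major order
    patches = image.reshape(c, gh, p, gw, p).transpose(1, 3, 0, 2, 4)
    return patches.reshape(gh * gw, c * p * p)

def embed_patches(image, weight, patch_size):
    """Apply the same linear projection to every patch: (N, C*P*P) @ (C*P*P, D)."""
    return patchify(image, patch_size) @ weight

rng = np.random.default_rng(0)
patch_size, embed_dim, channels = 16, 8, 3   # illustrative sizes
weight = rng.normal(size=(channels * patch_size * patch_size, embed_dim))

tokens_square = embed_patches(rng.normal(size=(channels, 224, 224)), weight, patch_size)
tokens_wide = embed_patches(rng.normal(size=(channels, 128, 256)), weight, patch_size)
print(tokens_square.shape, tokens_wide.shape)
```

The square 224×224 image yields a 14×14 grid of tokens, while the 128×256 image yields an 8×16 grid, so the two sequences have different lengths even though the projection weights are shared.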

Patch n’ Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution

NaViT is inspired by example packing in NLP, where multiple examples are packed into a single sequence to train efficiently on variable-length inputs.
Patches from multiple different images are packed into a single sequence, an approach termed Patch n’ Pack, which enables variable resolution while preserving the aspect ratio.
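The packing idea can be sketched in plain Python. The example below uses a simple first-fit greedy strategy and builds a per-token image id, from which a mask is derived so that tokens only attend within their own image. The function names, the padding id of -1, and the choice of first-fit are assumptions made for this sketch and may differ from the paper's exact procedure.

```python
def greedy_pack(token_counts, max_seq_len):
    """First-fit packing: put each image in the first sequence with enough room.

    Returns a list of packed sequences, each a list of image indices.
    """
    sequences = []  # image indices per packed sequence
    free = []       # remaining token budget per packed sequence
    for image_idx, n_tokens in enumerate(token_counts):
        if n_tokens > max_seq_len:
            raise ValueError(f"image {image_idx} has {n_tokens} tokens > {max_seq_len}")
        for seq_idx, room in enumerate(free):
            if n_tokens <= room:
                sequences[seq_idx].append(image_idx)
                free[seq_idx] -= n_tokens
                break
        else:  # no existing sequence had room: open a new one
            sequences.append([image_idx])
            free.append(max_seq_len - n_tokens)
    return sequences

def token_image_ids(sequence, token_counts, max_seq_len, pad_id=-1):
    """Per-token image id for one packed sequence, padded to max_seq_len."""
    ids = []
    for image_idx in sequence:
        ids.extend([image_idx] * token_counts[image_idx])
    return ids + [pad_id] * (max_seq_len - len(ids))

def attention_mask(ids, pad_id=-1):
    """mask[i][j] is True when tokens i and j come from the same non-padding image."""
    return [[a == b and a != pad_id for b in ids] for a in ids]

# Patch grids (rows, cols) for four images with different aspect ratios.
grids = [(2, 3), (1, 2), (2, 2), (1, 3)]
token_counts = [rows * cols for rows, cols in grids]
packed = greedy_pack(token_counts, max_seq_len=8)
ids = token_image_ids(packed[1], token_counts, max_seq_len=8)
mask = attention_mask(ids)
print(packed, ids)
```

Each image keeps its own patch grid, so its aspect ratio is preserved. Only the grouping into fixed-length sequences and the mask are shared across the batch.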
Diagram

Positional embeddings

To handle arbitrary resolutions and aspect ratios, we need to revisit the position embeddings.
Vanilla ViT: given square images of resolution R×R, a vanilla ViT with patch size P learns 1-D positional embeddings of length $(R / P)^{2}$. To train or evaluate at a higher resolution, these embeddings must be linearly interpolated.
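As a rough illustration of that interpolation step, the sketch below reshapes the learned 1-D table back into its (R/P)×(R/P) grid and resamples it with separable linear interpolation. The function name and the corner-aligned sampling convention are assumptions. Real implementations often rely on a library resize routine and may use a different alignment or interpolation mode.

```python
import numpy as np

def resize_pos_embed(pos_embed, old_grid, new_grid):
    """Linearly interpolate (old_grid**2, D) position embeddings to (new_grid**2, D)."""
    d = pos_embed.shape[1]
    grid = pos_embed.reshape(old_grid, old_grid, d)
    # Sample positions of the new grid, expressed in old-grid coordinates
    # (corner-aligned: first and last positions map onto each other).
    src = np.linspace(0, old_grid - 1, new_grid)
    lo = np.floor(src).astype(int)
    hi = np.minimum(lo + 1, old_grid - 1)
    frac = src - lo
    # Interpolate along rows, then along columns.
    rows = grid[lo] * (1 - frac)[:, None, None] + grid[hi] * frac[:, None, None]
    out = rows[:, lo] * (1 - frac)[None, :, None] + rows[:, hi] * frac[None, :, None]
    return out.reshape(new_grid * new_grid, d)

# R = 224 with P = 16 gives a 14x14 grid; R = 384 with P = 16 gives a 24x24 grid.
old = np.random.default_rng(0).normal(size=(14 * 14, 4))
new = resize_pos_embed(old, old_grid=14, new_grid=24)
print(old.shape, new.shape)
```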
Pix2struct introduces learned 2D absolute positional embeddings, whereby positional embeddings of size $[\text{maxLen}, \text{maxLen}]$ are learned, and indexed with (x, y) coordinates of each patch. This enables variable aspect ratios, with resolutions of up to $R = P \cdot \text{maxLen}$. However, every combination of (x, y) coordinates must be seen during training.
Factorized & fractional positional embeddings. (NaViT)
- To support variable aspect ratios and readily extrapolate to unseen resolutions, they introduce factorized positional embeddings, which decompose the position into separate embeddings $\phi_{x}$ and $\phi_{y}$ of the x and y coordinates.
- These are then summed together (alternative combination strategies are explored in Section 3.4 of the paper).
- They consider two schemas:
  - absolute embeddings, where $\phi(p) : [0, \text{maxLen}] \to \mathbb{R}^{D}$ is a function of the absolute patch index
  - fractional embeddings, where $\phi(r) : [0, 1] \to \mathbb{R}^{D}$ is a function of r = p/side-length, that is, the relative distance along the image
    - This provides positional embedding parameters independent of the image size, but partially obfuscates the original aspect ratio, which is then only implicit in the number of patches
  - For such functions, they consider simple learned embeddings $\phi$, sinusoidal embeddings, and the learned Fourier positional embedding used by NeRF
- Factorized position embeddings improve generalization to new resolutions and aspect ratios
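The two schemas can be contrasted with a small NumPy sketch. The absolute variant looks up two tables by patch index and sums them. The fractional variant maps r = p/side-length through fixed Fourier features followed by separate projections for x and y. All function names, the frequency schedule, and the random tables standing in for learned parameters are assumptions made for illustration only.

```python
import numpy as np

def patch_coords(grid_h, grid_w):
    """(x, y) integer indices for every patch, in row-major order."""
    ys, xs = np.meshgrid(np.arange(grid_h), np.arange(grid_w), indexing="ij")
    return xs.reshape(-1), ys.reshape(-1)

def factorized_absolute(grid_h, grid_w, table_x, table_y):
    """phi_x(x) + phi_y(y), with tables of shape (maxLen, D) indexed by patch index."""
    xs, ys = patch_coords(grid_h, grid_w)
    return table_x[xs] + table_y[ys]

def fourier_features(r, num_freqs):
    """Map r of shape (N,) to (N, 2*num_freqs) sin/cos features (assumed schedule)."""
    freqs = 2.0 ** np.arange(num_freqs)
    angles = 2 * np.pi * r[:, None] * freqs[None, :]
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)

def factorized_fractional(grid_h, grid_w, proj_x, proj_y):
    """phi_x(r_x) + phi_y(r_y), where r = p / side-length is the relative position."""
    xs, ys = patch_coords(grid_h, grid_w)
    num_freqs = proj_x.shape[0] // 2
    feat_x = fourier_features(xs / grid_w, num_freqs)
    feat_y = fourier_features(ys / grid_h, num_freqs)
    return feat_x @ proj_x + feat_y @ proj_y

rng = np.random.default_rng(0)
max_len, dim, num_freqs = 8, 4, 3   # illustrative sizes
table_x, table_y = rng.normal(size=(2, max_len, dim))
proj_x, proj_y = rng.normal(size=(2, 2 * num_freqs, dim))

abs_emb = factorized_absolute(2, 3, table_x, table_y)        # 2x3 patch grid
frac_small = factorized_fractional(2, 3, proj_x, proj_y)     # 2x3 patch grid
frac_large = factorized_fractional(4, 6, proj_x, proj_y)     # same aspect ratio, 2x resolution
print(abs_emb.shape, frac_small.shape, frac_large.shape)
```

In the fractional variant, the patch at (x=1, y=1) of the 2×3 grid and the patch at (x=2, y=2) of the 4×6 grid sit at the same relative position, so they receive the same embedding. This is the sense in which the parameters are independent of image size.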
