• They split an image into fixed-size patches,
  • linearly embed each of them,
  • add position embeddings,
  • feed the resulting sequence of vectors to a standard Transformer encoder.
  • In order to perform classification, they use the standard approach of adding an extra learnable “classification token” to the sequence.
  • Diagram
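The steps above can be sketched end-to-end as a minimal PyTorch module. This is a hedged illustration, not the paper's reference code: all sizes (`embed_dim=64`, `num_heads=4`, two encoder layers) are arbitrary small values chosen for the sketch.

```python
import torch
import torch.nn as nn

# Minimal sketch of the pipeline above: patchify -> linear embed ->
# add position embeddings -> Transformer encoder -> classify from [CLS].
# All sizes here are illustrative, not the paper's configuration.
class TinyViT(nn.Module):
    def __init__(self, image_size=224, patch_size=16, num_channels=3,
                 embed_dim=64, num_heads=4, num_classes=10):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # Patchify + linear projection in one conv (kernel = stride = patch size)
        self.patch_embedding = nn.Conv2d(num_channels, embed_dim,
                                         kernel_size=patch_size,
                                         stride=patch_size, bias=False)
        # Extra learnable classification token
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        # One learnable position embedding per patch, plus one for [CLS]
        self.pos_embedding = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, pixel_values):
        # (B, C, H, W) -> (B, D, H/P, W/P) -> (B, N, D)
        x = self.patch_embedding(pixel_values).flatten(2).transpose(1, 2)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embedding
        x = self.encoder(x)
        return self.head(x[:, 0])  # classify from the [CLS] token

logits = TinyViT()(torch.randn(2, 3, 224, 224))
```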

Patchify + getting the token sequence

  • To do the patchify + linear projection, you can define patch_embedding = nn.Conv2d(in_channels=config.num_channels, out_channels=self.embed_dim, kernel_size=self.patch_size, stride=self.patch_size, bias=False)
  • Then
    • patch_embeds = patch_embedding(pixel_values).flatten(2).transpose(1, 2)
  • The conv output has shape (batch, embed_dim, H/P, W/P); flatten(2) merges the patch grid and transpose(1, 2) puts the sequence dimension first, giving (batch, num_patches, embed_dim). This way, you get a token sequence whose length depends on the input resolution.
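To make the resolution dependence concrete, here is a small sketch (sizes are illustrative) applying the same patch embedding to two input resolutions and checking the resulting sequence lengths:

```python
import torch
import torch.nn as nn

# Same patch embedding applied at two resolutions yields different
# sequence lengths (illustrative sizes: patch_size=16, embed_dim=64).
patch_size, embed_dim = 16, 64
patch_embedding = nn.Conv2d(in_channels=3, out_channels=embed_dim,
                            kernel_size=patch_size, stride=patch_size,
                            bias=False)

seq_lens = {}
for h, w in [(224, 224), (160, 320)]:
    pixel_values = torch.randn(1, 3, h, w)
    # flatten(2) merges the patch grid; transpose gives (batch, num_patches, dim)
    patch_embeds = patch_embedding(pixel_values).flatten(2).transpose(1, 2)
    seq_lens[(h, w)] = patch_embeds.shape[1]  # num_patches = (h/16) * (w/16)

print(seq_lens)  # {(224, 224): 196, (160, 320): 200}
```

A 224x224 image yields 14 x 14 = 196 patch tokens, while a 160x320 image yields 10 x 20 = 200, so the sequence length tracks the input resolution.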

Patch n’ Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution

  • Inspired by example packing in NLP, where multiple examples are packed into a single sequence to allow efficient training on variable-length inputs.
  • Patches from multiple images are packed into a single sequence, termed Patch n’ Pack, which enables variable resolution while preserving each image’s aspect ratio.
  • Diagram
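A rough sketch of the packing idea, under assumptions of mine rather than NaViT's actual code (the names image_ids and attn_mask are hypothetical, and real packing also handles padding to a fixed packed length): patch tokens from differently shaped images are concatenated into one sequence, and a block-diagonal mask records which image each token came from so attention can be restricted per image.

```python
import torch
import torch.nn as nn

# Hedged sketch of Patch n' Pack: patch tokens from images of different
# resolutions and aspect ratios are concatenated into one sequence; a
# per-token image id lets us build a mask that keeps attention within
# each image. Illustrative sizes and names, not NaViT's implementation.
patch_size, embed_dim = 16, 64
patch_embedding = nn.Conv2d(3, embed_dim, kernel_size=patch_size,
                            stride=patch_size, bias=False)

images = [torch.randn(3, 64, 96), torch.randn(3, 32, 160)]  # any aspect ratio
tokens, image_ids = [], []
for i, img in enumerate(images):
    # (1, C, H, W) -> (num_patches, embed_dim), aspect ratio preserved
    t = patch_embedding(img.unsqueeze(0)).flatten(2).transpose(1, 2)[0]
    tokens.append(t)
    image_ids.append(torch.full((t.size(0),), i))

packed = torch.cat(tokens)   # (total_patches, embed_dim): 24 + 20 = 44 tokens
ids = torch.cat(image_ids)   # which image each token belongs to
# Block-diagonal attention mask: True where two tokens share an image
attn_mask = ids[:, None] == ids[None, :]
```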