They use packed sequences in SigLipV2 (similar to NaViT), so they use variable-length (varlen) flash attention.
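A minimal sketch of the packed-sequence path, assuming the flash-attn library's flash_attn_varlen_func; the patch counts, dtypes, and tensor names here are illustrative, not the exact implementation.

```python
# Sketch: pack patch tokens from images of different sizes into one flat
# sequence (no padding) and attend with varlen flash attention.
import torch
from flash_attn import flash_attn_varlen_func

num_heads, head_dim = 12, 64
# e.g. three images with different numbers of patches
seq_lens = torch.tensor([256, 1024, 576], dtype=torch.int32, device="cuda")
total = int(seq_lens.sum())

# packed q/k/v: (total_tokens, num_heads, head_dim)
q = torch.randn(total, num_heads, head_dim, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# cumulative sequence lengths mark image boundaries: [0, 256, 1280, 1856]
cu_seqlens = torch.nn.functional.pad(seq_lens.cumsum(0, dtype=torch.int32), (1, 0))
max_seqlen = int(seq_lens.max())

# each image attends only to its own tokens, despite being packed together
out = flash_attn_varlen_func(
    q, k, v,
    cu_seqlens_q=cu_seqlens, cu_seqlens_k=cu_seqlens,
    max_seqlen_q=max_seqlen, max_seqlen_k=max_seqlen,
)
print(out.shape)  # (total_tokens, num_heads, head_dim)
```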
The positional embeddings are a fixed 2D grid of shape (position_embedding_size, position_embedding_size), where each grid point holds an embed_dim vector; position_embedding_size = 16 in the config.
Then, for a given image with a patch grid of shape (H_patch, W_patch), they interpolate the 2D grid of positional embeddings to that shape.
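A minimal sketch of that resizing, assuming the learned embeddings are stored as a flat (position_embedding_size², embed_dim) table; the helper name and the bilinear mode are assumptions.

```python
# Sketch: resize the fixed 16x16 positional-embedding grid to an image's
# patch grid via bilinear interpolation.
import torch
import torch.nn.functional as F

embed_dim, position_embedding_size = 768, 16
pos_emb = torch.randn(position_embedding_size**2, embed_dim)  # learned table (256, 768)

def resize_pos_emb(pos_emb, h_patch, w_patch):
    side = int(pos_emb.shape[0] ** 0.5)
    # (side*side, D) -> (1, D, side, side) so interpolate treats D as channels
    grid = pos_emb.reshape(side, side, -1).permute(2, 0, 1).unsqueeze(0)
    grid = F.interpolate(grid, size=(h_patch, w_patch), mode="bilinear", align_corners=False)
    # back to one embedding per patch: (h_patch * w_patch, D)
    return grid.squeeze(0).permute(1, 2, 0).reshape(h_patch * w_patch, -1)

print(resize_pos_emb(pos_emb, 24, 18).shape)  # torch.Size([432, 768])
```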
They use pixel_shuffle_scale_factor=2 (i.e., r = 2) to reduce the number of image tokens by a factor of 4.
Why? We’d like the LLM to see fewer vision tokens (to reduce TTT) without throwing away information that the vision encoder already extracted. Conventional fixes (average/max pooling, strided convolution, or token dropping) discard information up front.
It lets the vision encoder operate at high resolution (so attention/mixing happens at fine scale).
After the vision encoder, a pixel-shuffle step (a.k.a. space-to-depth over tokens) merges each local r×r neighborhood into one token by moving information from space into channels; see the sketch after the list below.
Net effect:
Sequence length for vision tokens shrinks by a factor of r².
Per-token channel dimensionality grows by a factor of r².
No information is intrinsically lost at the shuffle itself (it’s a permutation/rearrangement). Any compression is learned in the projector that follows.
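A minimal sketch of the space-to-depth rearrangement on a token grid with r = 2, assuming row-major token order; the function name is hypothetical and the exact channel ordering may differ from the real implementation, but the shape arithmetic is the point.

```python
# Sketch: pixel shuffle (space-to-depth) over vision tokens.
import torch

def pixel_shuffle_tokens(tokens, h_patch, w_patch, r=2):
    # tokens: (B, h_patch * w_patch, D) -> (B, h_patch * w_patch // r**2, D * r**2)
    b, n, d = tokens.shape
    assert n == h_patch * w_patch and h_patch % r == 0 and w_patch % r == 0
    x = tokens.reshape(b, h_patch // r, r, w_patch // r, r, d)
    # group each r x r spatial neighborhood next to the channel dim
    x = x.permute(0, 1, 3, 2, 4, 5)          # (B, H/r, W/r, r, r, D)
    return x.reshape(b, (h_patch // r) * (w_patch // r), r * r * d)

tokens = torch.randn(1, 32 * 32, 768)        # 1024 tokens from a 32x32 patch grid
out = pixel_shuffle_tokens(tokens, 32, 32)   # pure rearrangement, no info lost
print(out.shape)                             # torch.Size([1, 256, 3072]): 4x fewer tokens, 4x wider
```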