High-level

(Text with <image> tokens, PIL images)
        │
        ▼
   IsaacProcessor
   ├─ tokenize text                      (AutoTokenizer)
   ├─ resize/normalize/patchify images   (vision params)
   └─ build TensorStream of Events       (text + vision)
        │
        ▼
    IsaacModel (Qwen3Model subclass)
    ├─ embed_stream()
    │  ├─ Text: embedding matrix
    │  └─ Vision: Siglip2SequenceVisionTransformer
    │        ├─ Variable-Sequence patch embeddings + 2D pos
    │        ├─ Variable-Length self-attn (FlashAttention)
    │        ├─ Post-LN + (optional) PixelShuffle (varlen)
    │        └─ MLP projector → language hidden size
    ├─ Build modality mask + 3D MRoPE (cos/sin)
    └─ Qwen-3 decoder layers (causal, position-aware)
        │
        ▼
 IsaacForConditionalGeneration
 └─ lm_head → logits (and loss when labels are provided)

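For orientation, here is a hypothetical end-to-end call. It assumes the model ships with an HF-style processor and generate() interface; the checkpoint id, the processor keyword arguments, and the auto-class mapping are placeholders/assumptions, not verified against the actual Isaac API.

import torch
from transformers import AutoModelForCausalLM, AutoProcessor
from PIL import Image

# Placeholder checkpoint id; trust_remote_code would pull in IsaacProcessor / IsaacForConditionalGeneration.
ckpt = "<isaac-checkpoint>"
processor = AutoProcessor.from_pretrained(ckpt, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(ckpt, trust_remote_code=True, torch_dtype=torch.bfloat16)

image = Image.open("example.jpg")
# Assumes the usual HF processor convention of text= / images= keyword arguments.
inputs = processor(text="Describe this <image> briefly.", images=[image], return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(out, skip_special_tokens=True)[0])
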
  • The SigLIP2 vision encoder processes packed sequences (similar to NaViT): patches from multiple images are concatenated into one row, so self-attention runs as variable-length (varlen) FlashAttention with per-image sequence boundaries; see the sketch below.
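A minimal sketch of that packing pattern, assuming the flash-attn 2 flash_attn_varlen_func API (patch counts, head sizes, and variable names are illustrative, not taken from the Isaac code; requires a GPU and the flash-attn package):

import torch
from flash_attn import flash_attn_varlen_func  # flash-attn 2.x

# Patch counts for three images packed into one sequence (illustrative).
seq_lens = torch.tensor([196, 64, 400], dtype=torch.int32)
cu_seqlens = torch.nn.functional.pad(seq_lens.cumsum(0, dtype=torch.int32), (1, 0))  # [0, 196, 260, 660]
total, n_heads, head_dim = int(seq_lens.sum()), 12, 64

q = torch.randn(total, n_heads, head_dim, dtype=torch.float16, device="cuda")
k, v = torch.randn_like(q), torch.randn_like(q)

# Each image attends only to its own patches; no padding tokens are needed.
out = flash_attn_varlen_func(
    q, k, v,
    cu_seqlens_q=cu_seqlens.cuda(), cu_seqlens_k=cu_seqlens.cuda(),
    max_seqlen_q=int(seq_lens.max()), max_seqlen_k=int(seq_lens.max()),
    causal=False,  # bidirectional attention inside the vision encoder
)
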
  • The positional embeddings form a fixed-size 2D grid of shape (position_embedding_size, position_embedding_size), where each grid point holds an embed_dim-dimensional vector; position_embedding_size = 16 in the config.
  • Then, for a given image whose patch grid is (H_patch, W_patch), they interpolate the 2D grid of positional embeddings to that resolution:
# positional_embeddings: (1, embed_dim, position_embedding_size, position_embedding_size)
resized_pos_embed = F.interpolate(
    positional_embeddings,
    size=(height, width),        # target patch grid (H_patch, W_patch)
    mode=mode,                   # e.g. "bilinear"
    align_corners=align_corners,
    antialias=antialias,
)
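A self-contained sketch of this resize step; the reshaping around F.interpolate (which expects NCHW input) and the embed_dim value are assumptions for illustration, not the exact Isaac code:

import torch
import torch.nn.functional as F

embed_dim, position_embedding_size = 768, 16  # embed_dim is illustrative; grid size per the config
pos_embed = torch.nn.Parameter(torch.randn(position_embedding_size ** 2, embed_dim))

def resize_pos_embed(pos_embed, h_patch, w_patch, mode="bilinear"):
    # (P*P, D) -> (1, D, P, P): F.interpolate expects an NCHW tensor
    grid = pos_embed.reshape(position_embedding_size, position_embedding_size, embed_dim)
    grid = grid.permute(2, 0, 1).unsqueeze(0)
    resized = F.interpolate(grid, size=(h_patch, w_patch), mode=mode,
                            align_corners=False, antialias=True)
    # (1, D, H_patch, W_patch) -> (H_patch * W_patch, D): one vector per patch
    return resized.squeeze(0).permute(1, 2, 0).reshape(h_patch * w_patch, embed_dim)

per_patch_pos = resize_pos_embed(pos_embed, h_patch=24, w_patch=18)  # e.g. a 24 x 18 patch grid
print(per_patch_pos.shape)  # torch.Size([432, 768])
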
  • They use pixel_shuffle_scale_factor = 2 to reduce the number of image tokens by a factor of 4.
    • Why? We’d like the LLM to see fewer vision tokens (reducing prefill cost and time-to-first-token) without throwing away information the vision encoder has already extracted. Conventional fixes (average/max pooling, strided convolution, or token dropping) discard information up front.
    • It lets the vision encoder operate at high resolution (so attention/mixing happens at fine scale).
    • After the vision encoder, it merges local neighborhoods into one token by moving information from space into channels. This is the “pixel shuffle” (a.k.a. space-to-depth) over tokens.
    • Net effect (for scale factor r, here r = 2):
      • Sequence length for vision tokens shrinks by a factor of r² = 4.
      • Per-token channel dimensionality grows by a factor of r² = 4.
      • No information is intrinsically lost at the shuffle itself (it’s a permutation/rearrangement); any compression is learned in the projector that follows.
    • Implementation (the spatial index bookkeeping for the r × r grouping; a fuller sketch including the channel dimension follows below)
# r = pixel_shuffle_scale_factor; h, w = patch-grid height and width
grid = grid.view(h, w)                        # (H, W)
grid = grid.view(h, w // r, r)                # (H, W/r, r)
grid = grid.view(h // r, r, w // r, r)        # (H/r, r, W/r, r)
grid = grid.permute(0, 2, 1, 3).contiguous()  # (H/r, W/r, r, r): each r x r neighborhood grouped together
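For concreteness, a minimal self-contained sketch of pixel shuffle over vision tokens, including the channel dimension; the shapes and names are illustrative assumptions, not the exact Isaac code:

import torch

def pixel_shuffle_tokens(x, h, w, r=2):
    """Space-to-depth over tokens: (H*W, D) -> (H/r * W/r, D * r^2)."""
    hw, d = x.shape
    assert hw == h * w and h % r == 0 and w % r == 0
    x = x.view(h // r, r, w // r, r, d)          # split each spatial axis into (block, within-block)
    x = x.permute(0, 2, 1, 3, 4).contiguous()    # (H/r, W/r, r, r, D): group r x r neighborhoods
    return x.view(h // r * (w // r), r * r * d)  # merge each neighborhood into one wider token

tokens = torch.randn(32 * 32, 1152)              # e.g. a 32 x 32 patch grid from the vision encoder
merged = pixel_shuffle_tokens(tokens, h=32, w=32, r=2)
print(merged.shape)                              # torch.Size([256, 4608]): 4x fewer tokens, 4x wider
# The MLP projector that follows then maps the widened tokens to the language-model hidden size.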