From 1D RoPE to 3D / multimodal RoPE

  • For a given input (text, image, video, audio), we have a corresponding hidden feature that exists throughout the forward pass; call it $x$.

    • The shape of $x$ is:
      • $(B, T, H, W, D)$ for video
      • $(B, H, W, D)$ for image
      • $(B, T, D)$ for audio
      • $(B, L, D)$ for text
      • Note that the “sequence length” is dependent on the modality.
    • Ultimately, self-attention is a sequence mixing operator that operates on a single dimension, so it expects $x$ with shape $(B, S, D)$, where $S$ is the product of the non-batching dimensions excluding $D$, i.e. we flatten $(T, H, W)$ into a single sequence axis $S$ (see the flattening sketch right after this list).
  • Let’s see how we can inject the necessary positional information into each token before flattening.
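
For reference, here is a minimal sketch of the flattening step described above (the shapes, variable names, and use of PyTorch are illustrative assumptions, not taken from a specific implementation):

```python
import torch

# Illustrative per-modality features; D is the channel dim shared by all modalities.
B, D = 2, 64
video = torch.randn(B, 4, 8, 8, D)    # (B, T, H, W, D)
image = torch.randn(B, 8, 8, D)       # (B, H, W, D)
audio = torch.randn(B, 50, D)         # (B, T, D)
text  = torch.randn(B, 16, D)         # (B, L, D)

def flatten_tokens(x: torch.Tensor) -> torch.Tensor:
    """Collapse every non-batch, non-channel dimension into one sequence axis S."""
    return x.reshape(x.shape[0], -1, x.shape[-1])

for name, x in [("video", video), ("image", image), ("audio", audio), ("text", text)]:
    print(name, tuple(x.shape), "->", tuple(flatten_tokens(x).shape))
```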

Vanilla RoPE:
For a token at position $p$, you rotate each pair of channels $(x_{2i}, x_{2i+1})$ by an angle $p \cdot \theta_i$. All that matters is that $p$ is a scalar index.
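
As a concrete reference point, here is a minimal sketch of vanilla 1D RoPE in this rotate-channel-pairs form (the base of 10000 and the interleaved pairing are common defaults assumed here, not tied to a particular model):

```python
import torch

def rope_1d(x: torch.Tensor, pos: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate each channel pair (2i, 2i+1) of x by angle pos * theta_i,
    with theta_i = base^(-2i/d). x: (..., seq, d), pos: (seq,)."""
    d = x.shape[-1]
    theta = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)  # (d/2,)
    angles = pos[:, None].float() * theta[None, :]                     # (seq, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]            # pair up channels
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin           # 2D rotation per pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

q = torch.randn(1, 16, 64)                         # (batch, seq, head_dim)
q_rot = rope_1d(q, torch.arange(16))
```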

3D / M-RoPE idea:
Instead of a single scalar position $p$, we want a 3D coordinate per token:

$$(t,\ h,\ w)$$

where:

  • $t$ = time (absolute time across the whole conversation / audio / video),

  • $h$ = height index in an image/video grid,

  • $w$ = width index in an image/video grid.

The rotary embedding then splits the feature dimension into three equal chunks in order to modulate Q/K:

  • One third of the head dimensions encodes temporal variation ($t$),
  • Another third encodes vertical structure ($h$),
  • The last third encodes horizontal structure ($w$).

Text and audio don’t really have spatial structure, so for those tokens the same position ID is reused for all three axes, which effectively collapses back to 1D RoPE (as sketched below).
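
Here is a toy sketch of this 3D scheme, reusing the interleaved-pair rotation from above (the chunk ordering, sizes, and grid construction are illustrative assumptions, not the exact Qwen-Omni implementation):

```python
import torch

def mrope(x: torch.Tensor, pos_3d: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Split the channel dim into three equal chunks and rotate each chunk with
    its own coordinate (t, h, w). x: (seq, d) with d divisible by 6,
    pos_3d: (seq, 3) integer position IDs."""
    seq, d = x.shape
    chunk = d // 3
    out = []
    for axis in range(3):                          # 0 = t, 1 = h, 2 = w
        xc = x[:, axis * chunk:(axis + 1) * chunk]
        theta = base ** (-torch.arange(0, chunk, 2, dtype=torch.float32) / chunk)
        ang = pos_3d[:, axis:axis + 1].float() * theta[None, :]
        cos, sin = ang.cos(), ang.sin()
        x1, x2 = xc[:, 0::2], xc[:, 1::2]
        rc = torch.empty_like(xc)
        rc[:, 0::2] = x1 * cos - x2 * sin
        rc[:, 1::2] = x1 * sin + x2 * cos
        out.append(rc)
    return torch.cat(out, dim=-1)

# Image/video tokens get a real (t, h, w) grid; text tokens reuse one scalar
# position on all three axes.
t, h, w = torch.meshgrid(torch.arange(2), torch.arange(3), torch.arange(3), indexing="ij")
img_pos = torch.stack([t.flatten(), h.flatten(), w.flatten()], dim=-1)   # (18, 3)
txt_pos = torch.arange(5)[:, None].repeat(1, 3)                          # (5, 3), same ID 3x

q_img = mrope(torch.randn(18, 96), img_pos)
q_txt = mrope(torch.randn(5, 96), txt_pos)
```

With `txt_pos`, every chunk sees the same scalar position, which is the sense in which text (and audio) collapse back to 1D behavior.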

Little quirks about aligning time

  • Usually, when dealing with audio, if we use a classic neural audio codec or tokenizer, one audio token (abstracting away RVQ) represents a fixed amount of time, e.g. 80 ms.

    • This is dependent on the implementation of the codec, but in Mimi or Qwen-Omni, the audio is resampled to 16 kHz and the raw waveform is converted into a 128-channel mel-spectrogram with a 25 ms window and a 10 ms hop. With 8x temporal downsampling, one token covers 10 ms * 8 = 80 ms.
  • When dealing with videos, the actual time represented by a single video frame depends on the frame rate (frames per second, FPS).

  • Thus, at the token level, we cannot easily derive time information from sequence position, and the model cannot be aware of how the different modalities were sampled.

  • For this case, the Qwen-Omni paper uses absolute temporal encodings, i.e. each increment of the temporal ID represents a fixed amount of time (e.g. 80 ms, usually set to the amount of time one audio token represents); see the sketch after this list.

  • Note that “absolute” here is just in the sense that the temporal IDs carry a real-time meaning, but we’re still using RoPE.
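
To make the bookkeeping concrete, here is a small sketch of how such absolute temporal IDs could be assigned (the 80 ms quantum follows the audio-token duration above; the function names and the 2 FPS example are made up for illustration, not the exact Qwen-Omni recipe):

```python
TIME_PER_ID_MS = 80          # one temporal ID unit = 80 ms of real time

def audio_token_duration_ms(hop_ms: int = 10, downsample: int = 8) -> int:
    """10 ms mel hop with 8x temporal downsampling -> 80 ms per audio token."""
    return hop_ms * downsample

def audio_temporal_ids(n_tokens: int) -> list[int]:
    dur = audio_token_duration_ms()
    return [(i * dur) // TIME_PER_ID_MS for i in range(n_tokens)]

def video_temporal_ids(n_frames: int, fps: float) -> list[int]:
    """A frame shown at time i / fps gets ID floor(time_ms / 80)."""
    return [int((i / fps) * 1000) // TIME_PER_ID_MS for i in range(n_frames)]

print(audio_temporal_ids(5))        # [0, 1, 2, 3, 4]: one ID step per audio token
print(video_temporal_ids(5, fps=2)) # [0, 6, 12, 18, 25]: 500 ms between frames
```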