From 1D RoPE to 3D / multimodal RoPE

  • For a given input (text, image, video, audio), we have a corresponding hidden feature that exists throughout the forward pass; call it $x$.

    • The shape of $x$ is:
      • $(B, T, H, W, D)$ for video
      • $(B, H, W, D)$ for image
      • $(B, T, D)$ for audio
      • $(B, L, D)$ for text
      • Note that the “sequence length” is dependent on the modality.
    • Ultimately, self-attention is a sequence mixing operator that operates on a single dimension, so it expects $x$ with shape $(B, S, D)$, where $S$ is the product of the non-batching dimensions excluding $D$, i.e. we flatten $(T, H, W)$ into a single sequence axis $S$ (see the flattening sketch right after this list).
  • Let’s see how we can inject the necessary positional information into each token before flattening.
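
For reference, here is a minimal sketch of the flattening step described above (the shapes, variable names, and use of PyTorch are illustrative assumptions, not taken from a specific implementation):

```python
import torch

# Illustrative per-modality features; D is the channel dim shared by all modalities.
B, D = 2, 64
video = torch.randn(B, 4, 8, 8, D)    # (B, T, H, W, D)
image = torch.randn(B, 8, 8, D)       # (B, H, W, D)
audio = torch.randn(B, 50, D)         # (B, T, D)
text  = torch.randn(B, 16, D)         # (B, L, D)

def flatten_tokens(x: torch.Tensor) -> torch.Tensor:
    """Collapse every non-batch, non-channel dimension into one sequence axis S."""
    return x.reshape(x.shape[0], -1, x.shape[-1])

for name, x in [("video", video), ("image", image), ("audio", audio), ("text", text)]:
    print(name, tuple(x.shape), "->", tuple(flatten_tokens(x).shape))
```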

Vanilla RoPE:
For a token at position $p$, you rotate each pair of channels $(x_{2i}, x_{2i+1})$ by an angle $p \cdot \theta_i$. All that matters is that $p$ is a scalar index.
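
As a concrete reference point, here is a minimal sketch of vanilla 1D RoPE in this rotate-channel-pairs form (the base of 10000 and the interleaved pairing are common defaults assumed here, not tied to a particular model):

```python
import torch

def rope_1d(x: torch.Tensor, pos: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate each channel pair (2i, 2i+1) of x by angle pos * theta_i,
    with theta_i = base^(-2i/d). x: (..., seq, d), pos: (seq,)."""
    d = x.shape[-1]
    theta = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)  # (d/2,)
    angles = pos[:, None].float() * theta[None, :]                     # (seq, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]            # pair up channels
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin           # 2D rotation per pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

q = torch.randn(1, 16, 64)                         # (batch, seq, head_dim)
q_rot = rope_1d(q, torch.arange(16))
```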

3D / M-RoPE idea:
Instead of a single scalar position $p$, we want a 3D coordinate per token:

$$(t,\ h,\ w)$$

where:

  • $t$ = time (absolute time across the whole conversation / audio / video),

  • $h$ = height index in an image/video grid,

  • $w$ = width index in an image/video grid.

The rotary embedding then splits the feature dimension into three equal chunks in order to modulate Q/K:

  • One third of the head dimensions encodes temporal variation ($t$),
  • Another third encodes vertical structure ($h$),
  • The last third encodes horizontal structure ($w$).

Text and audio don’t really have spatial structure, so for those tokens the same position ID is reused for all three axes, which effectively collapses back to 1D RoPE (as sketched below).
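
Here is a toy sketch of this 3D scheme, reusing the interleaved-pair rotation from above (the chunk ordering, sizes, and grid construction are illustrative assumptions, not the exact Qwen-Omni implementation):

```python
import torch

def mrope(x: torch.Tensor, pos_3d: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Split the channel dim into three equal chunks and rotate each chunk with
    its own coordinate (t, h, w). x: (seq, d) with d divisible by 6,
    pos_3d: (seq, 3) integer position IDs."""
    seq, d = x.shape
    chunk = d // 3
    out = []
    for axis in range(3):                          # 0 = t, 1 = h, 2 = w
        xc = x[:, axis * chunk:(axis + 1) * chunk]
        theta = base ** (-torch.arange(0, chunk, 2, dtype=torch.float32) / chunk)
        ang = pos_3d[:, axis:axis + 1].float() * theta[None, :]
        cos, sin = ang.cos(), ang.sin()
        x1, x2 = xc[:, 0::2], xc[:, 1::2]
        rc = torch.empty_like(xc)
        rc[:, 0::2] = x1 * cos - x2 * sin
        rc[:, 1::2] = x1 * sin + x2 * cos
        out.append(rc)
    return torch.cat(out, dim=-1)

# Image/video tokens get a real (t, h, w) grid; text tokens reuse one scalar
# position on all three axes.
t, h, w = torch.meshgrid(torch.arange(2), torch.arange(3), torch.arange(3), indexing="ij")
img_pos = torch.stack([t.flatten(), h.flatten(), w.flatten()], dim=-1)   # (18, 3)
txt_pos = torch.arange(5)[:, None].repeat(1, 3)                          # (5, 3), same ID 3x

q_img = mrope(torch.randn(18, 96), img_pos)
q_txt = mrope(torch.randn(5, 96), txt_pos)
```

With `txt_pos`, every chunk sees the same scalar position, which is the sense in which text (and audio) collapse back to 1D behavior.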

Little quirks about aligning time

  • Usually, when dealing with audio, if we use a classic neural audio codec or tokenizer, one audio token (abstracting away RVQ) represents a fixed amount of time, e.g. 80 ms.

    • This is dependent on the implementation of the codec, but in Mimi or Qwen-Omni, the audio is resampled to 16 kHz and the raw waveform is converted into a 128-channel mel-spectrogram with a 25 ms window and a 10 ms hop. With 8x temporal downsampling, one token covers 10 ms * 8 = 80 ms.
  • When dealing with videos, the actual time represented by a single video frame depends on the frame rate (frames per second, FPS).

  • Thus, at the token level, we cannot easily derive time information from sequence position, and the model cannot be aware of how the different modalities were sampled.

  • For this case, the Qwen-Omni paper uses absolute temporal encodings, i.e. each increment of the temporal ID represents a fixed amount of time (e.g. 80 ms, usually set to the amount of time one audio token represents); see the sketch after this list.

  • Note that “absolute” here is just in the sense that the temporal IDs carry a real-time meaning, but we’re still using RoPE.
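
To make the bookkeeping concrete, here is a small sketch of how such absolute temporal IDs could be assigned (the 80 ms quantum follows the audio-token duration above; the function names and the 2 FPS example are made up for illustration, not the exact Qwen-Omni recipe):

```python
TIME_PER_ID_MS = 80          # one temporal ID unit = 80 ms of real time

def audio_token_duration_ms(hop_ms: int = 10, downsample: int = 8) -> int:
    """10 ms mel hop with 8x temporal downsampling -> 80 ms per audio token."""
    return hop_ms * downsample

def audio_temporal_ids(n_tokens: int) -> list[int]:
    dur = audio_token_duration_ms()
    return [(i * dur) // TIME_PER_ID_MS for i in range(n_tokens)]

def video_temporal_ids(n_frames: int, fps: float) -> list[int]:
    """A frame shown at time i / fps gets ID floor(time_ms / 80)."""
    return [int((i / fps) * 1000) // TIME_PER_ID_MS for i in range(n_frames)]

print(audio_temporal_ids(5))        # [0, 1, 2, 3, 4]: one ID step per audio token
print(video_temporal_ids(5, fps=2)) # [0, 6, 12, 18, 25]: 500 ms between frames
```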