In the Cosmos tokenizer, they say

”Our tokenizer operates in the wavelet space, where inputs are first processed by a 2-level wavelet transform. Specifically, the wavelet transform maps the input video x_{0:T} in a group-wise manner to downsample the inputs by a factor of 4 along x, y, and t. The groups are formed as: {x_0, x_{1:4}, x_{5:8}, …, x_{(T−3):T}} → {g_0, g_1, …, g_{T′}}.”

Let’s unpack that step by step.


1. Start small: 1D Haar wavelet (on a sequence)

Take 4 numbers in a row, say pixel intensities along a line:

  x = [x_0, x_1, x_2, x_3]

A 1D Haar wavelet transform does two things:

  1. Averages neighbors (low-frequency / smooth info)
  2. Differences neighbors (high-frequency / detail info)

Level 1:

  • Averages: L_0 = (x_0 + x_1)/2, L_1 = (x_2 + x_3)/2
  • Details: H_0 = (x_0 − x_1)/2, H_1 = (x_2 − x_3)/2

So 4 numbers → still 4 numbers:

  [x_0, x_1, x_2, x_3] → [L_0, L_1, H_0, H_1]

But:

  • The L’s live on a downsampled grid (2 positions instead of 4).
  • The H’s tell you how much local variation you lost when averaging.
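The level-1 step above can be sketched in a few lines of NumPy. This uses the simple average/difference normalization from the text; an orthonormal Haar transform would scale by 1/√2 instead:

```python
import numpy as np

def haar_level1(x):
    """One Haar level on a 1D signal: pairwise averages and differences.

    Uses the plain (a + b)/2, (a - b)/2 normalization for readability;
    the orthonormal Haar transform scales by 1/sqrt(2) instead.
    """
    x = np.asarray(x, dtype=float)
    avg = (x[0::2] + x[1::2]) / 2  # low-pass: the L's
    det = (x[0::2] - x[1::2]) / 2  # high-pass: the H's
    return avg, det

avg, det = haar_level1([9, 7, 3, 5])
print(avg)  # -> [8. 4.], the L's on a grid of 2 positions
print(det)  # -> [1. -1.], the H's: the variation lost by averaging
```

Note that [avg, det] together still hold 4 numbers, so nothing is lost: 8 ± 1 recovers 9 and 7, and 4 ± (−1) recovers 3 and 5.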

2-level 1D Haar

Now do another Haar transform on the L’s:

  • Take L_0 and L_1 and compute:
    • new average: L'_0 = (L_0 + L_1)/2
    • new detail: H'_0 = (L_0 − L_1)/2

So overall you get 4 coefficients:

  [L'_0, H'_0, H_0, H_1]

  • L'_0: very low frequency (coarsest)
  • H'_0: medium-scale detail
  • H_0, H_1: fine-scale details

Notice:

  • Input length 4 → you end up with 1 position at the coarsest scale (the single L'_0).
  • That’s a downsampling by a factor of 4, but you keep the information in extra channels (the H’s).
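The full 2-level recursion is just "transform once, then transform the averages again". A minimal sketch, with the same average/difference convention:

```python
import numpy as np

def haar_level1(x):
    x = np.asarray(x, dtype=float)
    return (x[0::2] + x[1::2]) / 2, (x[0::2] - x[1::2]) / 2

def haar_2level(x):
    """Two-level Haar: transform once, then transform the averages again."""
    L, H = haar_level1(x)    # level 1: averages [L_0, L_1], details [H_0, H_1]
    L2, H2 = haar_level1(L)  # level 2: recurse on the averages only
    return L2, H2, H         # coarsest average, mid-scale detail, fine details

L2, H2, H = haar_2level([9, 7, 3, 5])
print(L2, H2, H)  # -> [6.], [2.], [1. -1.]
```

4 input samples in, 1 + 1 + 2 = 4 coefficients out, with only 1 position left at the coarsest scale.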

2. Extend to images: 2D Haar wavelet

For 2D (an image), you do this separately along x and y (separable transform):

  1. Apply 1D Haar on rows (x direction).
  2. Apply 1D Haar on columns (y direction) of both low and high parts.

This gives you 4 subbands per level:

  • LL (low in x, low in y) – smooth image
  • LH (low x, high y) – horizontal edges (intensity changes along y)
  • HL (high x, low y) – vertical edges (intensity changes along x)
  • HH (high x, high y) – diagonal detail

Again, the LL part is downsampled, and you recurse on LL for multiple levels.
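The separable row-then-column trick can be written directly. A sketch (the function and subband names are just illustrative, following the low/high-in-x-then-y convention above):

```python
import numpy as np

def haar1d(a, axis):
    """One Haar level along one axis: (averages, differences)."""
    even = np.arange(0, a.shape[axis], 2)
    odd = even + 1
    return ((a.take(even, axis) + a.take(odd, axis)) / 2,
            (a.take(even, axis) - a.take(odd, axis)) / 2)

def haar2d_level(img):
    """One 2D Haar level via the separable trick: along x, then along y."""
    L, H = haar1d(img, axis=1)   # along x (column index)
    LL, LH = haar1d(L, axis=0)   # along y of the low part
    HL, HH = haar1d(H, axis=0)   # along y of the high part
    return LL, LH, HL, HH

img = np.arange(16, dtype=float).reshape(4, 4)
for name, band in zip(["LL", "LH", "HL", "HH"], haar2d_level(img)):
    print(name, band.shape)  # each subband is 2x2: half resolution in x and y
```

On this smooth gradient image the HH band comes out all zeros, since there is no diagonal detail to capture.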


3. Now video: 3D Haar (x, y, t)

A video is a 3D signal:

  • x: width
  • y: height
  • t: time (frames)

A 3D Haar transform just applies the same idea along all three dimensions:

  1. 1D Haar along x
  2. 1D Haar along y
  3. 1D Haar along t

(Because the transform is separable, you can do this in any order.)

For a 2-level 3D transform:

  • Along each axis (x, y, t), you downsample by 2 per level.
  • 2 levels ⇒ downsample by 2 × 2 = 4 along each dimension.

So:

  • Original grid: T × H × W
  • After 2 levels of 3D Haar: (T/4) × (H/4) × (W/4) positions at the coarsest scale

BUT information is not simply lost: the transform keeps the total number of coefficients. In 1D, each group of 4 samples becomes 4 coefficients; in 3D, each 4 × 4 × 4 block becomes 64 coefficients, redistributed as multi-scale (subband) channels.
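Putting it together for a (t, y, x) video, here is a toy sketch of one 3D Haar level plus the recursion on the all-low subband (names are illustrative, not the Cosmos implementation):

```python
import numpy as np

def haar1d(a, axis):
    """One Haar level along one axis: (averages, differences)."""
    even = np.arange(0, a.shape[axis], 2)
    odd = even + 1
    return ((a.take(even, axis) + a.take(odd, axis)) / 2,
            (a.take(even, axis) - a.take(odd, axis)) / 2)

def haar3d_level(vid):
    """One 3D Haar level: split along t, y, x -> 8 subbands at half resolution."""
    bands = [vid]
    for axis in range(3):  # t, then y, then x (order doesn't matter)
        bands = [b for band in bands for b in haar1d(band, axis)]
    return bands           # LLL, LLH, ..., HHH

vid = np.random.rand(8, 16, 16)          # (t, y, x)
level1 = haar3d_level(vid)
lll2 = haar3d_level(level1[0])[0]        # level 2: recurse on the LLL subband
print(len(level1), level1[0].shape, lll2.shape)
# -> 8 (4, 8, 8) (2, 4, 4): each level halves every axis, so 2 levels give /4
```

Each level turns 1 volume into 8 half-resolution subbands, so the coefficient count is preserved even though the coarsest grid shrinks by 4 per axis.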