In the Cosmos tokenizer, they say
"Our tokenizer operates in the wavelet space, where inputs are first processed by a 2-level wavelet transform. Specifically, the wavelet transform maps the input video x_{0:T} in a group-wise manner to downsample the inputs by a factor of 4 along x, y, and t. The groups are formed as: {x_0, x_{1:4}, x_{5:8}, ..., x_{T-3:T}} → {g_0, g_1, g_2, ..., g_{T/4}}."
Let's unpack that step by step.
1. Start small: 1D Haar wavelet (on a sequence)
Take 4 numbers in a row, say pixel intensities along a line: x0, x1, x2, x3.
A 1D Haar wavelet transform does two things:
- Averages neighbors (low-frequency / smooth info)
- Differences neighbors (high-frequency / detail info)
Level 1:
- Averages: L0 = (x0 + x1)/2, L1 = (x2 + x3)/2
- Details: H0 = (x0 − x1)/2, H1 = (x2 − x3)/2
So 4 numbers → still 4 numbers: (L0, L1, H0, H1).
But:
- The L's live on a downsampled grid (2 positions instead of 4).
- The H's tell you how much local variation you lost when averaging.
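This level-1 step fits in a few lines of Python (a toy sketch using the plain average/difference convention from this section, not the orthonormal 1/√2 Haar normalization):

```python
def haar_level1(x):
    """One Haar level: pairwise averages (low-pass) and differences (high-pass)."""
    assert len(x) % 2 == 0
    L = [(x[i] + x[i + 1]) / 2 for i in range(0, len(x), 2)]  # smooth info
    H = [(x[i] - x[i + 1]) / 2 for i in range(0, len(x), 2)]  # detail info
    return L, H

L, H = haar_level1([9, 7, 3, 5])
# L = [8.0, 4.0], H = [1.0, -1.0]; note x[0] = L[0] + H[0] and x[1] = L[0] - H[0],
# so nothing is lost -- the transform is invertible.
```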
2-level 1D Haar
Now do another Haar transform on the L's:
- Take L0 and L1 and compute:
- new average: LL0 = (L0 + L1)/2
- new detail: LH0 = (L0 − L1)/2
So overall you get 4 coefficients:
- LL0: very low frequency (coarsest)
- LH0: medium-scale detail
- H0, H1: fine-scale details
Notice:
- Input length 4 → you end up with 1 position at the coarsest scale.
- That's a downsampling by a factor of 4 along the sequence, but you keep the information in extra channels (the detail coefficients LH0, H0, H1).
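Stacking a second level on the low-pass half gives the full 4-in, 4-out picture (same toy average/difference convention; the level-1 helper is repeated so the sketch is self-contained):

```python
def haar_level1(x):
    L = [(x[i] + x[i + 1]) / 2 for i in range(0, len(x), 2)]
    H = [(x[i] - x[i + 1]) / 2 for i in range(0, len(x), 2)]
    return L, H

def haar_2level(x):
    # Level 1 on the input, then level 2 on the low-pass part only.
    L1, H1 = haar_level1(x)   # 2 averages + 2 details
    L2, H2 = haar_level1(L1)  # 1 coarse average + 1 mid-scale detail
    return L2 + H2 + H1       # 4 coefficients, coarsest first

print(haar_2level([9, 7, 3, 5]))  # [6.0, 2.0, 1.0, -1.0]
```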
2. Extend to images: 2D Haar wavelet
For 2D (an image), you do this separately along x and y (separable transform):
- Apply 1D Haar on rows (x direction).
- Apply 1D Haar on columns (y direction) of both low and high parts.
This gives you 4 subbands per level:
- LL (low in x, low in y) → smooth image
- LH (low in x, high in y) → horizontal edges (intensity varies along y)
- HL (high in x, low in y) → vertical edges (intensity varies along x)
- HH (high in x, high in y) → diagonal detail
Again, the LL part is downsampled, and you recurse on LL for multiple levels.
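One level of the separable 2D version can be sketched with NumPy slicing (toy code; the first letter of each subband name is the x filter and the second is the y filter, matching the list above). A test image that varies only along y puts all of its detail energy into the band that is low-pass in x and high-pass in y:

```python
import numpy as np

def haar2d_level1(img):
    """One level of separable 2D Haar (average/difference convention)."""
    # Along x (pair up columns):
    Lx = (img[:, 0::2] + img[:, 1::2]) / 2
    Hx = (img[:, 0::2] - img[:, 1::2]) / 2
    # Along y (pair up rows), on both parts -> four subbands:
    LL = (Lx[0::2, :] + Lx[1::2, :]) / 2   # low x, low y
    LH = (Lx[0::2, :] - Lx[1::2, :]) / 2   # low x, high y
    HL = (Hx[0::2, :] + Hx[1::2, :]) / 2   # high x, low y
    HH = (Hx[0::2, :] - Hx[1::2, :]) / 2   # high x, high y
    return LL, LH, HL, HH

img = np.zeros((4, 4))
img[0, :] = 1.0                            # intensity varies along y only
LL, LH, HL, HH = haar2d_level1(img)
# All detail energy lands in LH; HL and HH are exactly zero.
```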
3. Now video: 3D Haar (x, y, t)
A video is a 3D signal:
- x: width
- y: height
- t: time (frames)
A 3D Haar transform just applies the same idea along all three dimensions:
- 1D Haar along x
- 1D Haar along y
- 1D Haar along t
(because it's separable, you can do this in any order).
For a 2-level 3D transform:
- Along each axis (x, y, t), you downsample by 2 per level.
- 2 levels → downsample by 2 × 2 = 4 along each dimension.
So:
- Original grid: T × H × W
- After 2 levels of 3D Haar: (T/4) × (H/4) × (W/4)
BUT each surviving position now carries many channels of wavelet coefficients (a complete 2-level 3D Haar turns every 4 × 4 × 4 input block into 64 coefficients spread across the subbands), so information is not simply lost; it's redistributed as multi-scale coefficients.
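To make the bookkeeping concrete, here is a NumPy sketch of one separable 3D Haar level applied twice. The axis order (t, y, x), the helper names, and the random input are illustrative choices, not the Cosmos implementation; the point is that the coarsest grid shrinks by 4 per axis while the total coefficient count is unchanged:

```python
import numpy as np

def haar1d(a, axis):
    # One Haar level along `axis`: pairwise average (low) and difference (high).
    lo = np.take(a, range(0, a.shape[axis], 2), axis=axis)
    hi = np.take(a, range(1, a.shape[axis], 2), axis=axis)
    return (lo + hi) / 2, (lo - hi) / 2

def haar3d_level(vol):
    # Separable 3D Haar: filter along each axis in turn -> 8 subbands.
    subbands = [vol]
    for axis in range(3):
        subbands = [part for s in subbands for part in haar1d(s, axis)]
    return subbands  # subbands[0] is the LLL (all-low-pass) band

video = np.random.rand(8, 16, 16)     # (t, y, x)
level1 = haar3d_level(video)          # 8 subbands, each (4, 8, 8)
level2 = haar3d_level(level1[0])      # recurse on LLL: 8 subbands, each (2, 4, 4)

# Coarsest grid is (8/4, 16/4, 16/4); no coefficients are gained or lost.
n_coeffs = sum(s.size for s in level1[1:]) + sum(s.size for s in level2)
assert level2[0].shape == (2, 4, 4) and n_coeffs == video.size
```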