In the Cosmos tokenizer, they say
"Our tokenizer operates in the wavelet space, where inputs are first processed by a 2-level wavelet transform. Specifically, the wavelet transform maps the input video x_{0:T} in a group-wise manner to downsample the inputs by a factor of 4 along x, y, and t. The groups are formed as: {x_0, x_{1:4}, x_{5:8}, ..., x_{T-3:T}} → {g_0, g_1, g_2, ..., g_{T/4}}."
Let's unpack that step by step.
1. Start small: 1D Haar wavelet (on a sequence)
Take 4 numbers in a row, say pixel intensities along a line: x0, x1, x2, x3.
A 1D Haar wavelet transform does two things:
- Averages neighbors (low-frequency / smooth info)
- Differences neighbors (high-frequency / detail info)
Level 1:
- Averages: L0 = (x0 + x1)/2, L1 = (x2 + x3)/2
- Details: H0 = (x0 − x1)/2, H1 = (x2 − x3)/2
So 4 numbers → still 4 numbers: (L0, L1, H0, H1).
But:
- The L's live on a downsampled grid (2 positions instead of 4).
- The H's tell you how much local variation you lost when averaging.
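This level-1 step fits in a few lines of Python (a toy sketch using the plain average/difference convention from this section, not the orthonormal 1/√2 Haar normalization):

```python
def haar_level1(x):
    """One Haar level: pairwise averages (low-pass) and differences (high-pass)."""
    assert len(x) % 2 == 0
    L = [(x[i] + x[i + 1]) / 2 for i in range(0, len(x), 2)]  # smooth info
    H = [(x[i] - x[i + 1]) / 2 for i in range(0, len(x), 2)]  # detail info
    return L, H

L, H = haar_level1([9, 7, 3, 5])
# L = [8.0, 4.0], H = [1.0, -1.0]; note x[0] = L[0] + H[0] and x[1] = L[0] - H[0],
# so nothing is lost -- the transform is invertible.
```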
2-level 1D Haar
Now do another Haar transform on the L's:
- Take L0 and L1 and compute:
- new average: LL0 = (L0 + L1)/2
- new detail: LH0 = (L0 − L1)/2
So overall you get 4 coefficients:
- LL0: very low frequency (coarsest)
- LH0: medium-scale detail
- H0, H1: fine-scale details
Notice:
- Input length 4 → you end up with 1 position at the coarsest scale.
- That's a downsampling by a factor of 4 along the sequence, but you keep the information in extra channels (the detail coefficients LH0, H0, H1).
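Stacking a second level on the low-pass half gives the full 4-in, 4-out picture (same toy average/difference convention; the level-1 helper is repeated so the sketch is self-contained):

```python
def haar_level1(x):
    L = [(x[i] + x[i + 1]) / 2 for i in range(0, len(x), 2)]
    H = [(x[i] - x[i + 1]) / 2 for i in range(0, len(x), 2)]
    return L, H

def haar_2level(x):
    # Level 1 on the input, then level 2 on the low-pass part only.
    L1, H1 = haar_level1(x)   # 2 averages + 2 details
    L2, H2 = haar_level1(L1)  # 1 coarse average + 1 mid-scale detail
    return L2 + H2 + H1       # 4 coefficients, coarsest first

print(haar_2level([9, 7, 3, 5]))  # [6.0, 2.0, 1.0, -1.0]
```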
2. Extend to images: 2D Haar wavelet
For 2D (an image), you do this separately along x and y (separable transform):
- Apply 1D Haar on rows (x direction).
- Apply 1D Haar on columns (y direction) of both low and high parts.
This gives you 4 subbands per level:
- LL (low in x, low in y) → smooth image
- LH (low in x, high in y) → horizontal edges (intensity varies along y)
- HL (high in x, low in y) → vertical edges (intensity varies along x)
- HH (high in x, high in y) → diagonal detail
Again, the LL part is downsampled, and you recurse on LL for multiple levels.
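One level of the separable 2D version can be sketched with NumPy slicing (toy code; the first letter of each subband name is the x filter and the second is the y filter, matching the list above). A test image that varies only along y puts all of its detail energy into the band that is low-pass in x and high-pass in y:

```python
import numpy as np

def haar2d_level1(img):
    """One level of separable 2D Haar (average/difference convention)."""
    # Along x (pair up columns):
    Lx = (img[:, 0::2] + img[:, 1::2]) / 2
    Hx = (img[:, 0::2] - img[:, 1::2]) / 2
    # Along y (pair up rows), on both parts -> four subbands:
    LL = (Lx[0::2, :] + Lx[1::2, :]) / 2   # low x, low y
    LH = (Lx[0::2, :] - Lx[1::2, :]) / 2   # low x, high y
    HL = (Hx[0::2, :] + Hx[1::2, :]) / 2   # high x, low y
    HH = (Hx[0::2, :] - Hx[1::2, :]) / 2   # high x, high y
    return LL, LH, HL, HH

img = np.zeros((4, 4))
img[0, :] = 1.0                            # intensity varies along y only
LL, LH, HL, HH = haar2d_level1(img)
# All detail energy lands in LH; HL and HH are exactly zero.
```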
3. Now video: 3D Haar (x, y, t)
A video is a 3D signal:
- x: width
- y: height
- t: time (frames)
A 3D Haar transform just applies the same idea along all three dimensions:
- 1D Haar along x
- 1D Haar along y
- 1D Haar along t
(because it's separable, you can do this in any order).
For a 2-level 3D transform:
- Along each axis (x, y, t), you downsample by 2 per level.
- 2 levels → downsample by 2 × 2 = 4 along each dimension.
So:
- Original grid: T × H × W
- After 2 levels of 3D Haar: (T/4) × (H/4) × (W/4)
BUT each surviving position now carries many channels of wavelet coefficients (a complete 2-level 3D Haar turns every 4 × 4 × 4 input block into 64 coefficients spread across the subbands), so information is not simply lost; it's redistributed as multi-scale coefficients.
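To make the bookkeeping concrete, here is a NumPy sketch of one separable 3D Haar level applied twice. The axis order (t, y, x), the helper names, and the random input are illustrative choices, not the Cosmos implementation; the point is that the coarsest grid shrinks by 4 per axis while the total coefficient count is unchanged:

```python
import numpy as np

def haar1d(a, axis):
    # One Haar level along `axis`: pairwise average (low) and difference (high).
    lo = np.take(a, range(0, a.shape[axis], 2), axis=axis)
    hi = np.take(a, range(1, a.shape[axis], 2), axis=axis)
    return (lo + hi) / 2, (lo - hi) / 2

def haar3d_level(vol):
    # Separable 3D Haar: filter along each axis in turn -> 8 subbands.
    subbands = [vol]
    for axis in range(3):
        subbands = [part for s in subbands for part in haar1d(s, axis)]
    return subbands  # subbands[0] is the LLL (all-low-pass) band

video = np.random.rand(8, 16, 16)     # (t, y, x)
level1 = haar3d_level(video)          # 8 subbands, each (4, 8, 8)
level2 = haar3d_level(level1[0])      # recurse on LLL: 8 subbands, each (2, 4, 4)

# Coarsest grid is (8/4, 16/4, 16/4); no coefficients are gained or lost.
n_coeffs = sum(s.size for s in level1[1:]) + sum(s.size for s in level2)
assert level2[0].shape == (2, 4, 4) and n_coeffs == video.size
```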