Resources:
- https://lucalp.dev/bitter-lesson-tokenization-and-blt/
- https://goombalab.github.io/blog/2025/tradeoffs/
- H-Nets - the Past
- H-Nets - the Future
- https://github.com/goombalab/hnet
- https://arxiv.org/abs/2507.07955
Motivation to go beyond BPE
- BPE is not learned end-to-end, which goes against the bitter lesson
- The token length of a sequence is independent of its actual semantic content
- i.e. "The cat is eating" and "AAAAAA_BBBBBB" may have the same token length, but intuitively, information theory tells you that the second sequence could be compressed much better
- "Noisy" content, e.g. "The cat, hmmm, is, uhh, eating", is not "filtered" out, and will affect the downstream performance of the transformer
- We want dynamic chunking of the input
- To learn this, we need a differentiable chunking mechanism
- How to do this? Grouping discrete units of data together is a discrete selection problem, which is generally very hard for differentiable optimization
- If you think about it, MoE (where each token gets "routed" to k out of a fixed set of "experts") is essentially a discrete selection problem
- Sadly, it's much harder than just straight-up applying MoE
- lots of confusing training behaviours and instabilities
- They tried to figure out how to improve signal propagation into the router weights, including trying techniques like
- norm layers, the straight-through estimator (STE), and Gumbel noise (a quick sketch of the STE idea is below)
- with the per-network norms idea from the paper, you also have to rescale the initialization of each layer accordingly to keep things balanced (not covered in much detail in the paper)
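Since the STE comes up repeatedly, here is a minimal sketch (my own illustration, not the repo's code) of how a straight-through estimator lets gradients reach a router that makes hard 0/1 boundary decisions:

```python
import torch

def hard_boundaries_ste(p: torch.Tensor) -> torch.Tensor:
    """Binarize boundary probabilities p in [0, 1] while keeping the backward
    pass differentiable: the forward pass uses the hard 0/1 decision, the
    backward pass pretends the threshold was the identity (straight-through)."""
    b_hard = (p >= 0.5).float()
    # Numerically, b_hard + p - p.detach() == b_hard, but gradients flow
    # through the soft `p` term, so the router still gets a learning signal.
    return b_hard + p - p.detach()

# Toy usage: a "router" producing boundary probabilities from logits.
logits = torch.randn(8, requires_grad=True)
boundaries = hard_boundaries_ste(torch.sigmoid(logits))
boundaries.sum().backward()
print(logits.grad)  # non-zero despite the discrete forward decision
```

Gumbel noise is the other standard trick in this family: perturb the logits with Gumbel samples before thresholding so the hard choice behaves like a sample from a relaxed discrete distribution.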
H-Net (Dynamic Chunking for End-to-End Hierarchical Sequence Modeling)
Hierarchical Processing.
- The H-Net adopts a hierarchical architecture resembling an autoregressive U-Net:
- (i) raw data is processed by a small encoder network
- (ii) then downsampled and passed through a main network operating on compressed chunks,
- (iii) and finally upsampled before being passed through a decoder network operating on the original resolution.
- This modularity creates a natural processing hierarchy, where outer stages capture fine-grained patterns while inner stages operate on coarse representations akin to traditional tokens.
- Crucially, the main network contains the bulk of the parameters, and it can be any standard architecture designed for operating on tokenized language, such as a Transformer or a state space model (SSM).
- They show that the encoder and decoder networks are strongly improved by using SSMs, which have an inductive bias for compression.
Notation
- The recursive design allows H-Net to scale to arbitrary depths. In an S-stage model, we denote components at each stage using superscripts:
- encoder networks as \mathcal{E}^s and decoder networks as \mathcal{D}^s for stages s = 0, ..., S-1, with the main network \mathcal{M} residing only at the final stage s = S.
- The overall pipeline can be formalized as (a code sketch of this recursion follows below):
- \hat{x}^s = \mathcal{E}^s(x^s), \quad \hat{z}^S = \mathcal{M}(x^S), \quad \hat{z}^s = \mathcal{D}^s(z^s), \tag{1}
- where the chunking layer and the dechunking layer operations are defined as:
- (x^{s+1}, p^s) = \text{Chunk}(\hat{x}^s), \tag{2}
- z^s = \text{Dechunk}(\hat{z}^{s+1}, p^s) + \text{Linear}(\hat{x}^s). \tag{3}
- The initial input to the model is x^0 \in \mathbb{R}^{L \times D}, where L is the input sequence length and D is the embedding dimension.
- Intuitively, p^s represents the chunking router's confidence that a given token should be passed into the main stage. This value is essential for both the chunk and dechunk operations.
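To make equations (1)-(3) concrete, here is a rough PyTorch-style sketch of one recursive stage; the encoder/decoder/chunk/dechunk modules are placeholders, not the actual implementation from github.com/goombalab/hnet:

```python
import torch.nn as nn

class HNetStage(nn.Module):
    """One stage s of an S-stage H-Net. `inner` is either the next HNetStage
    or, at the innermost level, the main network M itself."""

    def __init__(self, encoder, decoder, chunk, dechunk, inner, d_model):
        super().__init__()
        self.encoder, self.decoder = encoder, decoder
        self.chunk, self.dechunk = chunk, dechunk
        self.inner = inner
        self.residual_proj = nn.Linear(d_model, d_model)  # the Linear(x_hat^s) term in (3)

    def forward(self, x):                       # x = x^s
        x_hat = self.encoder(x)                 # x_hat^s = E^s(x^s), eq. (1)
        x_next, p = self.chunk(x_hat)           # (x^{s+1}, p^s) = Chunk(x_hat^s), eq. (2)
        z_hat_next = self.inner(x_next)         # recurse; bottoms out at z_hat^S = M(x^S)
        z = self.dechunk(z_hat_next, p) + self.residual_proj(x_hat)   # eq. (3)
        return self.decoder(z)                  # z_hat^s = D^s(z^s), eq. (1)
```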
Diagram
[architecture diagram from the paper; not reproduced here]
Dynamic chunking
- H-Net's core is a novel dynamic chunking (DC) mechanism which interfaces between the main network and the encoder/decoder networks, learning how to segment data while using standard differentiable optimization.
- DC is composed of two complementary new techniques:
- (i) a routing module which predicts boundaries between adjacent elements through a similarity score (in the chunking)
- (ii) a smoothing module which interpolates representations using the router's outputs, attenuating the effect of uncertain boundaries and significantly improving learnability. (in the dechunking; a rough sketch of both modules follows this list)
- By combining these with a new auxiliary loss function that targets desired downsampling ratios, and modern techniques for gradient-based learning of discrete choices (Switch Transformers and STE), DC lets an H-Net learn how to compress data in a fully end-to-end fashion.
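A hedged sketch of the two modules as I read them: the routing module turns the cosine similarity of adjacent (projected) encoder outputs into a boundary probability, and the smoothing module is an EMA over the dechunked sequence weighted by those probabilities. The projection shapes, the forced first boundary, and the exact placement of the EMA are my assumptions; see the reference code for the real details.

```python
import torch
import torch.nn.functional as F

def routing_module(x_hat: torch.Tensor, W_q: torch.Tensor, W_k: torch.Tensor):
    """Boundary probabilities from adjacent-element (dis)similarity.
    x_hat: (L, D) encoder outputs; W_q, W_k: (D, D) projections.
    A position whose projection differs a lot from its predecessor's
    is likely the start of a new chunk."""
    q, k = x_hat @ W_q, x_hat @ W_k
    cos = F.cosine_similarity(q[1:], k[:-1], dim=-1)   # compare position t with t-1
    p = 0.5 * (1.0 - cos)                              # map [-1, 1] -> [0, 1]
    p = torch.cat([p.new_ones(1), p])                  # force a boundary at t = 0
    b = (p >= 0.5).float()                             # hard decisions (paired with STE)
    return p, b

def smoothing_module(z_up: torch.Tensor, p: torch.Tensor) -> torch.Tensor:
    """EMA over the upsampled inner-stage outputs, gated by the router's
    confidence p: low-confidence boundaries get blended with the previous
    position instead of introducing a hard jump."""
    outs = []
    prev = z_up[0]
    for t in range(z_up.shape[0]):
        prev = p[t] * z_up[t] + (1.0 - p[t]) * prev
        outs.append(prev)
    return torch.stack(outs)
```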
Signal Propagation
- They introduce several architectural and training techniques to improve stability and scalability during end-to-end optimization.
- (i) carefully placing projections and normalization layers to balance signal propagation between interacting sub-networks,
- (ii) adjusting optimization parameters for each layer based on its dimensionality and effective batch size, which change between stages of the hierarchical structure (sounds like spectral muP; a rough illustration follows this list)
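As a rough illustration of point (ii) only (a muP-flavored guess on my part, not the paper's actual recipe), per-layer optimizer settings could be built as width-dependent parameter groups:

```python
import torch
import torch.nn as nn

def width_scaled_param_groups(model: nn.Module, base_lr: float = 3e-4, base_width: int = 1024):
    """Hypothetical muP-style grouping: scale each Linear layer's learning
    rate inversely with its fan-in, so wide inner-stage layers take
    proportionally smaller steps than narrow byte-level outer layers."""
    groups = []
    for name, module in model.named_modules():
        if isinstance(module, nn.Linear):
            scale = base_width / module.in_features
            groups.append({"params": list(module.parameters()), "lr": base_lr * scale})
    return groups

# optimizer = torch.optim.AdamW(width_scaled_param_groups(model))
```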
Further thoughts about H-Net and possible improvements
Possible improvements to H-Net
- Architecture components: the final architecture included post-norm layers at the end of every sub-network, as well as linear projections on the residual connection. They spent a long time trying to remove these since they were non-standard techniques (most U-Nets don't have these) that complicated the model. But in the end, they seem very useful, maybe essential.
- Sparsity auxiliary loss: The auxiliary loss targets a specific downsampling ratio (see the sketch after this list). This seems a little artificial and introduces an annoying new hyperparameter (thankfully, the only one!). It would be cleaner to simply impose a "sparsity loss" that encouraged higher compression rates (it would be counter-balanced by the main loss function, which encourages more compute, hence lower compression rates). Some alternatives sort of worked, but they didn't seem as consistent, so for this version of the H-Net, they kept the targeted "ratio loss".
- Chunking mechanisms: They tried many different variations of the routing module, upsampling step, downsampling step, smoothing module, and every other component, but were only able to report a small subset in the ablations.
- Layer allocation: Normal LLM scaling laws might be concerned with how to scale the width and depth of the model as the parameter count grows. But here, there are so many more choices for how many layers to put in each sub-network, how wide to make them, and so on. Beyond that, the question of which layers to use (e.g. Transformer vs. Mamba) in each network is also not obvious, and required tons of experiments to understand.
Using SSMs vs Transformers in H-Net (Compression as a feature)
- The question is: "Is compression a bug or a feature?"
- SSMs perhaps have hidden strengths due to their compressive abilities.
- Let's compare the inductive bias of SSMs and Transformers, for example, by contrasting these two facts:
- Transformers and SSMs have similar performance on tokenized language (with caveats of recall)
- Transformers seriously underperform SSMs on untokenized language
- The intuitive explanation is that on high-resolution data without meaning (such as characters), attention is a poor inductive bias, and understanding the data requires compressing it
- Ablations in the H-Net paper corroborate this: any parts of the model interacting directly with byte-level resolution strongly benefit from SSM layers!!
This raises a natural question though: is the importance of SSM layers because
- they are simply better at processing fine-grained byte inputs, as we already knew?
- or because they are better for compressing information into the next stage, even if given coarser inputs?
We can disentangle these two hypotheses by simply applying an H-Net on data thatâs not byte-level.
This figure shows a 1-stage H-Net trained on top of standard BPE-tokenized inputs, with T2M2-M2T2 for example denoting 2 Transformer layers and 2 Mamba layers in the encoder (and the reverse in the decoder). Note that Transformer layers are 2x the size of Mamba layers here, so everything is parameter/compute matched.
This figure shows that it's indeed the second hypothesis that holds. Maybe SSMs really are doing something interesting that other models can't do, and perhaps compression is fundamental to intelligence.
Speculative decoding resembles an H-Net
Without getting too in the weeds, speculative decoding consists of a large model (usually called the "verification model") that we want to sample from, and a small model (usually called the "draft model") that'll help us sample from it faster. The decoding process basically looks like this:
- On every autoregressive step, the draft model will take a step to generate a new token.
- Every few steps, the verification model will verify the small modelâs sequence of tokens, accepting as many of them as it can.
At a high level, specdec improves generation speed by letting the large model only do a forward pass every few tokens (a bare-bones sketch of the loop follows the comparison below). But this is incredibly similar to the decoding process of an H-Net!
- The H-Net encoder/decoder networks take a step on every token.
- The H-Net main network takes a step every few tokens (on every chunk).
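To make the structural analogy concrete, here is a skeleton of the specdec loop; draft_model.sample_next and verify_model.verify are hypothetical interfaces for illustration, not any particular library's API.

```python
def speculative_decode(draft_model, verify_model, tokens, k=4, max_new=256):
    """Skeleton of speculative decoding. The draft model takes a cheap step
    on every token; the verification model runs one forward pass per block
    of k drafted tokens, just as the H-Net main network steps once per chunk
    while the encoder/decoder networks step on every token."""
    new_tokens = 0
    while new_tokens < max_new:
        # 1. Draft model proposes k tokens autoregressively (cheap, per-token steps).
        draft = list(tokens)
        for _ in range(k):
            draft.append(draft_model.sample_next(draft))   # hypothetical API
        # 2. Verification model checks all k proposals in one forward pass and
        #    accepts the longest prefix consistent with its own distribution,
        #    plus one corrected token.
        n_accepted, correction = verify_model.verify(tokens, draft[len(tokens):])  # hypothetical API
        tokens = draft[:len(tokens) + n_accepted] + [correction]
        new_tokens += n_accepted + 1
    return tokens
```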
Engineering challenges
Training
Training is more difficult than normal because sequences are dynamically subsampled, which causes load-balancing issues among other edge cases. You can engineer the pipeline to be reasonably efficient by incorporating dynamic packing and such. The current implementation is still a bit slower than isotropic models during training, but I'd expect there to be substantial room for improvement. There has been a lot of work on mixture-of-experts (MoE) in the last few years, and I expect a lot of general ideas will transfer to H-Nets.
Hybrid models
- Hybrid models combining linear layers with quadratic attention have become much more popular. But are simple interleaving strategies the most natural way to combine them?
- One nice thing about H-Nets is that they can hybridize linear and quadratic layers in a more elegant way:
- Linear layers go on the outside, both for efficiency and modeling reasons (as covered in Albert Gu's Tradeoffs post)
- Powerful quadratic attention layers can go on the inside, operating over higher levels of abstraction where they are most suited
- Figuring out the exact right combination of layers is pretty non-trivial
- For example, these were the conclusions found for a 2-stage H-Net (three sequence lengths):
- Outer: Pure Mamba layers perform best, and seem indispensable.
- Middle: After the outer layers have shrunk the sequence by a reasonable factor (almost 3x), this is much closer to tokenized language.
- It wouldn't have been a surprise if pure Transformer layers were fine here. But Mamba was still important, which validates that its effect is not just because it's good at high resolution, but because it's doing a form of active compression that benefits dynamic chunking.
- Inner: The innermost model has the most parameters and is essentially a standard isotropic language model operating on coarsely tokenized data (but with better "tokens" that are dynamic and learned from data!).
- In the paper, they stuck to pure Transformers because that was the main baseline.
- However, this is completely orthogonal to the rest of the H-Net design; there's an ablation showing that general findings for LM architectures still transfer, such as that hybrid main networks (3-to-1 Mamba-to-Transformer) still have somewhat better perplexity!
