Observations
- space may be part of a token
- "127" may be a single token, while "677" may be split into two tokens
- "Egg" at the beginning of a sentence may be two tokens, while " Egg" (with a leading space) may be a single token. Ideally both would map to the same token, but this is not the case (see the tiktoken sketch after this list).
- Tokens cover longer stretches of text in English than in other languages, so non-English text needs more tokens and bloats up the context length.
- Indentation in Python can bloat up the context length (in the GPT-2 tokenizer, each space may become its own token).
- The GPT-4 tokenizer (cl100k_base) has a vocabulary of about 100k tokens, roughly double GPT-2's, so the same text compresses into fewer tokens.
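These splits are easy to check by running a few strings through tiktoken; a quick sketch, assuming the GPT-2 encoding (exact splits differ between tokenizers):

```python
# Inspect how a few strings split into tokens (GPT-2 encoding assumed).
import tiktoken

enc = tiktoken.get_encoding("gpt2")

for s in ["Egg", " Egg", "127", "677", "    if x:"]:
    ids = enc.encode(s)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{s!r:12} -> {len(ids)} token(s): {pieces}")
```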
You should start your sequences with a BOS or <|endoftext|> token to give attention a resting position when dealing with irrelevant tokens.
BPE
- The pair of tokens (initially bytes) that occurs most often in the sequence is replaced by a new token that we append to our vocabulary
- Repeat the process until the desired vocabulary size, i.e. the chosen number of merges, is reached
- Merges can be recursive, i.e. a newly minted token can itself appear in later merges (a minimal sketch follows this list)
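A minimal sketch of the merge loop on the raw UTF-8 bytes of a toy string; the helper names get_stats and merge are just illustrative, not any particular library's API:

```python
from collections import Counter

def get_stats(ids):
    # Count how often each adjacent pair of tokens occurs.
    return Counter(zip(ids, ids[1:]))

def merge(ids, pair, new_id):
    # Replace every occurrence of `pair` in `ids` with the new token id.
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

text = "aaabdaaabac aaab"
ids = list(text.encode("utf-8"))   # start from raw bytes (token ids 0..255)
num_merges = 3                     # in practice: target vocab size - 256
merges = {}
for j in range(num_merges):
    stats = get_stats(ids)
    pair = max(stats, key=stats.get)   # most frequent adjacent pair
    new_id = 256 + j                   # append a new token to the vocabulary
    ids = merge(ids, pair, new_id)
    merges[pair] = new_id
    print(f"merge {pair} -> {new_id}, sequence length now {len(ids)}")
```

On this toy string the second merge already combines the first new token with a raw byte, which is the "recursive" behaviour mentioned above.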
SentencePiece
- Runs BPE over Unicode code points
Tiktokenizer
- Works at the UTF-8 byte level (byte-level BPE, as in tiktoken); see the code-point vs. byte comparison below
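The difference shows up on non-ASCII text: one code point can be several UTF-8 bytes, so the two approaches see different base sequences. A quick illustration:

```python
s = "héllo 안녕"
print([ord(c) for c in s])       # Unicode code points (what SentencePiece merges over)
print(list(s.encode("utf-8")))   # UTF-8 bytes (what byte-level BPE merges over)
```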
Special tokens
- GPT-2 has 50,257 tokens ⇒ 256 raw bytes + 50,000 merges + 1 special token (<|endoftext|>)
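Both numbers can be verified with tiktoken; in the GPT-2 encoding, <|endoftext|> is the last token id (50256):

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")
print(enc.n_vocab)  # 50257
# Special tokens must be explicitly allowed when encoding:
print(enc.encode("<|endoftext|>", allowed_special={"<|endoftext|>"}))  # [50256]
```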
Vocab size
- Typically in the tens of thousands (e.g. ~50k for GPT-2, ~100k for GPT-4)
- A sweet spot: too large a vocabulary and individual tokens appear too rarely to train good embeddings; too small and sequences get too long
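One concrete cost of a larger vocabulary is the token embedding table (and the matching output layer), which grows linearly with it. A rough sketch, assuming a hypothetical model width of 768:

```python
d_model = 768  # hypothetical embedding dimension (roughly GPT-2 small scale)
for vocab_size in (32_000, 50_257, 100_000, 256_000):
    embed_params = vocab_size * d_model   # input embedding table alone
    print(f"{vocab_size:>7} tokens -> {embed_params / 1e6:6.1f}M embedding parameters")
```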
Weirdness
- If a word like “.DefaultCellStyle” is compressed into a single token, it becomes difficult for the model to do character-level manipulation on it (e.g. spelling it out or reversing it); a quick check is sketched below
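A quick check of how such a word splits, assuming the GPT-4 cl100k_base encoding:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode(".DefaultCellStyle")
print(len(ids), [enc.decode([i]) for i in ids])
# If this comes out as a single token, the model never "sees" the individual
# characters, which is why spelling or reversing such strings is hard.
```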
Multilinguality
- We should aim for similar sequence lengths for equivalent sentences in Italian, German and French.
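One way to measure this is to count tokens for rough translations of the same sentence; a sketch assuming the cl100k_base encoding (the example sentences are just illustrative):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
sentences = {
    "English": "Hello, how are you today?",
    "Italian": "Ciao, come stai oggi?",
    "German":  "Hallo, wie geht es dir heute?",
    "French":  "Bonjour, comment vas-tu aujourd'hui ?",
}
for lang, s in sentences.items():
    print(f"{lang:8} {len(enc.encode(s)):2d} tokens  {s}")
```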