Observations
- space may be part of a token
- "127" may be a single token, while "677" may be split into two tokens
- "Egg" at the beginning of a sentence may be two tokens, while " Egg" (with a leading space) may be a single token. Ideally both would map to the same token, but this is not the case (see the tiktoken sketch after this list).
- Tokens cover longer stretches of text in English than in other languages, so non-English text needs more tokens and bloats up the context length.
- Indentation in Python can bloat up the context length (in the GPT-2 tokenizer, each space may become its own token).
- The GPT-4 tokenizer (cl100k_base) has a vocabulary of about 100k tokens, roughly double GPT-2's, so the same text compresses into fewer tokens.
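These splits are easy to check by running a few strings through tiktoken; a quick sketch, assuming the GPT-2 encoding (exact splits differ between tokenizers):

```python
# Inspect how a few strings split into tokens (GPT-2 encoding assumed).
import tiktoken

enc = tiktoken.get_encoding("gpt2")

for s in ["Egg", " Egg", "127", "677", "    if x:"]:
    ids = enc.encode(s)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{s!r:12} -> {len(ids)} token(s): {pieces}")
```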
You should start your sequences with a BOS or <|endoftext|> token to give attention a resting position when dealing with irrelevant tokens.
BPE
- The pair of tokens (initially bytes) that occurs most often in the sequence is replaced by a new token that we append to our vocabulary
- Repeat the process until the desired vocabulary size, i.e. the chosen number of merges, is reached
- Merges can be recursive, i.e. a newly minted token can itself appear in later merges (a minimal sketch follows this list)
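A minimal sketch of the merge loop on the raw UTF-8 bytes of a toy string; the helper names get_stats and merge are just illustrative, not any particular library's API:

```python
from collections import Counter

def get_stats(ids):
    # Count how often each adjacent pair of tokens occurs.
    return Counter(zip(ids, ids[1:]))

def merge(ids, pair, new_id):
    # Replace every occurrence of `pair` in `ids` with the new token id.
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

text = "aaabdaaabac aaab"
ids = list(text.encode("utf-8"))   # start from raw bytes (token ids 0..255)
num_merges = 3                     # in practice: target vocab size - 256
merges = {}
for j in range(num_merges):
    stats = get_stats(ids)
    pair = max(stats, key=stats.get)   # most frequent adjacent pair
    new_id = 256 + j                   # append a new token to the vocabulary
    ids = merge(ids, pair, new_id)
    merges[pair] = new_id
    print(f"merge {pair} -> {new_id}, sequence length now {len(ids)}")
```

On this toy string the second merge already combines the first new token with a raw byte, which is the "recursive" behaviour mentioned above.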
SentencePiece
- Runs BPE over Unicode code points
Tiktokenizer
- Works at the UTF-8 byte level (byte-level BPE, as in tiktoken); see the code-point vs. byte comparison below
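The difference shows up on non-ASCII text: one code point can be several UTF-8 bytes, so the two approaches see different base sequences. A quick illustration:

```python
s = "héllo 안녕"
print([ord(c) for c in s])       # Unicode code points (what SentencePiece merges over)
print(list(s.encode("utf-8")))   # UTF-8 bytes (what byte-level BPE merges over)
```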
Special tokens
- GPT-2 has 50,257 tokens ⇒ 256 raw bytes + 50,000 merges + 1 special token (<|endoftext|>)
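Both numbers can be verified with tiktoken; in the GPT-2 encoding, <|endoftext|> is the last token id (50256):

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")
print(enc.n_vocab)  # 50257
# Special tokens must be explicitly allowed when encoding:
print(enc.encode("<|endoftext|>", allowed_special={"<|endoftext|>"}))  # [50256]
```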
Vocab size
- Typically in the tens of thousands (e.g. ~50k for GPT-2, ~100k for GPT-4)
- A sweet spot: too large a vocabulary and individual tokens appear too rarely to train good embeddings; too small and sequences get too long
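One concrete cost of a larger vocabulary is the token embedding table (and the matching output layer), which grows linearly with it. A rough sketch, assuming a hypothetical model width of 768:

```python
d_model = 768  # hypothetical embedding dimension (roughly GPT-2 small scale)
for vocab_size in (32_000, 50_257, 100_000, 256_000):
    embed_params = vocab_size * d_model   # input embedding table alone
    print(f"{vocab_size:>7} tokens -> {embed_params / 1e6:6.1f}M embedding parameters")
```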
Weirdness
- If a word like “.DefaultCellStyle” is compressed into a single token, it becomes difficult for the model to do character-level manipulation on it (e.g. spelling it out or reversing it); a quick check is sketched below
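A quick check of how such a word splits, assuming the GPT-4 cl100k_base encoding:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode(".DefaultCellStyle")
print(len(ids), [enc.decode([i]) for i in ids])
# If this comes out as a single token, the model never "sees" the individual
# characters, which is why spelling or reversing such strings is hard.
```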
Multilinguality
- We should aim for similar sequence lengths for equivalent sentences in Italian, German and French.
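One way to measure this is to count tokens for rough translations of the same sentence; a sketch assuming the cl100k_base encoding (the example sentences are just illustrative):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
sentences = {
    "English": "Hello, how are you today?",
    "Italian": "Ciao, come stai oggi?",
    "German":  "Hallo, wie geht es dir heute?",
    "French":  "Bonjour, comment vas-tu aujourd'hui ?",
}
for lang, s in sentences.items():
    print(f"{lang:8} {len(enc.encode(s)):2d} tokens  {s}")
```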