Observations

  • A space may be part of a token (leading whitespace tends to get absorbed into the following word's token)
  • “127” may be a single token, while “677” may be two tokens
  • “Egg” at the beginning of a sentence may be two tokens, but “ Egg” (with a leading space) may be a single token. Ideally both would map to the same token, but this is not the case.
  • Tokens cover more text in English than in most other languages, so non-English text is split into more tokens and bloats up the context length.
  • Indentation in Python can bloat up the context length
  • The GPT-4 tokenizer has about 100k tokens (the sketch below reproduces a few of these examples).
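
These observations can be reproduced with a tokenizer library such as tiktoken (an assumption here, not something the notes prescribe); the exact splits depend on which tokenizer you load, cl100k_base (GPT-4) is just one example:

```python
# Sketch: inspecting the observations above with tiktoken (assumes `pip install tiktoken`).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for s in ["Egg", " Egg", "127", "677", "    if x:"]:
    ids = enc.encode(s)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{s!r:12} -> {len(ids)} token(s): {pieces}")
```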

You should start your sequences with a BOS token (or <|endoftext|>) so that attention has a resting position to attend to when none of the preceding tokens are relevant.
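
A minimal sketch of one way to do this, assuming tiktoken's GPT-2 encoding (which exposes the <|endoftext|> id as `eot_token`):

```python
# Sketch: prepending <|endoftext|> as a BOS-style resting token (assumes tiktoken).
import tiktoken

enc = tiktoken.get_encoding("gpt2")
text = "some document text"

# Prepend the <|endoftext|> id so attention always has a first token to rest on.
ids = [enc.eot_token] + enc.encode(text)
print(enc.eot_token, ids[:5])  # eot_token is 50256 in the GPT-2 vocabulary
```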

BPE

  • The byte pair that occurs most often in the sequence is replaced by a new token that we append to our vocabulary
  • Repeat the process until the desired vocabulary size (i.e. the target number of merges) is reached
  • It is possible to have recursive tokens, i.e. a newly merged token can itself take part in later merges (see the training sketch below)
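
A minimal from-scratch sketch of the training loop described above, run on a toy string; real tokenizers (e.g. GPT-2's) additionally pre-split the text with a regex before merging:

```python
# Sketch: byte-level BPE training on a single string (illustrative, not production code).
def get_stats(ids):
    """Count how often each adjacent pair occurs in the sequence."""
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` in `ids` with the new token id."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

text = "aaabdaaabac"                  # toy example
ids = list(text.encode("utf-8"))      # start from raw bytes (ids 0..255)
num_merges = 3                        # target vocab size = 256 + num_merges
merges = {}
for step in range(num_merges):
    stats = get_stats(ids)
    if not stats:
        break
    pair = max(stats, key=stats.get)  # most frequent pair
    new_id = 256 + step               # recursive: new ids can appear in later pairs
    ids = merge(ids, pair, new_id)
    merges[pair] = new_id
    print(f"merge {pair} -> {new_id}, sequence is now {ids}")
```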

Sentencepiece

  • Runs BPE on Unicode code points rather than raw bytes (with an optional byte fallback for rare code points); a training sketch is shown below
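
A hedged sketch of what training a SentencePiece BPE model looks like; `corpus.txt`, the vocab size, and the flags shown are placeholders and assumptions, not anything prescribed by the notes:

```python
# Sketch: training a SentencePiece BPE tokenizer (assumes `pip install sentencepiece`).
# "corpus.txt" is a placeholder for your training text, one sentence per line.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus.txt",          # hypothetical training corpus
    model_prefix="tok",          # writes tok.model and tok.vocab
    vocab_size=4000,             # placeholder; pick based on your data
    model_type="bpe",
    character_coverage=0.9995,   # rare code points are left out of the vocab...
    byte_fallback=True,          # ...and are then encoded as raw bytes instead
)

sp = spm.SentencePieceProcessor(model_file="tok.model")
print(sp.encode("Hello world", out_type=str))  # pieces, e.g. ['▁Hello', '▁world']
```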

Tiktokenizer

  • Works at the UTF-8 byte level: the GPT-style BPE (as implemented in tiktoken) merges chunks of the UTF-8 byte stream, so the base vocabulary is the 256 possible byte values (see the sketch below)
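
A small sketch of inspecting this byte-level behaviour with tiktoken; `decode_single_token_bytes` returns the raw UTF-8 bytes behind each token id:

```python
# Sketch: the GPT-4 tokenizer operates on UTF-8 bytes (assumes tiktoken).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for s in ["hello", "héllo", "こんにちは"]:
    ids = enc.encode(s)
    # Each token id maps back to a chunk of the UTF-8 byte stream.
    chunks = [enc.decode_single_token_bytes(i) for i in ids]
    print(f"{s!r}: {len(s.encode('utf-8'))} bytes -> {len(ids)} tokens {chunks}")
```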

Special tokens

  • GPT-2 has 50’257 tokens ⇒ 256 raw bytes + 50’000 merges + <|endoftext|>
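
The breakdown above can be sanity-checked with tiktoken's GPT-2 encoding; note that tiktoken refuses to encode special tokens unless they are explicitly allowed:

```python
# Sketch: checking the GPT-2 vocabulary breakdown with tiktoken.
import tiktoken

enc = tiktoken.get_encoding("gpt2")
print(enc.n_vocab)  # 50257 = 256 raw bytes + 50000 merges + 1 special token

# Special tokens must be explicitly allowed before they are encoded as such.
ids = enc.encode("<|endoftext|>", allowed_special={"<|endoftext|>"})
print(ids)  # [50256]
```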

Vocab size

  • Typically in the high tens of thousands, up to around 100k
  • Sweet spot in a trade-off: too small a vocab and sequences become too long; too large and rare tokens show up too infrequently to be trained well, while the embedding and softmax layers grow (see the quick parameter-count sketch below)
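
For a rough sense of the model-size side of this trade-off, here is a small back-of-the-envelope sketch; d_model = 768 (GPT-2 small's width) is just an example value:

```python
# Sketch: how vocab size affects the token embedding table (and a tied LM head).
d_model = 768  # GPT-2 small's embedding width, used here as an example
for vocab_size in (32_000, 50_257, 100_277):
    params = vocab_size * d_model  # one d_model-dim embedding row per token
    print(f"vocab {vocab_size:>7}: {params / 1e6:5.1f}M embedding parameters")
```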

Weirdness

  • If a word like “.DefaultCellStyle” is compressed into a single token, it becomes difficult for the model to do character-level manipulation on it (e.g. spelling it out or reversing it), because the model only ever sees one opaque id rather than the individual characters.
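
A quick way to check how such an identifier is split, assuming the GPT-4 tokenizer via tiktoken:

```python
# Sketch: checking how a long identifier is tokenized (assumes tiktoken).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode(".DefaultCellStyle")
print(len(ids), [enc.decode([i]) for i in ids])
```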

Multilinguality

  • We should aim for similar sequence lengths for equivalent sentences in Italian, German, and French, which in practice means including enough non-English text when training the tokenizer (the sketch below compares token counts for one example sentence).
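
A quick check of this with the GPT-4 tokenizer via tiktoken; the translations below are illustrative examples, not from the notes:

```python
# Sketch: comparing token counts for roughly equivalent sentences (assumes tiktoken).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
sentences = {
    "English": "The cat is sitting on the mat.",
    "Italian": "Il gatto è seduto sul tappeto.",
    "German":  "Die Katze sitzt auf der Matte.",
    "French":  "Le chat est assis sur le tapis.",
}
for lang, s in sentences.items():
    print(f"{lang:8} {len(enc.encode(s)):2d} tokens  {s}")
```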