Evaluation metrics

Perplexity

  • Perplexity is defined as the exponentiated average negative log-likelihood of a sequence. If we have a tokenized sequence $X = (x_1, x_2, \dots, x_t)$, then the perplexity of $X$ is $\mathrm{PPL}(X) = \exp\left(-\frac{1}{t} \sum_{i=1}^{t} \log p_\theta(x_i \mid x_{<i})\right)$ (see the sketch after this list).
  • This is also equivalent to the exponentiation of the cross-entropy between the data and model predictions.
  • Disclaimer: perplexity is affected by (1) the tokenizer, (2) the vocabulary size, and (3) the context length. A character-level language model can have a much lower perplexity than a word-level model, but that does not mean the character-level model is better; the two numbers are not directly comparable. We therefore need to report whether perplexity is measured at the character, subword, or word level.
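
A minimal sketch of the perplexity formula above, assuming we already have per-token probabilities $p_\theta(x_i \mid x_{<i})$ from some autoregressive model (the numbers here are made up for illustration):

```python
import math

def perplexity(token_probs):
    """Perplexity from a list of per-token probabilities p(x_i | x_<i)."""
    # Average negative log-likelihood over the sequence.
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    # Exponentiate to get perplexity.
    return math.exp(avg_nll)

# Hypothetical per-token probabilities for a 4-token sequence.
probs = [0.2, 0.5, 0.1, 0.4]
print(perplexity(probs))  # lower is better; equals 1.0 only if every token gets probability 1
```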

Cross-entropy

  • Let $P$ be the empirical distribution of the language (e.g., text on the internet). It can thus assign a probability to any sequence of words or characters, e.g., $P(\text{"I love my cat"})$ is some number. The language model learns a distribution $Q$ that approximates this empirical distribution $P$.
  • Cross entropy $H(P, Q) = \mathbb{E}_{x \sim P}[-\log Q(x)] = H(P) + D_{\mathrm{KL}}(P \parallel Q)$
    • $H(P)$ = the average number of bits needed to encode any possible outcome of $P$ using the code optimized for $P$.
    • $D_{\mathrm{KL}}(P \parallel Q)$ = the number of extra bits required to encode any possible outcome of $P$ using the code optimized for $Q$.
  • Since $H(P)$ does not depend on the model, minimizing the cross entropy minimizes the KL divergence $D_{\mathrm{KL}}(P \parallel Q)$. The lower bound of the cross entropy is thus the entropy of $P$, i.e., the compressibility of the language.
  • Connection to perplexity: perplexity is the exponentiation of the cross entropy, $\mathrm{PPL} = e^{H(P, Q)}$ when the cross entropy is in nats (or $2^{H(P, Q)}$ when it is in bits); see the toy check below.
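
A toy numeric check of the decomposition and the perplexity connection, using made-up distributions $P$ and $Q$ over a three-symbol vocabulary; natural logs are used, so the quantities are in nats rather than bits:

```python
import math

# P: "true" (empirical) distribution; Q: model distribution. Made-up numbers.
P = [0.5, 0.3, 0.2]
Q = [0.4, 0.4, 0.2]

entropy_P = -sum(p * math.log(p) for p in P)                 # H(P)
cross_entropy = -sum(p * math.log(q) for p, q in zip(P, Q))  # H(P, Q)
kl_P_Q = sum(p * math.log(p / q) for p, q in zip(P, Q))      # D_KL(P || Q)

# Decomposition: H(P, Q) = H(P) + D_KL(P || Q), so H(P) lower-bounds the cross entropy.
assert abs(cross_entropy - (entropy_P + kl_P_Q)) < 1e-12

# Perplexity as the exponentiation of the cross entropy (in nats here).
print(math.exp(cross_entropy))
```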