Perplexity is defined as the exponentiated average negative log-likelihood of a sequence. If we have a tokenized sequence $X = (x_0, x_1, \ldots, x_t)$, then the perplexity of $X$ is

$$\mathrm{PPL}(X) = \exp\left\{ -\frac{1}{t} \sum_{i=1}^{t} \log p_\theta(x_i \mid x_{<i}) \right\}$$
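To make the definition concrete, here is a minimal sketch in plain Python; the per-token log-probabilities are made up for illustration, whereas in practice they would come from scoring the sequence with a causal language model:

```python
import math

# Hypothetical per-token conditional log-probabilities log p_theta(x_i | x_<i),
# e.g. obtained by scoring a sequence with a causal language model.
token_log_probs = [-2.1, -0.7, -1.5, -3.0, -0.4]

# Average negative log-likelihood over the t tokens.
avg_nll = -sum(token_log_probs) / len(token_log_probs)

# Perplexity is the exponentiated average negative log-likelihood.
perplexity = math.exp(avg_nll)
print(perplexity)  # ~4.66 for the numbers above
```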
This is also equivalent to the exponentiation of the cross-entropy between the data and model predictions.
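One way to write this relation explicitly: the average negative log-likelihood in the exponent is exactly the cross-entropy of the model on the sequence, so

$$\mathrm{PPL}(X) = \exp\bigl(H(X)\bigr), \qquad H(X) = -\frac{1}{t} \sum_{i=1}^{t} \log p_\theta(x_i \mid x_{<i})$$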
Disclaimer: Perplexity is affected by (1) the tokenizer, (2) the vocabulary size, and (3) the context length. A character-level language model can have a much lower perplexity than a word-level model, but that does not mean the character-level model is better. We therefore need to report whether a perplexity is at the character, subword, or word level.
Cross-entropy
Let P be the empirical distribution of the language (e.g. text on the internet). It can assign a probability to any sequence of words or characters, e.g. P("I love my cat") = some number. The language model learns a distribution Q that models this empirical distribution P.
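With P and Q defined this way, the cross-entropy (written here in one standard form, as an expectation under P) is

$$H(P, Q) = -\sum_{x} P(x) \log Q(x) = \mathbb{E}_{x \sim P}\left[-\log Q(x)\right]$$

Since P cannot be evaluated directly, the per-token average negative log-likelihood on held-out text serves as an empirical estimate of this expectation, which is why exponentiating it gives the perplexity defined above.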