Perplexity is defined as the exponentiated average negative log-likelihood of a sequence. If we have a tokenized sequence $X = (x_0, x_1, \dots, x_t)$, then the perplexity of $X$ is

$$\mathrm{PPL}(X) = \exp\left\{ -\frac{1}{t} \sum_{i=1}^{t} \log p_\theta(x_i \mid x_{<i}) \right\}$$
This is also equivalent to the exponentiation of the cross-entropy between the data and model predictions.
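As a minimal sketch, perplexity can be computed directly from per-token log-probabilities; the values below are hypothetical stand-ins for the $\log p_\theta(x_i \mid x_{<i})$ a model would produce:

```python
import math

# Per-token log-probabilities log p_theta(x_i | x_{<i}) for a toy sequence
# (hypothetical values; in practice these come from a language model).
token_log_probs = [-2.1, -0.7, -1.3, -0.4, -3.0]

# Average negative log-likelihood = cross-entropy of the model on this sequence.
cross_entropy = -sum(token_log_probs) / len(token_log_probs)

# Perplexity is the exponentiated average negative log-likelihood.
perplexity = math.exp(cross_entropy)
print(f"cross-entropy: {cross_entropy:.3f} nats, perplexity: {perplexity:.2f}")
```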
Disclaimer: Perplexity is affected by (1) the tokenizer, (2) the vocabulary size, and (3) the context length. The perplexity of a character-level language model can be much smaller than that of a word-level model, but this does not mean the character-level model is better. We therefore need to report whether perplexity is measured at the character, subword, or word level.
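To see why granularity matters, here is a small sketch with assumed corpus statistics: the same total negative log-likelihood gives very different perplexities when normalized per word versus per character.

```python
import math

# Hypothetical corpus statistics (assumed values for illustration).
total_nll = 50_000.0   # total -log p over the corpus, in nats
n_words = 10_000       # number of word-level tokens
n_chars = 55_000       # number of characters (~5.5 chars per word)

ppl_word = math.exp(total_nll / n_words)   # word-level perplexity
ppl_char = math.exp(total_nll / n_chars)   # character-level perplexity

print(f"word-level PPL: {ppl_word:.1f}")   # ~148
print(f"char-level PPL: {ppl_char:.2f}")   # ~2.48
```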
Cross-entropy
Let P be the empirical distribution of the language (e.g., text from the internet). It can assign a probability to a sequence of words or characters, e.g. P("I love my cat") = some number. A language model learns a distribution Q that approximates this empirical distribution P.
Cross entropy: $H(P, Q) = \mathbb{E}_P[-\log Q] = H(P) + D_{\mathrm{KL}}(P \parallel Q)$
$H(P)$ = the average number of bits needed to encode outcomes of P using the code optimized for P.
$D_{\mathrm{KL}}(P \parallel Q)$ = the average number of extra bits required to encode outcomes of P using the code optimized for Q instead.
Since $H(P)$ is fixed, minimizing the cross entropy is equivalent to minimizing the KL divergence. The lower bound on the cross entropy is thus the entropy $H(P)$, i.e., the intrinsic compressibility of the language.
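A small numeric sketch with toy distributions P and Q (hypothetical values) checks the decomposition $H(P, Q) = H(P) + D_{\mathrm{KL}}(P \parallel Q)$ and shows that the cross entropy never drops below $H(P)$:

```python
import math

# Toy distributions over a 3-symbol vocabulary (hypothetical values).
P = [0.5, 0.3, 0.2]   # "true" data distribution
Q = [0.4, 0.4, 0.2]   # model distribution

entropy_P = -sum(p * math.log(p) for p in P)                    # H(P)
cross_entropy = -sum(p * math.log(q) for p, q in zip(P, Q))     # H(P, Q)
kl_divergence = sum(p * math.log(p / q) for p, q in zip(P, Q))  # D_KL(P || Q)

# H(P, Q) = H(P) + D_KL(P || Q), so cross-entropy is lower-bounded by H(P).
print(f"H(P)       = {entropy_P:.4f}")
print(f"H(P, Q)    = {cross_entropy:.4f}")
print(f"H(P) + KL  = {entropy_P + kl_divergence:.4f}")
```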