Perplexity is defined as the exponentiated average negative log-likelihood of a sequence. If we have a tokenized sequence $X = (x_0, x_1, \dots, x_t)$, then the perplexity of $X$ is

$$\mathrm{PPL}(X) = \exp\left\{ -\frac{1}{t} \sum_{i=1}^{t} \log p_\theta(x_i \mid x_{<i}) \right\}$$
This is also equivalent to the exponentiation of the cross-entropy between the data and model predictions.
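As a minimal sketch, perplexity can be computed directly from per-token log-probabilities; the values below are hypothetical stand-ins for the $\log p_\theta(x_i \mid x_{<i})$ a model would produce:

```python
import math

# Per-token log-probabilities log p_theta(x_i | x_{<i}) for a toy sequence
# (hypothetical values; in practice these come from a language model).
token_log_probs = [-2.1, -0.7, -1.3, -0.4, -3.0]

# Average negative log-likelihood = cross-entropy of the model on this sequence.
cross_entropy = -sum(token_log_probs) / len(token_log_probs)

# Perplexity is the exponentiated average negative log-likelihood.
perplexity = math.exp(cross_entropy)
print(f"cross-entropy: {cross_entropy:.3f} nats, perplexity: {perplexity:.2f}")
```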
Disclaimer: Perplexity is affected by (1) the tokenizer, (2) the vocabulary size, and (3) the context length. The perplexity of a character-level language model can be much smaller than that of a word-level model, but this does not mean the character-level model is better. We therefore need to report whether perplexity is measured at the character, subword, or word level.
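To see why granularity matters, here is a small sketch with assumed corpus statistics: the same total negative log-likelihood gives very different perplexities when normalized per word versus per character.

```python
import math

# Hypothetical corpus statistics (assumed values for illustration).
total_nll = 50_000.0   # total -log p over the corpus, in nats
n_words = 10_000       # number of word-level tokens
n_chars = 55_000       # number of characters (~5.5 chars per word)

ppl_word = math.exp(total_nll / n_words)   # word-level perplexity
ppl_char = math.exp(total_nll / n_chars)   # character-level perplexity

print(f"word-level PPL: {ppl_word:.1f}")   # ~148
print(f"char-level PPL: {ppl_char:.2f}")   # ~2.48
```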
Cross-entropy
Let P be the empirical distribution of the language (e.g., text from the internet). It can assign a probability to a sequence of words or characters, e.g. P("I love my cat") = some number. A language model learns a distribution Q that approximates this empirical distribution P.
Cross entropy: $H(P, Q) = \mathbb{E}_P[-\log Q] = H(P) + D_{\mathrm{KL}}(P \parallel Q)$
$H(P)$ = the average number of bits needed to encode outcomes of P using the code optimized for P.
$D_{\mathrm{KL}}(P \parallel Q)$ = the average number of extra bits required to encode outcomes of P using the code optimized for Q instead.
Since $H(P)$ is fixed, minimizing the cross entropy is equivalent to minimizing the KL divergence. The lower bound on the cross entropy is thus the entropy $H(P)$, i.e., the intrinsic compressibility of the language.
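A small numeric sketch with toy distributions P and Q (hypothetical values) checks the decomposition $H(P, Q) = H(P) + D_{\mathrm{KL}}(P \parallel Q)$ and shows that the cross entropy never drops below $H(P)$:

```python
import math

# Toy distributions over a 3-symbol vocabulary (hypothetical values).
P = [0.5, 0.3, 0.2]   # "true" data distribution
Q = [0.4, 0.4, 0.2]   # model distribution

entropy_P = -sum(p * math.log(p) for p in P)                    # H(P)
cross_entropy = -sum(p * math.log(q) for p, q in zip(P, Q))     # H(P, Q)
kl_divergence = sum(p * math.log(p / q) for p, q in zip(P, Q))  # D_KL(P || Q)

# H(P, Q) = H(P) + D_KL(P || Q), so cross-entropy is lower-bounded by H(P).
print(f"H(P)       = {entropy_P:.4f}")
print(f"H(P, Q)    = {cross_entropy:.4f}")
print(f"H(P) + KL  = {entropy_P + kl_divergence:.4f}")
```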