https://blog.eleuther.ai/multiple-choice-normalization/

possible pitfalls: https://github.com/huggingface/blog/blob/main/evaluating-mmlu-leaderboard.md

Let $x$ be the prompt, and $y_i$ be the $i$th possible continuation with a token length of $n_i$. There are several ways to use a language model to rank multiple possible continuations to a prompt. Since the language model only gives (log) probabilities for the next token given the context (i.e. $p(x_t \mid x_{<t})$), there is ambiguity in how to score arbitrary multi-token continuations. The following are several possible ways to resolve this problem:

Unnormalized

  • The score of continuation $y_i$ is just the sum of the log-likelihoods of its tokens: $\mathrm{score}(y_i) = \log p(y_i \mid x) = \sum_{j=1}^{n_i} \log p(y_i^j \mid x, y_i^{<j})$
  • Biased towards shorter responses, as longer sequences tend to have lower log probabilities
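As a minimal sketch (the helper name and the toy log-probability lists are illustrative, not from the original post), the unnormalized score just sums whatever per-token log-probabilities the model returns:

```python
def unnormalized_score(token_logprobs):
    """Sum of per-token log-likelihoods; equals log p(y_i | x)."""
    return sum(token_logprobs)

# Toy per-token log-probs for two candidate continuations.
short = [-0.5, -0.7]              # 2 tokens
long = [-0.5, -0.7, -0.6, -0.4]   # 4 tokens, each individually likely

# The longer continuation scores lower purely because it has more
# tokens, illustrating the bias towards shorter responses.
print(unnormalized_score(short))  # approximately -1.2
print(unnormalized_score(long))   # approximately -2.2
```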

Token-normalized

  • The average log probability per token: $\mathrm{score}(y_i) = \frac{\log p(y_i \mid x)}{n_i}$
  • Not tokenization-independent: the same string split into a different number of tokens by a different tokenizer receives a different score
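A sketch of the same idea (again with illustrative toy numbers): dividing by token count makes the score depend on how the tokenizer happened to split the text, which is the tokenization-dependence problem above.

```python
def token_normalized_score(token_logprobs):
    """Average log probability per token: log p(y_i | x) / n_i."""
    return sum(token_logprobs) / len(token_logprobs)

# The same total log-probability mass, tokenized two different ways,
# gets different scores -- this metric is not tokenization-independent.
coarse = [-1.2]         # one tokenizer keeps the continuation as 1 token
fine = [-0.5, -0.7]     # another splits it into 2 tokens
print(token_normalized_score(coarse))  # approximately -1.2
print(token_normalized_score(fine))    # approximately -0.6
```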

Byte-length normalized

  • $\mathrm{score}(y_i) = \frac{\log p(y_i \mid x)}{\sum_{j=1}^{n_i} L_j}$, where $L_j$ is the number of bytes represented by token $j$.
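A sketch of byte-length normalization, taking $L_j$ as the UTF-8 byte length of each token's string form (the helper and toy numbers are illustrative). Because the total byte count depends only on the text, not on how it was split, the denominator is the same under any tokenization:

```python
def byte_normalized_score(tokens, token_logprobs):
    """log p(y_i | x) divided by the total byte length of the continuation.

    L_j is taken as the UTF-8 byte length of token j's string form, so the
    denominator depends only on the text, not on the tokenizer.
    """
    total_bytes = sum(len(t.encode("utf-8")) for t in tokens)
    return sum(token_logprobs) / total_bytes

# Two tokenizations of the same 8-byte string " the cat": the byte
# denominator (8) is identical, so equal total log-probs give equal scores.
print(byte_normalized_score([" the cat"], [-1.2]))        # approximately -0.15
print(byte_normalized_score([" the", " cat"], [-0.5, -0.7]))  # approximately -0.15
```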

Unconditional likelihood normalized

  • $\mathrm{score}(y_i) = \log \frac{p(y_i \mid x)}{p(y_i)}$
  • Intuitively, this approach measures how much the prompt increases the model’s probability of outputting each continuation, relative to the probability of the model unconditionally producing that continuation. Dividing by $p(y_i)$ reduces the influence of continuations built from very likely tokens, e.g. “the”.
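The ratio above can be sketched in log space as a simple difference of two scores (the function name and the toy probabilities are illustrative assumptions): a continuation that is unconditionally likely gets its prior likelihood discounted.

```python
import math

def pmi_score(conditional_logprob, unconditional_logprob):
    """log [ p(y_i | x) / p(y_i) ]: how much the prompt raises p(y_i)."""
    return conditional_logprob - unconditional_logprob

# A continuation made of common words can be unconditionally likely;
# the PMI-style score rewards the *lift* the prompt provides instead.
generic = pmi_score(math.log(0.20), math.log(0.10))   # prompt doubles its prob
specific = pmi_score(math.log(0.08), math.log(0.01))  # prompt raises it 8x
print(specific > generic)  # True
```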