Currently, three dominant paradigms are used for assessing the performance of IFT models:

Automatic evaluation against references. Metrics such as F1, ROUGE, and BERTScore compare the model-generated output to a small set of reference outputs. F1 and ROUGE are based on lexical overlap, while BERTScore is based on the cosine similarity between the generation and reference embeddings.
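To make the lexical-overlap idea concrete, here is a minimal sketch of a token-level F1 against a small reference set. The whitespace tokenization, lowercasing, and max-over-references aggregation are simplifying assumptions, not the exact recipe of any particular benchmark; ROUGE swaps the overlap statistic for n-gram/longest-common-subsequence counts, and BERTScore replaces it with embedding similarity.

```python
# Minimal sketch of reference-based lexical F1 (token overlap).
# Tokenization and normalization here are illustrative assumptions.
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a model generation and one reference."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def best_f1(prediction: str, references: list[str]) -> float:
    """Score against a small set of references by taking the maximum."""
    return max(token_f1(prediction, r) for r in references)


print(best_f1("Paris is the capital of France", ["The capital of France is Paris"]))
```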

Human pairwise evaluation. In this case, a human is given a pair of LM generations and asked to select the “better” one, and sometimes to indicate by how much it is better. As an example, Chatbot Arena 4 is an LM battleground where internet users engage in conversations with two LMs before rating them.

LM-judged pairwise evaluation. This method is similar to the one above but uses LM evaluators instead of humans. One such example is AlpacaEval,5 which uses LMs such as GPT-4 to score candidate models by comparing them to a reference model (e.g., text-davinci-003).
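The sketch below illustrates the idea of a pairwise LM judge, assuming the `openai` Python client (v1+). The prompt template, model name, and single-letter parsing are illustrative assumptions, not AlpacaEval's actual annotator configuration.

```python
# Sketch of an LM-as-judge pairwise comparison, assuming the `openai`
# Python client (>= 1.0). Prompt wording and letter parsing are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are evaluating two responses to the same instruction.

Instruction:
{instruction}

Response A:
{output_a}

Response B:
{output_b}

Which response better follows the instruction? Answer with a single letter, A or B."""


def judge_pair(instruction: str, output_a: str, output_b: str, model: str = "gpt-4") -> str:
    """Return 'A' or 'B' according to the judge model."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic-style judging
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                instruction=instruction, output_a=output_a, output_b=output_b
            ),
        }],
    )
    answer = response.choices[0].message.content.strip().upper()
    return "A" if answer.startswith("A") else "B"
```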

Datasets

  • Usually, prompts are taken from an “alignment dataset”
  • HELM Instruct uses Self-Instruct, Koala Eval, Vicuna Eval, Anthropic Red Team, Anthropic HH, Best ChatGPT Prompts, and OASST1

Automatic annotators

Important points to remember
  • How you ask the annotator to rate matters
    • AlpacaEval 2
      • asks the annotator to order the outputs by preference
      • uses temperature 0
      • randomizes the order of the outputs (see the sketch after this list)
      • modified the prompt to decrease the length bias to 0.68
    • Vicuna/lm-sys
      • asks the annotator for a score from 1–10 for each output
      • does not randomize the output order
      • asks for an explanation after the score
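The following is a hedged illustration of the output-order randomization mentioned for AlpacaEval 2: the candidate and the baseline are shuffled into the A/B slots before judging, and the verdict is mapped back afterwards so a fixed-position bias does not favor either model. The `judge` callable stands in for a pairwise judge such as the `judge_pair` sketch shown earlier; the function names are hypothetical.

```python
# Sketch of output-order randomization for pairwise LM judging.
import random
from typing import Callable


def judged_winner(
    instruction: str,
    candidate: str,
    baseline: str,
    judge: Callable[[str, str, str], str],  # e.g. judge_pair from the earlier sketch
    seed: int = 0,
) -> str:
    """Return 'candidate' or 'baseline' after randomizing the A/B presentation order."""
    rng = random.Random(seed)
    swap = rng.random() < 0.5  # coin flip: which model sits in slot A
    a, b = (baseline, candidate) if swap else (candidate, baseline)
    verdict = judge(instruction, a, b)  # judge returns 'A' or 'B'
    picked_candidate = (verdict == "A") != swap  # undo the swap
    return "candidate" if picked_candidate else "baseline"
```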