Currently, three dominant paradigms are used for assessing the performance of IFT models:

Automatic evaluation against references. Metrics such as F1, ROUGE, and BERTScore compare the model-generated output to a small set of reference outputs. F1 and ROUGE are based on lexical overlap, while BERTScore is based on the cosine similarity between the generation and reference embeddings.
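To make the lexical-overlap idea concrete, here is a minimal sketch of a token-level F1 against a small reference set. The whitespace tokenization, lowercasing, and max-over-references aggregation are simplifying assumptions, not the exact recipe of any particular benchmark; ROUGE swaps the overlap statistic for n-gram/longest-common-subsequence counts, and BERTScore replaces it with embedding similarity.

```python
# Minimal sketch of reference-based lexical F1 (token overlap).
# Tokenization and normalization here are illustrative assumptions.
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a model generation and one reference."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def best_f1(prediction: str, references: list[str]) -> float:
    """Score against a small set of references by taking the maximum."""
    return max(token_f1(prediction, r) for r in references)


print(best_f1("Paris is the capital of France", ["The capital of France is Paris"]))
```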

Human pairwise evaluation. In this case, a human is given a pair of LM generations and asked to select the “better” one, and sometimes to indicate by how much it is better. As an example, Chatbot Arena 4 is an LM battleground where internet users engage in conversations with two LMs before rating them.

LM-judged pairwise evaluation. This method is similar to the one above but uses LM evaluators instead of humans. One such example is AlpacaEval,5 which uses LMs such as GPT-4 to score candidate models by comparing them to a reference model (e.g., text-davinci-003).
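The sketch below illustrates the idea of a pairwise LM judge, assuming the `openai` Python client (v1+). The prompt template, model name, and single-letter parsing are illustrative assumptions, not AlpacaEval's actual annotator configuration.

```python
# Sketch of an LM-as-judge pairwise comparison, assuming the `openai`
# Python client (>= 1.0). Prompt wording and letter parsing are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are evaluating two responses to the same instruction.

Instruction:
{instruction}

Response A:
{output_a}

Response B:
{output_b}

Which response better follows the instruction? Answer with a single letter, A or B."""


def judge_pair(instruction: str, output_a: str, output_b: str, model: str = "gpt-4") -> str:
    """Return 'A' or 'B' according to the judge model."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic-style judging
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                instruction=instruction, output_a=output_a, output_b=output_b
            ),
        }],
    )
    answer = response.choices[0].message.content.strip().upper()
    return "A" if answer.startswith("A") else "B"
```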

Datasets

  • Usually, prompts are taken from an “alignment dataset”
  • HELM Instruct uses Self-Instruct, Koala Eval, Vicuna Eval, Anthropic Red Team, Anthropic HH, Best ChatGPT Prompts, and OASST1

Automatic annotators

Important points to remember
  • How you ask the annotator to rate matters
    • AlpacaEval 2
      • asks the annotator to order the outputs by preference
      • uses temperature 0
      • randomizes the order of the outputs (see the sketch after this list)
      • modified the prompt to decrease the length bias to 0.68
    • Vicuna/lm-sys
      • asks the annotator for a score from 1–10 for each output
      • does not randomize the output order
      • asks for an explanation after the score
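The following is a hedged illustration of the output-order randomization mentioned for AlpacaEval 2: the candidate and the baseline are shuffled into the A/B slots before judging, and the verdict is mapped back afterwards so a fixed-position bias does not favor either model. The `judge` callable stands in for a pairwise judge such as the `judge_pair` sketch shown earlier; the function names are hypothetical.

```python
# Sketch of output-order randomization for pairwise LM judging.
import random
from typing import Callable


def judged_winner(
    instruction: str,
    candidate: str,
    baseline: str,
    judge: Callable[[str, str, str], str],  # e.g. judge_pair from the earlier sketch
    seed: int = 0,
) -> str:
    """Return 'candidate' or 'baseline' after randomizing the A/B presentation order."""
    rng = random.Random(seed)
    swap = rng.random() < 0.5  # coin flip: which model sits in slot A
    a, b = (baseline, candidate) if swap else (candidate, baseline)
    verdict = judge(instruction, a, b)  # judge returns 'A' or 'B'
    picked_candidate = (verdict == "A") != swap  # undo the swap
    return "candidate" if picked_candidate else "baseline"
```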