The reward model takes a model response and its corresponding prompt (including contexts from previous turns) as inputs and outputs a scalar score to indicate the quality (e.g., helpfulness and safety) of the model generation.
Two reward models Helpfulness and safety have been found to sometimes trade off (Bai et al., 2022a), which can make it challenging for a single reward model to perform well on both. To address this, Llama 2 trains two separate reward models: one optimized for helpfulness (referred to as the Helpfulness RM) and another for safety (the Safety RM).
Initialized from base model Initializing both reward models from the base model ensures that they benefit from knowledge acquired in pretraining.
Same architecture The model architecture and hyper-parameters are identical to those of the pretrained language models, except that the classification head for next-token prediction is replaced with a regression head for outputting a scalar reward.
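A minimal sketch of this setup (assuming a Hugging Face-style backbone loaded with `AutoModel`; the class name, pooling choice, and padding convention are illustrative assumptions, not details from the paper):

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class RewardModel(nn.Module):
    """Pretrained transformer backbone with the next-token prediction head
    replaced by a regression head that outputs a single scalar reward."""

    def __init__(self, base_model_name: str):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(base_model_name)
        # Regression head: final hidden state -> scalar reward.
        self.reward_head = nn.Linear(self.backbone.config.hidden_size, 1)

    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        hidden = self.backbone(input_ids=input_ids,
                               attention_mask=attention_mask).last_hidden_state
        # Summarize the sequence with the last non-padding token
        # (assumes right padding).
        last_idx = attention_mask.sum(dim=1) - 1
        batch_idx = torch.arange(hidden.size(0), device=hidden.device)
        summary = hidden[batch_idx, last_idx]
        return self.reward_head(summary).squeeze(-1)  # shape: (batch,)
```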
Training objective
To train the reward model, the collected pairwise human preference data is converted into a binary ranking label format (i.e., chosen & rejected), and the chosen response is required to have a higher score than its rejected counterpart.
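Written out, this amounts to a standard pairwise (binary) ranking loss on the scalar scores, as used in Llama 2:

$$
\mathcal{L}_{\text{ranking}} = -\log \sigma\big(r_\theta(x, y_c) - r_\theta(x, y_r)\big)
$$

where $r_\theta(x, y)$ is the scalar score for prompt $x$ and response $y$, $y_c$ is the chosen response, and $y_r$ is the rejected one.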
Additionally, because the preference annotations in Llama 2 come with a rating scale (significantly better / better / …), the reward model is trained to assign more widely separated scores to pairs whose responses differ more, by adding a margin to the ranking loss (i.e., the score of the chosen response must exceed that of the rejected one by at least the margin, with larger margins for more strongly rated pairs).
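A minimal PyTorch sketch of this margin-augmented ranking loss (the function name and the margin values are illustrative placeholders, not the ones used in the paper):

```python
import torch
import torch.nn.functional as F

def reward_ranking_loss(chosen_scores: torch.Tensor,
                        rejected_scores: torch.Tensor,
                        margins: torch.Tensor) -> torch.Tensor:
    """Pairwise ranking loss with a margin:
    -log(sigmoid(r_chosen - r_rejected - margin)).
    A larger margin is used for pairs rated as more clearly different,
    which pushes their scores further apart."""
    return -F.logsigmoid(chosen_scores - rejected_scores - margins).mean()

# Illustrative usage: one margin per pair, chosen according to the annotated
# preference strength (the values 1.0 and 2.0 are placeholders).
chosen = torch.tensor([2.3, 0.7])    # scores for the chosen responses
rejected = torch.tensor([1.1, 0.9])  # scores for the rejected responses
margins = torch.tensor([1.0, 2.0])   # e.g., "better" vs. "significantly better"
loss = reward_ranking_loss(chosen, rejected, margins)
```

Setting all margins to zero recovers the plain binary ranking loss above.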