We sample K outputs from the model and select the best candidate with our reward model.
We use the selected outputs for a gradient update. For each prompt, the sample obtaining the highest reward score is considered the new gold standard. Similar to Scialom et al. (2020a), we then fine-tune our model on the new set of ranked samples, reinforcing the reward.
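A minimal sketch of this rejection-sampling step is shown below. It assumes a Hugging Face-style `policy.generate()` and tokenizer interface and a hypothetical `reward_model.score(prompt, output)` helper returning a scalar; these names are illustrative assumptions, not the paper's code.

```python
import torch


def rejection_sampling_step(policy, tokenizer, reward_model, prompts, K=4):
    """For each prompt, sample K candidates from the current policy, keep the
    one with the highest reward score, and return (prompt, best_output) pairs
    that serve as the new gold standard for supervised fine-tuning."""
    new_gold = []
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt")
        # Sample K candidate generations from the current policy.
        generations = policy.generate(
            **inputs,
            do_sample=True,
            num_return_sequences=K,
            max_new_tokens=256,
        )
        candidates = tokenizer.batch_decode(generations, skip_special_tokens=True)
        # Score each candidate with the reward model and keep the best one.
        scores = torch.tensor([reward_model.score(prompt, c) for c in candidates])
        new_gold.append((prompt, candidates[scores.argmax().item()]))
    return new_gold
```

The returned pairs are then used as ordinary supervised fine-tuning targets, which is what "reinforcing the reward" amounts to in practice.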
Proximal Policy Optimization (PPO) in Llama 2
We further train our language model following the RL scheme of Stiennon et al. (2020), which uses the reward model as an estimate for the true reward function (human preference) and the pretrained language model as the policy to optimize.
\[
\arg\max_{\pi} \; \mathbb{E}_{p \sim \mathcal{D},\, g \sim \pi}\big[ R(g \mid p) \big]
\]
We iteratively improve the policy by sampling prompts p from our dataset D and generations g from the policy π, and we use the PPO algorithm and loss function to achieve this objective.
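For concreteness, the following is the standard PPO clipped surrogate objective in PyTorch. The paper does not publish its loss code, so this is the generic form of the objective rather than Llama 2's exact implementation.

```python
import torch


def ppo_clipped_loss(logprobs, old_logprobs, advantages, clip_eps=0.2):
    """Generic PPO clipped surrogate loss over per-token log-probabilities.

    logprobs:     log pi_theta(a_t | s_t) under the current policy
    old_logprobs: the same quantities under the policy that generated the data
    advantages:   advantage estimates for each sampled token
    """
    ratio = torch.exp(logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # PPO maximizes the surrogate objective, so the loss is its negation.
    return -torch.min(unclipped, clipped).mean()
```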
The reward function is
\[
R(g \mid p) = \tilde{R}_c(g \mid p) - \beta\, D_{\mathrm{KL}}\!\big( \pi_\theta(g \mid p) \,\|\, \pi_0(g \mid p) \big),
\]
where the second term is a penalty for diverging from the original policy $\pi_0$. As observed in other works (Stiennon et al., 2020; Ouyang et al., 2022), we find this constraint useful for training stability and for reducing reward hacking, whereby we would otherwise achieve high scores from the reward model but low scores from human evaluation.
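A sketch of how this penalized reward can be assembled per token is given below, using the usual single-sample KL estimate (the log-probability gap on the sampled tokens). The tensor shapes, the `beta` value, and the choice to credit the reward-model score at the final token are illustrative assumptions, not details taken from the paper.

```python
import torch


def penalized_reward(reward_scores, logprobs_policy, logprobs_ref, beta=0.01):
    """Combine the reward-model score with a KL penalty toward the initial
    policy pi_0, following R(g|p) = R~_c(g|p) - beta * KL(pi_theta || pi_0).

    reward_scores:   (batch,) sequence-level scores from the reward model
    logprobs_policy: (batch, seq_len) log pi_theta of the sampled tokens
    logprobs_ref:    (batch, seq_len) log pi_0 of the same tokens
    """
    # Per-token single-sample estimate of KL(pi_theta || pi_0).
    kl = logprobs_policy - logprobs_ref
    per_token_reward = -beta * kl
    # Add the sequence-level reward-model score at the last generated token.
    per_token_reward[:, -1] += reward_scores
    return per_token_reward
```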
We train for between 200 and 400 iterations for all our models, and use evaluations on held-out prompts for early stopping. Each iteration of PPO on the 70B model takes on average ≈ 330 seconds.