given a finetuned model f, a prompt p, and a preferred / dispreferred response pair (y_w, y_l):
push up the log-probs of the preferred response y_w
push down the log-probs of the dispreferred response y_l
log-probs are normalized by the log-probs of the reference model f_ref (an implicit KL-divergence constraint); see the loss sketch below
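a minimal sketch of the resulting loss in PyTorch, assuming each argument is a (batch,)-shaped tensor of summed per-token log-probs for a whole response; the function name, argument names, and beta value are illustrative, not taken from the paper:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for a batch of (preferred, dispreferred) response pairs.

    Each argument: summed log-probs of a full response under the
    finetuned policy f or the frozen reference model f_ref.
    """
    # log-ratio vs. the reference model for each response (the implicit KL term)
    logratio_w = policy_logp_w - ref_logp_w
    logratio_l = policy_logp_l - ref_logp_l
    # -log sigmoid(beta * margin): pushes up y_w and pushes down y_l,
    # but only relative to the reference model
    return -F.logsigmoid(beta * (logratio_w - logratio_l)).mean()
```

note that gradients should only flow through the policy log-probs; the reference log-probs come from the frozen f_ref and can be precomputed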
Details
implicitly optimizes the same objective as existing RLHF algorithms (reward maximization with a KL-divergence constraint)
the approach leverages a particular choice of reward-model parameterization that enables extraction of its optimal policy in closed form, without an RL training loop
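written out in standard DPO notation (π_θ and π_ref for the finetuned and reference models f and f_ref above, β the KL penalty weight): the KL-constrained reward-maximization objective has a closed-form optimal policy, inverting it expresses the reward in terms of the policy, and substituting that into the Bradley-Terry preference model cancels the partition function Z(x), leaving the loss sketched earlier:

```latex
% optimal policy of the KL-constrained objective, and the implied reward
\pi_r(y \mid x) = \frac{1}{Z(x)}\,\pi_{\mathrm{ref}}(y \mid x)\exp\!\Big(\tfrac{1}{\beta}\, r(x,y)\Big)
\;\;\Longrightarrow\;\;
r(x,y) = \beta \log \frac{\pi_r(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)

% substituting into the Bradley--Terry model cancels Z(x), giving the DPO loss
\mathcal{L}_{\mathrm{DPO}}(\theta)
= -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\!\left[
  \log \sigma\!\Big(
    \beta \log \tfrac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    - \beta \log \tfrac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
  \Big)
\right]
```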