TLDR

  • given a finetuned model π_θ, a prompt x, a preferred response y_w, and a dis-preferred response y_l:
    • push up the logprobs of the preferred response y_w
    • push down the logprobs of the dis-preferred response y_l
    • logprobs are normalized by the logprobs of the reference model π_ref (an implicit KL-divergence constraint); see the loss sketch after this list
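
A minimal sketch of this loss in PyTorch, assuming per-sequence logprobs (summed over response tokens) have already been computed for both the policy and the frozen reference model; the function name, argument names, and the beta default are illustrative, not from the note:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1):
    """Preference loss on sequence logprobs, shape (batch,) each.

    "chosen" = preferred response y_w, "rejected" = dis-preferred response y_l.
    """
    # Normalize policy logprobs by the reference model (implicit KL constraint)
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps

    # Push up the chosen response relative to the rejected one
    logits = beta * (chosen_logratio - rejected_logratio)
    loss = -F.logsigmoid(logits).mean()

    # Implicit per-example rewards, handy for logging margins/accuracy
    chosen_reward = beta * chosen_logratio.detach()
    rejected_reward = beta * rejected_logratio.detach()
    return loss, chosen_reward, rejected_reward
```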

Details

  • implicitly optimizes the same objective as existing RLHF algorithms (reward maximization with a KL-divergence constraint)
  • the approach leverages a particular reward-model parameterization that lets the optimal policy be extracted in closed form, without an RL training loop (see the derivation sketch below)
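
A brief reconstruction of that closed-form step, written in the DPO paper's standard notation (π_θ the policy, π_ref the reference model, r the reward, β the KL weight); the note itself does not spell these symbols out:

```latex
% KL-constrained reward maximization has a closed-form optimal policy:
\max_{\pi}\; \mathbb{E}_{x\sim\mathcal{D},\,y\sim\pi(\cdot|x)}\big[r(x,y)\big]
  - \beta\,\mathbb{D}_{\mathrm{KL}}\big[\pi(\cdot|x)\,\|\,\pi_{\mathrm{ref}}(\cdot|x)\big]
\;\Longrightarrow\;
\pi_r(y|x) = \tfrac{1}{Z(x)}\,\pi_{\mathrm{ref}}(y|x)\exp\!\big(r(x,y)/\beta\big)

% Inverting for the reward:
r(x,y) = \beta\log\frac{\pi_r(y|x)}{\pi_{\mathrm{ref}}(y|x)} + \beta\log Z(x)

% Plugging into the Bradley--Terry preference model cancels Z(x),
% leaving a loss over policy logprobs only (no explicit reward model, no RL loop):
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}})
 = -\,\mathbb{E}_{(x,y_w,y_l)\sim\mathcal{D}}
   \Big[\log\sigma\Big(\beta\log\frac{\pi_\theta(y_w|x)}{\pi_{\mathrm{ref}}(y_w|x)}
                      -\beta\log\frac{\pi_\theta(y_l|x)}{\pi_{\mathrm{ref}}(y_l|x)}\Big)\Big]
```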