TLDR

  • given a finetuned model π_θ, a prompt x, a preferred response y_w, and a dis-preferred response y_l:
    • push up the logprobs of the preferred response y_w
    • push down the logprobs of the dis-preferred response y_l
    • logprobs are normalized by the logprobs of the reference model π_ref (an implicit KL-divergence constraint); see the loss sketch after this list
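
A minimal sketch of this loss in PyTorch, assuming per-sequence logprobs (summed over response tokens) have already been computed for both the policy and the frozen reference model; the function name, argument names, and the beta default are illustrative, not from the note:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1):
    """Preference loss on sequence logprobs, shape (batch,) each.

    "chosen" = preferred response y_w, "rejected" = dis-preferred response y_l.
    """
    # Normalize policy logprobs by the reference model (implicit KL constraint)
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps

    # Push up the chosen response relative to the rejected one
    logits = beta * (chosen_logratio - rejected_logratio)
    loss = -F.logsigmoid(logits).mean()

    # Implicit per-example rewards, handy for logging margins/accuracy
    chosen_reward = beta * chosen_logratio.detach()
    rejected_reward = beta * rejected_logratio.detach()
    return loss, chosen_reward, rejected_reward
```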

Details

  • implicitly optimizes the same objective as existing RLHF algorithms (reward maximization with a KL-divergence constraint)
  • the approach leverages a particular reward-model parameterization that lets the optimal policy be extracted in closed form, without an RL training loop (see the derivation sketch below)
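
A brief reconstruction of that closed-form step, written in the DPO paper's standard notation (π_θ the policy, π_ref the reference model, r the reward, β the KL weight); the note itself does not spell these symbols out:

```latex
% KL-constrained reward maximization has a closed-form optimal policy:
\max_{\pi}\; \mathbb{E}_{x\sim\mathcal{D},\,y\sim\pi(\cdot|x)}\big[r(x,y)\big]
  - \beta\,\mathbb{D}_{\mathrm{KL}}\big[\pi(\cdot|x)\,\|\,\pi_{\mathrm{ref}}(\cdot|x)\big]
\;\Longrightarrow\;
\pi_r(y|x) = \tfrac{1}{Z(x)}\,\pi_{\mathrm{ref}}(y|x)\exp\!\big(r(x,y)/\beta\big)

% Inverting for the reward:
r(x,y) = \beta\log\frac{\pi_r(y|x)}{\pi_{\mathrm{ref}}(y|x)} + \beta\log Z(x)

% Plugging into the Bradley--Terry preference model cancels Z(x),
% leaving a loss over policy logprobs only (no explicit reward model, no RL loop):
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}})
 = -\,\mathbb{E}_{(x,y_w,y_l)\sim\mathcal{D}}
   \Big[\log\sigma\Big(\beta\log\frac{\pi_\theta(y_w|x)}{\pi_{\mathrm{ref}}(y_w|x)}
                      -\beta\log\frac{\pi_\theta(y_l|x)}{\pi_{\mathrm{ref}}(y_l|x)}\Big)\Big]
```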