Q-function

  • expected cumulative reward of taking action $a$ in state $s$.
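  • Written out (assuming a discount factor $\gamma$, per-step rewards $r_t$, and a policy $\pi$ followed after the first action), one standard form is
    $$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_t \,\middle|\, s_0 = s,\ a_0 = a\right]$$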

Value function

  • expected cumulative reward of being in state $s$ (depending on the current policy and over all possible future trajectories).
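  • In the same (assumed) notation:
    $$V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_t \,\middle|\, s_0 = s\right] = \mathbb{E}_{a \sim \pi(\cdot \mid s)}\big[Q^{\pi}(s, a)\big]$$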

Advantage function

  • The advantage function (or surprise) measures the advantage of taking a particular action $a$ in state $s$ compared to the average action value from that state.
  • a normalized Q-function (see the definition below)
  • Benefits: stability, lower variance, implicit regularization. Overall, normalizing the Q-function helps regularize it in states that are very good or very bad.
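  • Tying the two together (same assumed notation), the advantage is the Q-value with the state value subtracted as a baseline:
    $$A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s)$$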

Proximal Policy Optimization (PPO)

  • Let $r_t(\theta)$ denote the probability ratio $r_t(\theta) = \dfrac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$, so $r_t(\theta_{\text{old}}) = 1$.

  • We maximize the clipped surrogate objective $L^{\text{CLIP}}(\theta) = \mathbb{E}_t\!\left[\min\!\big(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\big)\right]$

  • Without constraints, maximizing the unclipped objective $r_t(\theta)\,\hat{A}_t$ would lead to an excessively large policy update

  • Clipping is an approximation of KL-divergence regularization (it keeps the new policy close to the old one)

  • We take the minimum of the clipped and unclipped objective, so the final objective is a lower bound (i.e., a pessimistic bound) on the unclipped objective.

  • With this scheme, given the advantage estimate $\hat{A}_t$,

    • if $\hat{A}_t > 0$, the clip + min prevents updates larger than $1 + \epsilon$, i.e. the probability mass should not “move” more than $\epsilon$ away from the old policy. This is also because $r_t(\theta)$ is unbounded on the positive side
    • if $\hat{A}_t < 0$, given the lower clip bound $1 - \epsilon$, the clip + min also prevents updates of more than $\epsilon$ away from the old policy; otherwise the policy might do reward hacking and set $\pi_\theta(a_t \mid s_t) = 0$, which would give us a degenerate policy.
  • In language modeling, action = selecting a token, state = the context (prompt plus tokens generated so far); a per-token sketch of the clipped loss follows below
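
  • A minimal PyTorch sketch of the clipped surrogate loss (the function name and signature are illustrative, not from any particular library; it assumes per-token log-probabilities under the new and old policies, and the advantage estimates, are already available):

    ```python
    import torch

    def ppo_clipped_loss(logp_new: torch.Tensor,
                         logp_old: torch.Tensor,
                         advantages: torch.Tensor,
                         eps: float = 0.2) -> torch.Tensor:
        """Negative clipped surrogate (to be minimized), averaged over tokens."""
        ratio = torch.exp(logp_new - logp_old)                        # r_t(theta)
        unclipped = ratio * advantages                                # r_t * A_hat_t
        clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages  # clipped term
        # min of the two terms -> pessimistic lower bound on the unclipped objective
        return -torch.min(unclipped, clipped).mean()
    ```

    Only logp_new carries gradients; logp_old and the advantages are treated as constants during the update.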

  • Why is the value head on the finetuned model and not on the reward model?

    • It’s because value functions represent the expected cumulative reward of a state under the policy being trained, while the reward model is independent of the policy.
      • Another reason is that usually we learn a value function as a critic in an actor-critic algorithm, so from an architecture design perspective, you’d want to separate the algorithm from the environment
      • The expectation is conditioned on the current policy
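    • As a sketch of the architectural point (class and argument names are hypothetical, not from any specific library): the value head is a small linear layer on top of the policy backbone, trained together with the policy, while the reward model stays frozen as part of the environment.

    ```python
    import torch
    import torch.nn as nn

    class PolicyWithValueHead(nn.Module):
        """Actor-critic wrapper: the critic (value head) shares the policy's backbone."""

        def __init__(self, backbone: nn.Module, hidden_size: int):
            super().__init__()
            self.backbone = backbone                      # the finetuned LM (policy / actor)
            self.value_head = nn.Linear(hidden_size, 1)   # scalar value per position (critic)

        def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
            # Assumes the backbone returns per-token hidden states of shape
            # (batch, seq_len, hidden_size); real LM wrappers expose this differently.
            hidden = self.backbone(input_ids)
            return self.value_head(hidden).squeeze(-1)    # values: (batch, seq_len)
    ```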

Why it takes so long

  • The value head must converge
  • Computing the advantages takes a long time (see the sketch below)
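  • As an illustration of the advantage-computation cost (these notes do not name a method, but Generalized Advantage Estimation is a common choice with PPO): GAE is a sequential backward pass over every step of every rollout.

    ```python
    import torch

    def gae_advantages(rewards: torch.Tensor,
                       values: torch.Tensor,
                       gamma: float = 1.0,
                       lam: float = 0.95) -> torch.Tensor:
        """Generalized Advantage Estimation for one trajectory of length T.

        rewards: per-step rewards, shape (T,)
        values:  value estimates V(s_t), shape (T + 1,), including a bootstrap value
        """
        T = rewards.shape[0]
        advantages = torch.zeros(T)
        gae = 0.0
        # Backward recursion: A_t = delta_t + gamma * lam * A_{t+1}
        for t in reversed(range(T)):
            delta = rewards[t] + gamma * values[t + 1] - values[t]
            gae = delta + gamma * lam * gae
            advantages[t] = gae
        return advantages
    ```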