Q-function
- $Q^\pi(s, a)$: the expected cumulative reward of taking action $a$ in state $s$ (and then following the policy $\pi$).
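Written out (assuming the standard infinite-horizon, discounted setting with discount factor $\gamma$; the notes themselves don't fix a horizon):

```latex
Q^{\pi}(s, a) \;=\; \mathbb{E}_{\pi}\!\left[\, \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k} \;\middle|\; s_t = s,\ a_t = a \right]
```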
Value function
- $V^\pi(s)$: the expected cumulative reward of being in state $s$ (depending on the current policy $\pi$ and averaged over all possible future trajectories).
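Under the same assumptions, it is the Q-function averaged over the policy's action choices:

```latex
V^{\pi}(s) \;=\; \mathbb{E}_{\pi}\!\left[\, \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k} \;\middle|\; s_t = s \right] \;=\; \mathbb{E}_{a \sim \pi(\cdot \mid s)}\!\left[\, Q^{\pi}(s, a) \,\right]
```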
Advantage function
- The advantage function (or "surprise") $A^\pi(s, a)$ measures how much better taking a particular action $a$ in state $s$ is compared to the average action value from that state.
- It is a normalized (baselined) Q-function; see the formula just below.
- Advantages of using $A$ instead of raw $Q$: stability, lower variance, implicit regularization. Overall, normalizing the Q-function keeps the learning signal well-scaled even when the policy is in a very good or very bad state.
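In symbols, the advantage is the Q-value with the state's value subtracted as a baseline:

```latex
A^{\pi}(s, a) \;=\; Q^{\pi}(s, a) \;-\; V^{\pi}(s)
```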
Proximal Policy Optimization (PPO)
- Let $r_t(\theta)$ denote the probability ratio $r_t(\theta) = \dfrac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$, so $r_t(\theta_{\text{old}}) = 1$.
- We maximize the clipped surrogate objective $L^{\text{CLIP}}(\theta) = \hat{\mathbb{E}}_t\!\left[ \min\!\left( r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t \right) \right]$ (see the sketch after this list).
- Without constraints, the objective would lead to an excessively large policy update.
- Clipping is an approximation of a KL-divergence regularization.
- We take the minimum of the clipped and unclipped objective, so the final objective is a lower bound (i.e., a pessimistic bound) on the unclipped objective.
- With this scheme, given a clip range $\epsilon$:
- if $\hat{A}_t > 0$, the clip + min prevents updates where $r_t(\theta) > 1 + \epsilon$, i.e. the probability mass should not "move" more than $\epsilon$ away from the old policy. This also matters because $r_t(\theta)$ is unbounded on the positive side.
- if $\hat{A}_t < 0$, given the lower clip at $1 - \epsilon$, the clip + min also prevents updates more than $\epsilon$ away from the old policy; otherwise the policy might do reward hacking and set $\pi_\theta(a_t \mid s_t) = 0$, which would give us a degenerate policy.
- In language modeling, action = selecting a token, state = the context generated so far.
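A minimal PyTorch sketch of the clipped objective, written per token as in the language-modeling setting above. The function name, tensor shapes, and the assumption that per-token log-probabilities and advantage estimates are already available are mine, not from the notes:

```python
import torch

def ppo_clip_loss(logprobs, old_logprobs, advantages, clip_eps=0.2):
    """Clipped PPO surrogate, negated so it can be minimized.

    logprobs:      log pi_theta(a_t | s_t) for each sampled token (current policy)
    old_logprobs:  log pi_theta_old(a_t | s_t) for the same tokens (no grad)
    advantages:    advantage estimates A_hat_t, one per token
    """
    ratio = torch.exp(logprobs - old_logprobs)                    # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # elementwise min of clipped and unclipped -> pessimistic lower bound
    return -torch.min(unclipped, clipped).mean()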
Why is the value head on the finetuned model and not on the reward model?
- It's because value functions represent the expected cumulative rewards of a state by following a policy, while the reward is independent of the policy.
- Another reason is that we usually learn a value function as a critic in an actor-critic algorithm, so from an architecture-design perspective you'd want to separate the algorithm (the actor-critic) from the environment (the reward model); see the sketch after this list.
- The expectation is conditioned on the current policy
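A sketch of what this looks like architecturally: the policy LM carries the value head (the critic), while the reward model remains a separate, frozen network. The class and attribute names are hypothetical, and it assumes a Hugging Face-style causal LM that can return hidden states:

```python
import torch.nn as nn

class PolicyWithValueHead(nn.Module):
    """The fine-tuned (policy) LM plus a scalar value head (the critic)."""

    def __init__(self, lm, hidden_size):
        super().__init__()
        self.lm = lm                                   # causal LM being fine-tuned (the actor)
        self.value_head = nn.Linear(hidden_size, 1)    # predicts V(s) for each prefix

    def forward(self, input_ids, attention_mask=None):
        out = self.lm(input_ids, attention_mask=attention_mask,
                      output_hidden_states=True)
        hidden = out.hidden_states[-1]                 # [batch, seq, hidden]
        values = self.value_head(hidden).squeeze(-1)   # [batch, seq] -> V(s_t) per position
        return out.logits, values
```

Since $V^\pi$ is defined with respect to the current policy, the value head has to be trained alongside the policy it evaluates, whereas the reward model can stay frozen.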
Why it takes so long
- The value head must converge
- Computing the advantages takes a long time
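The notes don't say which advantage estimator is used; a common choice in PPO implementations is Generalized Advantage Estimation (GAE), which needs a full rollout plus a value estimate at every step, one reason this stage is slow. A sketch (function and argument names are mine):

```python
import torch

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over a single trajectory.

    rewards: [T] per-step rewards
    values:  [T+1] value-head estimates V(s_0) ... V(s_T)
    The estimates are only useful once the value head has converged reasonably well.
    """
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    gae = 0.0
    # sweep backwards: each advantage accumulates discounted TD errors
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages
```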