Q-function

  • expected cumulative reward of taking action $a$ in state $s$.
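  • Written out (assuming a discount factor $\gamma$, per-step rewards $r_t$, and a policy $\pi$ followed after the first action), one standard form is
    $$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_t \,\middle|\, s_0 = s,\ a_0 = a\right]$$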

Value function

  • expected cumulative reward of being in state $s$ (depending on the current policy and over all possible future trajectories).
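  • In the same (assumed) notation:
    $$V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_t \,\middle|\, s_0 = s\right] = \mathbb{E}_{a \sim \pi(\cdot \mid s)}\big[Q^{\pi}(s, a)\big]$$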

Advantage function

  • The advantage function (or surprise) measures the advantage of taking a particular action $a$ in state $s$ compared to the average action value from that state.
  • a normalized Q-function (see the definition below)
  • Benefits: stability, lower variance, implicit regularization. Overall, normalizing the Q-function helps regularize it in states that are very good or very bad.
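  • Tying the two together (same assumed notation), the advantage is the Q-value with the state value subtracted as a baseline:
    $$A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s)$$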

Proximal Policy Optimization (PPO)

  • Let $r_t(\theta)$ denote the probability ratio $r_t(\theta) = \dfrac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$, so $r_t(\theta_{\text{old}}) = 1$.

  • We maximize the clipped surrogate objective $L^{\text{CLIP}}(\theta) = \mathbb{E}_t\!\left[\min\!\big(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\big)\right]$

  • Without constraints, maximizing the unclipped objective $r_t(\theta)\,\hat{A}_t$ would lead to an excessively large policy update

  • Clipping is an approximation of KL-divergence regularization (it keeps the new policy close to the old one)

  • We take the minimum of the clipped and unclipped objective, so the final objective is a lower bound (i.e., a pessimistic bound) on the unclipped objective.

  • With this scheme, given the advantage estimate $\hat{A}_t$,

    • if $\hat{A}_t > 0$, the clip + min prevents updates larger than $1 + \epsilon$, i.e. the probability mass should not “move” more than $\epsilon$ away from the old policy. This is also because $r_t(\theta)$ is unbounded on the positive side
    • if $\hat{A}_t < 0$, given the lower clip bound $1 - \epsilon$, the clip + min also prevents updates of more than $\epsilon$ away from the old policy; otherwise the policy might do reward hacking and set $\pi_\theta(a_t \mid s_t) = 0$, which would give us a degenerate policy.
  • In language modeling, action = selecting a token, state = the context (prompt plus tokens generated so far); a per-token sketch of the clipped loss follows below
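
  • A minimal PyTorch sketch of the clipped surrogate loss (the function name and signature are illustrative, not from any particular library; it assumes per-token log-probabilities under the new and old policies, and the advantage estimates, are already available):

    ```python
    import torch

    def ppo_clipped_loss(logp_new: torch.Tensor,
                         logp_old: torch.Tensor,
                         advantages: torch.Tensor,
                         eps: float = 0.2) -> torch.Tensor:
        """Negative clipped surrogate (to be minimized), averaged over tokens."""
        ratio = torch.exp(logp_new - logp_old)                        # r_t(theta)
        unclipped = ratio * advantages                                # r_t * A_hat_t
        clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages  # clipped term
        # min of the two terms -> pessimistic lower bound on the unclipped objective
        return -torch.min(unclipped, clipped).mean()
    ```

    Only logp_new carries gradients; logp_old and the advantages are treated as constants during the update.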

  • Why is the value head on the finetuned model and not on the reward model?

    • It’s because value functions represent the expected cumulative reward of a state under the policy being trained, while the reward model is independent of the policy.
      • Another reason is that usually we learn a value function as a critic in an actor-critic algorithm, so from an architecture design perspective, you’d want to separate the algorithm from the environment
      • The expectation is conditioned on the current policy
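    • As a sketch of the architectural point (class and argument names are hypothetical, not from any specific library): the value head is a small linear layer on top of the policy backbone, trained together with the policy, while the reward model stays frozen as part of the environment.

    ```python
    import torch
    import torch.nn as nn

    class PolicyWithValueHead(nn.Module):
        """Actor-critic wrapper: the critic (value head) shares the policy's backbone."""

        def __init__(self, backbone: nn.Module, hidden_size: int):
            super().__init__()
            self.backbone = backbone                      # the finetuned LM (policy / actor)
            self.value_head = nn.Linear(hidden_size, 1)   # scalar value per position (critic)

        def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
            # Assumes the backbone returns per-token hidden states of shape
            # (batch, seq_len, hidden_size); real LM wrappers expose this differently.
            hidden = self.backbone(input_ids)
            return self.value_head(hidden).squeeze(-1)    # values: (batch, seq_len)
    ```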

Why it takes so long

  • The value head must converge
  • Computing the advantages takes a long time (see the sketch below)
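  • As an illustration of the advantage-computation cost (these notes do not name a method, but Generalized Advantage Estimation is a common choice with PPO): GAE is a sequential backward pass over every step of every rollout.

    ```python
    import torch

    def gae_advantages(rewards: torch.Tensor,
                       values: torch.Tensor,
                       gamma: float = 1.0,
                       lam: float = 0.95) -> torch.Tensor:
        """Generalized Advantage Estimation for one trajectory of length T.

        rewards: per-step rewards, shape (T,)
        values:  value estimates V(s_t), shape (T + 1,), including a bootstrap value
        """
        T = rewards.shape[0]
        advantages = torch.zeros(T)
        gae = 0.0
        # Backward recursion: A_t = delta_t + gamma * lam * A_{t+1}
        for t in reversed(range(T)):
            delta = rewards[t] + gamma * values[t + 1] - values[t]
            gae = delta + gamma * lam * gae
            advantages[t] = gae
        return advantages
    ```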