PPO
Let $r_t(\theta)$ denote the probability ratio $r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\text{old}}(a_t \mid s_t)}$, so $r_t(\theta_\text{old}) = 1$. TRPO maximizes a "surrogate" objective
L"CPI"(theta)=hat(E)t[pitheta(at∣st)/pi(theta"old")(at∣st)hat(A)t]=hat(E)t[rt(theta)hat(A)t].
The superscript CPI refers to conservative policy iteration. Without a constraint, maximizing $L^{\text{CPI}}$ would lead to an excessively large policy update; hence, we now consider how to modify the objective to penalize changes to the policy that move $r_t(\theta)$ away from 1. The main objective proposed is the following:
L"CLIP"(theta)=hat(E)[minrt(theta)hat(A)t,"clip"(rt,1−epsilon,1+epsilon)(theta)hat(A)t]
where $\epsilon$ is a hyperparameter. The motivation for this objective is as follows. The first term inside the min is $L^{\text{CPI}}$. The second term, $\text{clip}\bigl(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\bigr)\,\hat{A}_t$, modifies the surrogate objective by clipping the probability ratio, which removes the incentive for moving $r_t$ outside of the interval $[1-\epsilon,\, 1+\epsilon]$. Finally, we take the minimum of the clipped and unclipped objective, so the final objective is a lower bound (i.e., a pessimistic bound) on the unclipped objective. With this scheme, we only ignore the change in probability ratio when it would make the objective improve, and we include it when it makes the objective worse.
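As a minimal sketch (in PyTorch, not taken from the original text), the clipped objective can be written as a loss to minimize; the names `logp_new`, `logp_old`, and `advantages`, and the default `epsilon = 0.2`, are illustrative assumptions:

```python
import torch

def clipped_surrogate_loss(logp_new: torch.Tensor,
                           logp_old: torch.Tensor,
                           advantages: torch.Tensor,
                           epsilon: float = 0.2) -> torch.Tensor:
    """Negative L^CLIP averaged over the batch (a loss to minimize)."""
    # Probability ratio r_t(theta) = pi_theta / pi_theta_old, from log-probs.
    ratio = torch.exp(logp_new - logp_old)
    # Unclipped term: r_t(theta) * A_hat_t
    unclipped = ratio * advantages
    # Clipped term: clip(r_t(theta), 1 - eps, 1 + eps) * A_hat_t
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    # Pessimistic lower bound: elementwise minimum, then mean over the batch.
    return -torch.min(unclipped, clipped).mean()
```

Negating the mean simply turns maximization of $L^{\text{CLIP}}$ into a gradient-descent loss.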
Using a truncated version of generalized advantage estimation (GAE), the advantage estimator is:
$$\hat{A}_t = \delta_t + (\gamma\lambda)\,\delta_{t+1} + \cdots + (\gamma\lambda)^{T-t-1}\,\delta_{T-1}$$
where $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ is the TD residual (here $r_t$ denotes the reward at time $t$, not the probability ratio).
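A common way to compute this truncated estimator is a single backward recursion over a rollout of length $T$, using $\hat{A}_t = \delta_t + \gamma\lambda\,\hat{A}_{t+1}$. The sketch below (an assumption-laden illustration, not from the original text) takes arrays `rewards` of length $T$ and `values` of length $T+1$ (including a bootstrap value for the final state) and omits episode-termination masking for brevity:

```python
import numpy as np

def truncated_gae(rewards: np.ndarray,
                  values: np.ndarray,
                  gamma: float = 0.99,
                  lam: float = 0.95) -> np.ndarray:
    """Advantage estimates A_hat_t for t = 0..T-1 via truncated GAE."""
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    # Iterate backwards so that A_hat_t = delta_t + gamma*lambda*A_hat_{t+1}.
    for t in reversed(range(T)):
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages
```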