Group Relative Policy Optimization
GRPO was first introduced in DeepSeekMath (February 2024), but received much wider recognition after DeepSeek R1's success.
In contrast to methods like PPO, GRPO forgoes the critic model, which is typically the same size as the policy model, and instead estimates the baseline from group scores.
Specifically, for each question $q$, GRPO samples a group of outputs $\{o_1, o_2, \cdots, o_G\}$ from the old policy model $\pi_{\theta_{old}}$ and then optimizes the policy model $\pi_{\theta}$ by maximizing the following objective:
$$
\mathcal{J}_{GRPO}(\theta) = \mathbb{E}\left[q \sim P(Q),\, \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{old}}(O \mid q)\right] \\
\frac{1}{G} \sum_{i=1}^{G} \left[ \min\left\{ \frac{\pi_{\theta}(o_i \mid q)}{\pi_{\theta_{old}}(o_i \mid q)} A_i,\ \mathrm{clip}\!\left( \frac{\pi_{\theta}(o_i \mid q)}{\pi_{\theta_{old}}(o_i \mid q)},\, 1-\epsilon,\, 1+\epsilon \right) A_i \right\} - \beta\, \mathbb{D}_{KL}\!\left[\pi_{\theta}(o_i \mid q)\, \|\, \pi_{ref}(o_i \mid q)\right] \right]
$$
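To make the pieces concrete, here is a minimal NumPy sketch of this objective for a single group, assuming sequence-level log-probabilities have already been gathered (real implementations typically work per token); the function and argument names (`grpo_objective`, `logp_new`, `eps`, `beta`) are illustrative, not from the paper:

```python
import numpy as np

def grpo_objective(logp_new, logp_old, logp_ref, advantages, eps=0.2, beta=0.04):
    """Sketch of the GRPO objective for one group of G sampled outputs.

    logp_new, logp_old, logp_ref: arrays of shape (G,) holding
    log pi_theta(o_i|q), log pi_theta_old(o_i|q), log pi_ref(o_i|q).
    advantages: group-relative advantages A_i, shape (G,).
    """
    ratio = np.exp(logp_new - logp_old)              # pi_theta / pi_theta_old
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    surrogate = np.minimum(ratio * advantages, clipped * advantages)

    # sample-wise estimate of KL(pi_theta || pi_ref); see the estimator below
    ref_ratio = np.exp(logp_ref - logp_new)          # pi_ref / pi_theta
    kl = ref_ratio - (logp_ref - logp_new) - 1.0

    return np.mean(surrogate - beta * kl)            # quantity to maximize
```

In a training loop one would perform gradient ascent on this value with respect to the parameters that produce `logp_new`.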
In the DeepSeek series, the KL divergence is approximated by the following unbiased estimator (Schulman, 2020):
$$
\mathbb{D}_{KL}\left[\pi_{\theta} \,\|\, \pi_{ref}\right] = \frac{\pi_{ref}(o_i \mid q)}{\pi_{\theta}(o_i \mid q)} - \log \frac{\pi_{ref}(o_i \mid q)}{\pi_{\theta}(o_i \mid q)} - 1,
$$
which is guaranteed to be non-negative.
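A small sketch of that estimator computed from log-probabilities (the name `kl_k3` is my own, after the "k3" estimator in Schulman's note), which also shows why each sample is non-negative: with $r = \pi_{ref}/\pi_{\theta}$, the quantity $r - \log r - 1 \ge 0$ for any $r > 0$, with equality only at $r = 1$:

```python
import numpy as np

def kl_k3(logp_theta, logp_ref):
    """Per-sample estimate of KL(pi_theta || pi_ref) from log-probabilities.

    With r = pi_ref / pi_theta, returns r - log(r) - 1, which is >= 0
    because x - log(x) - 1 >= 0 for all x > 0 (equality at x = 1).
    """
    log_r = np.asarray(logp_ref) - np.asarray(logp_theta)
    return np.exp(log_r) - log_r - 1.0

# e.g. kl_k3(np.log(0.2), np.log(0.3)) ≈ 0.0945, and exactly 0 when the two match
```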
$A_i$ is the advantage, derived from the rewards $\{r_1, r_2, \cdots, r_G\}$ corresponding to the outputs within each group:
$$
A_i = \frac{r_i - \mathrm{mean}(\{r_1, r_2, \cdots, r_G\})}{\mathrm{std}(\{r_1, r_2, \cdots, r_G\})}.
$$
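This normalization is a one-liner, sketched below under the assumption of one scalar reward per sampled output; the small `eps` guard against a zero-variance group is my addition, not part of the formula:

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """A_i = (r_i - mean) / std, computed within one group of G outputs."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# e.g. rewards [1.0, 0.0, 0.0, 1.0] -> advantages [ 1., -1., -1.,  1.]
```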