Group Relative Policy Optimization
GRPO was first introduced in DeepSeekMath (February 2024), but received much wider recognition after DeepSeek R1's success.
In contrast to methods like PPO, GRPO forgoes the critic model, which is typically the same size as the policy model, and instead estimates the baseline from group scores.
Specifically, for each question $q$, GRPO samples a group of outputs $\{o_1, o_2, \cdots, o_G\}$ from the old policy model $\pi_{\theta_{old}}$ and then optimizes the policy model $\pi_{\theta}$ by maximizing the following objective:
$$
\mathcal{J}_{GRPO}(\theta) = \mathbb{E}\left[q \sim P(Q),\, \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{old}}(O \mid q)\right] \\
\frac{1}{G} \sum_{i=1}^{G} \left[ \min\left\{ \frac{\pi_{\theta}(o_i \mid q)}{\pi_{\theta_{old}}(o_i \mid q)} A_i,\ \mathrm{clip}\!\left( \frac{\pi_{\theta}(o_i \mid q)}{\pi_{\theta_{old}}(o_i \mid q)},\, 1-\epsilon,\, 1+\epsilon \right) A_i \right\} - \beta\, \mathbb{D}_{KL}\!\left[\pi_{\theta}(o_i \mid q)\, \|\, \pi_{ref}(o_i \mid q)\right] \right]
$$
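To make the pieces concrete, here is a minimal NumPy sketch of this objective for a single group, assuming sequence-level log-probabilities have already been gathered (real implementations typically work per token); the function and argument names (`grpo_objective`, `logp_new`, `eps`, `beta`) are illustrative, not from the paper:

```python
import numpy as np

def grpo_objective(logp_new, logp_old, logp_ref, advantages, eps=0.2, beta=0.04):
    """Sketch of the GRPO objective for one group of G sampled outputs.

    logp_new, logp_old, logp_ref: arrays of shape (G,) holding
    log pi_theta(o_i|q), log pi_theta_old(o_i|q), log pi_ref(o_i|q).
    advantages: group-relative advantages A_i, shape (G,).
    """
    ratio = np.exp(logp_new - logp_old)              # pi_theta / pi_theta_old
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    surrogate = np.minimum(ratio * advantages, clipped * advantages)

    # sample-wise estimate of KL(pi_theta || pi_ref); see the estimator below
    ref_ratio = np.exp(logp_ref - logp_new)          # pi_ref / pi_theta
    kl = ref_ratio - (logp_ref - logp_new) - 1.0

    return np.mean(surrogate - beta * kl)            # quantity to maximize
```

In a training loop one would perform gradient ascent on this value with respect to the parameters that produce `logp_new`.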
In the DeepSeek series, the KL divergence is approximated by the following unbiased estimator (Schulman, 2020):
$$
\mathbb{D}_{KL}\left[\pi_{\theta} \,\|\, \pi_{ref}\right] = \frac{\pi_{ref}(o_i \mid q)}{\pi_{\theta}(o_i \mid q)} - \log \frac{\pi_{ref}(o_i \mid q)}{\pi_{\theta}(o_i \mid q)} - 1,
$$
which is guaranteed to be non-negative.
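A small sketch of that estimator computed from log-probabilities (the name `kl_k3` is my own, after the "k3" estimator in Schulman's note), which also shows why each sample is non-negative: with $r = \pi_{ref}/\pi_{\theta}$, the quantity $r - \log r - 1 \ge 0$ for any $r > 0$, with equality only at $r = 1$:

```python
import numpy as np

def kl_k3(logp_theta, logp_ref):
    """Per-sample estimate of KL(pi_theta || pi_ref) from log-probabilities.

    With r = pi_ref / pi_theta, returns r - log(r) - 1, which is >= 0
    because x - log(x) - 1 >= 0 for all x > 0 (equality at x = 1).
    """
    log_r = np.asarray(logp_ref) - np.asarray(logp_theta)
    return np.exp(log_r) - log_r - 1.0

# e.g. kl_k3(np.log(0.2), np.log(0.3)) ≈ 0.0945, and exactly 0 when the two match
```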
$A_i$ is the advantage, derived from the rewards $\{r_1, r_2, \cdots, r_G\}$ corresponding to the outputs within each group:
$$
A_i = \frac{r_i - \mathrm{mean}(\{r_1, r_2, \cdots, r_G\})}{\mathrm{std}(\{r_1, r_2, \cdots, r_G\})}.
$$
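This normalization is a one-liner, sketched below under the assumption of one scalar reward per sampled output; the small `eps` guard against a zero-variance group is my addition, not part of the formula:

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """A_i = (r_i - mean) / std, computed within one group of G outputs."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# e.g. rewards [1.0, 0.0, 0.0, 1.0] -> advantages [ 1., -1., -1.,  1.]
```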