ML//Training//GRPO

2026-03-03

Group Relative Policy Optimization, DeepSeek's alternative to PPO for RL fine-tuning.

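Samples several completions per prompt and scores each one relative to the others in its group, so no separate value (critic) model is needed. With rule-based rewards, as in DeepSeek R1, no learned RM is needed either.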
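A minimal sketch of the group-relative scoring step, assuming the usual mean/std normalization of rewards within one group. The function name, the `eps` term, the choice of population std, and the sample rewards are illustrative assumptions, not taken from any particular implementation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other completions in the same group."""
    mu = mean(rewards)
    # Population std is an assumption here; some implementations use the sample std.
    sigma = pstdev(rewards)
    # eps (assumed) avoids division by zero when every completion gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four completions sampled from one prompt
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Completions that beat the group average get a positive advantage and the rest get a negative one. If every completion in a group receives the same reward, all advantages are zero and that group contributes no learning signal.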

Simpler and cheaper than PPO, which has to train a critic alongside the policy; also reported to be more stable. Used to train DeepSeek R1's reasoning capabilities.

Part of the "RL without process reward models" approach that challenged OpenAI's PRM methods.