ML//Training//on-policy vs off-policy
On-policy: the model generates data during training, then learns from its own outputs. PPO is on-policy: each update uses outputs sampled from the current policy, scored by a reward model.
Off-policy: the model learns from data that a different policy produced, typically a fixed dataset of pre-collected examples. DPO is off-policy: it trains on preference pairs (prompt, chosen, rejected) that were generated beforehand.
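The two definitions differ mainly in where the training example comes from at each step. A minimal toy sketch, assuming a policy that is just a dict of action probabilities and a made-up update rule (`reinforce`, `reward_fn`, and the two-action setup are all hypothetical, and this is not PPO or DPO):

```python
import random

def sample(policy, rng):
    """Draw one action from a dict mapping action -> probability."""
    actions = list(policy)
    return rng.choices(actions, weights=[policy[a] for a in actions], k=1)[0]

def reinforce(policy, action, reward, lr=0.1):
    """Toy update (an assumption): shift `action` by lr * reward, then renormalise."""
    updated = dict(policy)
    updated[action] = max(1e-6, updated[action] + lr * reward)
    total = sum(updated.values())
    return {a: p / total for a, p in updated.items()}

def train_on_policy(policy, reward_fn, steps, rng):
    """Every step samples from the *current* policy, then learns from that sample."""
    generations = 0
    for _ in range(steps):
        action = sample(policy, rng)  # generation happens inside the training loop
        generations += 1
        policy = reinforce(policy, action, reward_fn(action))
    return policy, generations

def train_off_policy(policy, dataset, steps):
    """Every step replays a pre-collected (action, reward) example; nothing is generated."""
    generations = 0
    for step in range(steps):
        action, reward = dataset[step % len(dataset)]
        policy = reinforce(policy, action, reward)
    return policy, generations

start = {"a": 0.5, "b": 0.5}
reward_fn = lambda action: 1.0 if action == "a" else -1.0
fixed_data = [("a", 1.0), ("b", -1.0)]  # collected once, before training

on_policy, on_gens = train_on_policy(start, reward_fn, 50, random.Random(0))
off_policy, off_gens = train_off_policy(start, fixed_data, 50)
```

Both loops call the same update. The on-policy loop pays for one generation per step, while the off-policy loop pays for none and can reuse `fixed_data` in later runs.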
On-policy advantage: the model always trains on data from its current distribution — no stale examples.
Off-policy advantage: no generation during training = much faster per step, data reusable across runs.
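The data format is algorithm-agnostic: the same (prompt, chosen, rejected) pairs can feed both RLHF/PPO and DPO. In RLHF the pairs train a reward model, which then scores the policy's fresh outputs during PPO; DPO trains the policy on the pairs directly. The difference is when and how the data is generated and consumed.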
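To make the shared format concrete, here is a rough sketch of how one preference pair could be consumed by each pipeline: a Bradley-Terry loss for an RLHF reward model, and the DPO loss for the policy itself. The `PreferencePair` class, the function names, and the scalar inputs are illustrative assumptions; in practice the rewards and log-probabilities would come from model forward passes.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class PreferencePair:
    """One preference example; field names are illustrative."""
    prompt: str
    chosen: str
    rejected: str

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def reward_model_loss(reward_chosen, reward_rejected):
    """Bradley-Terry loss used to fit a reward model on a pair (RLHF stage 1)."""
    return -log_sigmoid(reward_chosen - reward_rejected)

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss on a pair: compares policy-vs-reference log-prob margins."""
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    return -log_sigmoid(beta * (chosen_margin - rejected_margin))

pair = PreferencePair(prompt="Explain PPO.", chosen="A clear answer.", rejected="An off-topic answer.")

# Hypothetical scalar scores for `pair`; real values come from the models.
rm_loss = reward_model_loss(reward_chosen=1.2, reward_rejected=-0.3)
policy_loss = dpo_loss(policy_logp_chosen=-12.0, policy_logp_rejected=-15.0,
                       ref_logp_chosen=-13.0, ref_logp_rejected=-14.0)
```

In the RLHF path the pair never touches the policy directly: it only shapes the reward model, and PPO later samples new outputs for that model to score. In the DPO path the pair is the policy's training example.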
PPO's on-policy nature makes it slower in wall-clock time despite theoretically faster convergence: generating outputs mid-training is expensive, and every update has to wait for fresh samples.
DPO's off-policy weakness: a 3-7% out-of-domain performance drop compared to RLHF. Because it trains on fixed data, the model can't explore regions the dataset didn't cover.
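The coverage point can be illustrated with a toy sketch. It assumes a three-action policy and a fixed dataset that happens to omit one action; all names and numbers are hypothetical and say nothing about the size of the real-world gap.

```python
import random

def covered_actions(dataset):
    """Actions a fixed dataset can ever provide a learning signal for."""
    return {action for action, _reward in dataset}

def explored_actions(policy, steps, rng):
    """Actions reached by repeatedly sampling from a policy."""
    actions = list(policy)
    weights = [policy[a] for a in actions]
    return {rng.choices(actions, weights=weights, k=1)[0] for _ in range(steps)}

policy = {"a": 0.4, "b": 0.4, "c": 0.2}
fixed_data = [("a", 1.0), ("b", -1.0)]  # "c" was never collected

offline_seen = covered_actions(fixed_data)
online_seen = explored_actions(policy, 200, random.Random(0))
```

However many epochs run over `fixed_data`, action `"c"` never receives feedback, whereas sampling from the policy reaches it as long as the policy assigns it non-zero probability.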