ML//Training//on-policy vs off-policy

On-policy: the model generates data during training, then learns from its own outputs. PPO is on-policy: each update uses outputs that the current policy just produced, scored by a reward model.
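
A minimal sketch of why PPO depends on freshly sampled outputs: its clipped surrogate objective compares the log-probability of a sampled output under the current policy with its log-probability under the policy that generated it. The function name `ppo_clipped_objective`, the scalar inputs, and the default `clip_eps=0.2` are assumptions for illustration; a real implementation works on batched per-token tensors.

```python
import math

def ppo_clipped_objective(logp_new, logp_old, advantage, clip_eps=0.2):
    """Per-sample PPO clipped surrogate objective (to be maximized)."""
    # Probability ratio between the current policy and the sampling policy
    ratio = math.exp(logp_new - logp_old)
    # Keep the ratio inside [1 - clip_eps, 1 + clip_eps]
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    # Take the more pessimistic of the unclipped and clipped terms
    return min(ratio * advantage, clipped_ratio * advantage)
```

The clip stops rewarding moves once the ratio drifts far from 1, which is one reason the samples need to be regenerated from a recent policy rather than reused indefinitely.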

Off-policy: the model learns from data it did not generate itself, here a fixed dataset of pre-collected examples. DPO is off-policy in this sense: it trains on preference pairs (prompt, chosen, rejected) that were generated beforehand.
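
To make the DPO side concrete, here is a small sketch of the DPO loss for a single preference pair. It assumes the summed log-probabilities of the chosen and rejected responses are already computed under both the trained policy and a frozen reference model. The function name `dpo_loss`, the argument names, and `beta=0.1` are illustrative choices, not taken from any particular library.

```python
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (prompt, chosen, rejected) pair."""
    # How much more (or less) likely each response is under the policy
    # than under the frozen reference model, in log space
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)), written as softplus(-margin) for stability
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

Every input is a log-probability of a stored response, so nothing has to be sampled from the model while training.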

On-policy advantage: the model always trains on data from its current distribution — no stale examples.

Off-policy advantage: no generation during training = much faster per step, data reusable across runs.
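
The two loop shapes can be contrasted on a toy problem. This sketch uses a two-action bandit with a one-parameter policy. The on-policy loop uses plain REINFORCE (not PPO), and the off-policy loop takes reward-weighted log-likelihood steps on a stored dataset (not DPO). All names here (`train_on_policy`, `collect_dataset`, `train_off_policy`, `reward_fn`) are hypothetical. The `generations` counter only counts sampling calls; it does not measure wall-clock time.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def reward_fn(action):
    # Toy reward: action 1 is "good", action 0 is "bad"
    return float(action)

def train_on_policy(steps=2000, lr=0.1, seed=0):
    rng = random.Random(seed)
    theta, generations = 0.0, 0  # P(action=1) = sigmoid(theta)
    for _ in range(steps):
        p = sigmoid(theta)
        action = 1 if rng.random() < p else 0  # sampled from the CURRENT policy
        generations += 1
        # REINFORCE step; for a Bernoulli policy, d/dtheta log pi(a) = a - p
        theta += lr * reward_fn(action) * (action - p)
    return sigmoid(theta), generations

def collect_dataset(n=200, seed=1):
    # Collected once, ahead of training, from a fixed uniform behaviour policy
    rng = random.Random(seed)
    return [(a, reward_fn(a)) for a in (rng.randint(0, 1) for _ in range(n))]

def train_off_policy(dataset, epochs=10, lr=0.1):
    theta, generations = 0.0, 0  # the model is never sampled during training
    for _ in range(epochs):
        for action, reward in dataset:
            p = sigmoid(theta)
            # Reward-weighted log-likelihood step on stored data
            theta += lr * reward * (action - p)
    return sigmoid(theta), generations
```

In this toy, `train_on_policy()` makes one sampling call per update, while `train_off_policy(collect_dataset())` makes none and can be rerun on the same dataset.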

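The data format is algorithm-agnostic: the same (prompt, chosen, rejected) pairs can feed both RLHF/PPO and DPO. In RLHF the pairs train the reward model, which then scores the policy's fresh outputs during PPO; in DPO the pairs enter the loss directly. The difference is when and how the data is generated and consumed.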
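
As a sketch of the RLHF route for such a pair: the usual consumer is a reward model trained with a Bradley-Terry style pairwise loss, and that reward model later scores the policy's own samples. The record contents, the field names, and the length-based `toy_reward_model` are placeholders invented for illustration; a real reward model is a learned network.

```python
import math

# Hypothetical preference record mirroring the (prompt, chosen, rejected) triple
pair = {
    "prompt": "Explain overfitting.",
    "chosen": "The model memorizes training noise and generalizes poorly.",
    "rejected": "It is when training is bad.",
}

def toy_reward_model(prompt, response, weight=0.05):
    # Placeholder scorer (an assumption): longer responses get higher scores
    return weight * len(response)

def pairwise_reward_loss(reward_chosen, reward_rejected):
    """Bradley-Terry style loss: -log sigmoid(r_chosen - r_rejected)."""
    diff = reward_chosen - reward_rejected
    # softplus(-diff), computed in a numerically stable form
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))

r_chosen = toy_reward_model(pair["prompt"], pair["chosen"])
r_rejected = toy_reward_model(pair["prompt"], pair["rejected"])
loss = pairwise_reward_loss(r_chosen, r_rejected)
```

DPO would read the same `pair` record but skip the reward model, plugging the policy's and reference model's log-probabilities of `chosen` and `rejected` straight into its loss.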

PPO's on-policy nature makes it slower in wall-clock time, even if in theory it needs fewer updates to converge: generating outputs mid-training is expensive.

DPO's off-policy weakness: a reported 3-7% performance drop out-of-domain compared to RLHF. Training on fixed data means the model can't explore regions the dataset didn't cover.