ML//Training//PPO
- Known as --Proximal Policy Optimisation--.
ROBOTICS
Learns through exploration; "proximal" means each update is kept close to the previous policy (small, clipped steps).
Policy learns in a physical environment, typically simulated (e.g. NVIDIA Omniverse).
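Sketch of what a "proximal" step means, assuming the standard clipped-surrogate form of PPO. The names clipped_surrogate and clip_eps, and the 0.2 default, are illustrative choices, not from these notes.

```python
import math

def clipped_surrogate(logp_new, logp_old, advantage, clip_eps=0.2):
    """PPO clipped objective for a single action (a quantity to maximise)."""
    # Probability ratio between the updated policy and the one that acted.
    ratio = math.exp(logp_new - logp_old)
    # Keep the ratio inside [1 - clip_eps, 1 + clip_eps].
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    # Taking the minimum removes any incentive to move far from the old policy.
    return min(ratio * advantage, clipped_ratio * advantage)

# Ratio 1.5 with a positive advantage: the gain is capped at (1 + clip_eps) * advantage.
capped = clipped_surrogate(math.log(1.5), 0.0, advantage=2.0)
# Ratio 1.1 lies inside the band, so nothing is clipped.
inside = clipped_surrogate(math.log(1.1), 0.0, advantage=2.0)
# Ratio 0.5 with a negative advantage: the clipped (more pessimistic) term wins.
pessimistic = clipped_surrogate(math.log(0.5), 0.0, advantage=-2.0)
```

Once the ratio leaves the band in the direction the advantage favours, the objective goes flat, so the gradient stops pushing the policy further away.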
LLMs
The environment here becomes an RM trained via RLHF. In DPO, by contrast, the RLHF preference data is used to [d]irectly update the weights (policy), with no RM in between.
On-policy: generates outputs during training and scores them in real time via an RM, so no contrastive pairs are needed upfront.
Presumes holistic properties require sequence-level scoring — false, IMO.
Faster than DPO in theory (no humans clicking or robots reasoning beforehand), yet its on-policy behavior (generating and scoring during training) actually makes it slower.
On top of what DPO already uses (the LLM actor and a frozen reference model for the KL divergence), adds two models: the reward model, and a critic that reduces variance by predicting scores.
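Sketch of how the four models' outputs could combine into one training signal, assuming a sequence-level KL penalty and a plain critic baseline. The function names, beta=0.1 and all numbers are made up for illustration; real implementations differ in detail.

```python
def kl_penalised_reward(rm_score, logp_actor, logp_ref, beta=0.1):
    """RM score minus a penalty for drifting away from the reference model."""
    # Sampled estimate of KL(actor || reference) over the generated tokens.
    kl_estimate = sum(a - r for a, r in zip(logp_actor, logp_ref))
    return rm_score - beta * kl_estimate

def advantage(reward, critic_value):
    """Critic baseline: keep only the part of the reward the critic did not predict."""
    return reward - critic_value

logp_actor = [-1.0, -0.5, -2.0]  # per-token log-probs under the actor (hypothetical)
logp_ref = [-1.5, -1.0, -2.0]    # same tokens under the frozen reference (hypothetical)

reward = kl_penalised_reward(rm_score=1.0, logp_actor=logp_actor, logp_ref=logp_ref)
adv = advantage(reward, critic_value=0.4)
```

Subtracting the critic's prediction leaves the expected gradient unchanged but shrinks its variance, which is the critic's role named above.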
The SGD reward signal derives from a sequence-level scalar, not from token-level logit derivatives. This is the credit assignment problem: every token gets the same offset.
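Sketch of the "same offset" effect, assuming the simplest case where one sequence-level advantage is broadcast to every token in a REINFORCE-style loss. Implementations that compute per-token advantages from the critic (e.g. with GAE) soften this. The names and numbers are illustrative.

```python
def per_token_weights(num_tokens, sequence_advantage):
    """Broadcast one sequence-level advantage to every generated token."""
    return [sequence_advantage] * num_tokens

def policy_gradient_loss(token_logps, weights):
    """REINFORCE-style loss: each token's log-prob scaled by its weight."""
    return -sum(w * lp for w, lp in zip(weights, token_logps))

token_logps = [-0.2, -1.3, -0.7, -2.1]  # hypothetical per-token log-probs
weights = per_token_weights(len(token_logps), sequence_advantage=0.5)
loss = policy_gradient_loss(token_logps, weights)
```

Every entry of weights is identical, so the update cannot tell the token that earned the reward from the ones that merely came along.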
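Loosely considered RLAIF, since the scores during training come from a model rather than from humans. In standard usage, though, PPO against a reward model trained on human preference data is still RLHF; RLAIF means the preference labels themselves come from an AI.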