ML//Training//PPO

2026-02-15

- Known as --Proximal Policy Optimisation--.

ROBOTICS

Learns through exploration. "Proximal" means each update is kept close to the previous policy (small, clipped steps).
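A minimal sketch of the "proximal" part, assuming the standard clipped surrogate objective from the PPO literature. The function name and the default `eps=0.2` are illustrative choices, not taken from these notes.

```python
import math

def ppo_clip_objective(logp_new, logp_old, advantage, eps=0.2):
    """Clipped surrogate objective for a single action (to be maximised).

    logp_new / logp_old: log-probability of the action under the updated
    and the data-collecting policy. eps is an assumed clip range.
    """
    ratio = math.exp(logp_new - logp_old)            # pi_new(a|s) / pi_old(a|s)
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))  # keep the ratio near 1
    # Taking the minimum removes any incentive to move the ratio
    # further outside [1 - eps, 1 + eps] in the direction the advantage favours.
    return min(ratio * advantage, clipped * advantage)
```

With a positive advantage, pushing the ratio beyond `1 + eps` yields no extra objective, so the update stays close to the old policy.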

Policy learns in a physical or physics-simulated environment (e.g. NVIDIA Omniverse for simulation).

LLMs

Here the environment becomes a reward model (RM) trained on human preference data, as in RLHF. In DPO, by contrast, the preference data is used to [d]irectly update the weights (policy).

On-policy: generates outputs during training and scores them in real time via the RM. No contrastive pairs needed upfront for the policy update itself.
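A rough sketch of the on-policy data collection step. `sample` and `reward_model` are hypothetical callables standing in for the LLM being trained and the RM; no real library API is implied.

```python
def rollout_batch(sample, reward_model, prompts):
    """Collect (prompt, response, score) triples from the current policy.

    sample(prompt) -> str: hypothetical generator backed by the policy
        that is being trained right now (hence "on-policy").
    reward_model(prompt, response) -> float: hypothetical RM returning
        one scalar for the whole sequence.
    """
    batch = []
    for prompt in prompts:
        response = sample(prompt)               # generated during training
        score = reward_model(prompt, response)  # scored immediately
        batch.append((prompt, response, score))
    return batch
```

The only input is a list of prompts. A DPO dataset would instead need a chosen and a rejected response per prompt before training starts.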

Presumes holistic properties require sequence-level scoring. False, IMO.

Faster than DPO in theory (no humans clicking or robots reasoning beforehand), yet its on-policy behavior (generating and scoring during training) actually makes it slower.

On top of what DPO uses (the LLM actor and a frozen reference model for the KL divergence term), adds: the reward model and a critic that reduces variance by predicting scores.
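One common way the reference model and the reward model are combined, sketched under assumptions: a per-token KL-style penalty between actor and reference, with the RM score added at the final token. The function name, `beta=0.1`, and the choice to place the score on the last token are illustrative, not from these notes.

```python
def shaped_rewards(rm_score, logp_actor, logp_ref, beta=0.1):
    """Per-token rewards: KL-style penalty everywhere, RM score at the end.

    logp_actor / logp_ref: per-token log-probs of the sampled tokens under
    the actor and the frozen reference model (same length, non-empty).
    beta: assumed penalty coefficient.
    """
    if not logp_actor or len(logp_actor) != len(logp_ref):
        raise ValueError("need two non-empty log-prob lists of equal length")
    # Penalise tokens where the actor has drifted away from the reference.
    rewards = [-beta * (a - r) for a, r in zip(logp_actor, logp_ref)]
    # The RM gives one scalar per sequence; attach it to the last token.
    rewards[-1] += rm_score
    return rewards
```

The critic is then trained to predict the discounted sum of these rewards from each token position onward.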

The reward signal behind the SGD update derives from a sequence-level scalar, not from a token-level loss on the logits. Credit assignment problem: without per-token estimates from the critic, every token gets the same offset.
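To make the credit assignment point concrete, a small sketch contrasting a broadcast scalar with per-token advantages from generalised advantage estimation (GAE), which is what a critic makes possible. Function names and the defaults `gamma=1.0`, `lam=0.95` are assumptions for illustration.

```python
def broadcast_advantages(rm_score, baseline, n_tokens):
    """No critic: every token receives the same offset."""
    return [rm_score - baseline] * n_tokens

def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """Per-token advantages from critic values (GAE).

    rewards: per-token rewards (often zero except at the last token).
    values: critic's predicted return at each token position.
    The value after the final token is taken to be 0.
    """
    advantages = [0.0] * len(rewards)
    next_value, running = 0.0, 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]  # TD error
        running = delta + gamma * lam * running              # discounted sum of TD errors
        advantages[t] = running
        next_value = values[t]
    return advantages
```

With `gamma=1` and `lam=1` each advantage reduces to the final return minus the critic's prediction at that token, so tokens the critic already expected to lead to a high score receive a smaller push.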

Considered RLHF when the reward model is trained on human preference labels, and RLAIF when those labels come from an AI judge instead.