ML//Training//RLHF

Human annotators choose between two candidate responses, Bx and By, generated for the same prompt using different sampling temperatures or decoding strategies.


The resulting preference pairs (e.g. By > Bx) either train the policy directly with DPO or train a reward model (RM) whose scores then drive PPO.
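A minimal sketch of the two losses a preference pair can feed, on scalar inputs for readability. The function names and the `beta` default are illustrative assumptions; real implementations work on batched tensors of summed token log-probabilities.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def reward_model_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry pairwise loss for the RM route (RM, then PPO).

    r_chosen / r_rejected are the scalar scores the RM assigns to the
    preferred and dispreferred response for the same prompt.
    """
    return -log_sigmoid(r_chosen - r_rejected)


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # assumed default, tune per setup
) -> float:
    """DPO loss for the direct route: no separate RM is trained.

    Each argument is the log-probability of a full response under the
    trainable policy or the frozen reference model.
    """
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    return -log_sigmoid(beta * (chosen_margin - rejected_margin))
```

Both losses have the same shape, a negative log-sigmoid of a margin between chosen and rejected. The difference is what the margin measures: RM scores in one case, policy-versus-reference log-ratios in the other.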

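The data format is algorithm-agnostic: the same (prompt, chosen, rejected) triplets serve any preference-based method. DPO consumes them directly, while PPO and GRPO typically consume them indirectly, through a reward model trained on them.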
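One way to represent the triplet, sketched with hypothetical names (`PreferencePair`, `from_comparison`); the field names mirror the (prompt, chosen, rejected) convention in the note and are not tied to any specific library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    prompt: str
    chosen: str    # response the annotator preferred
    rejected: str  # response the annotator did not prefer


def from_comparison(
    prompt: str, response_x: str, response_y: str, preferred: str
) -> PreferencePair:
    """Turn one human comparison of Bx vs By into a triplet.

    `preferred` is "x" or "y", i.e. which response the annotator picked.
    """
    if preferred not in ("x", "y"):
        raise ValueError("preferred must be 'x' or 'y'")
    if preferred == "x":
        return PreferencePair(prompt, chosen=response_x, rejected=response_y)
    return PreferencePair(prompt, chosen=response_y, rejected=response_x)


# Example: the annotator preferred By, so By > Bx.
pair = from_comparison(
    "Explain RLHF in one line.", "Bx text", "By text", preferred="y"
)
```

Nothing in the record says which algorithm will consume it, which is why the same dataset can be reused across methods.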

InstructGPT (the precursor to the GPT-3.5 series) was the inflection point: it took GPT-3 and added SFT + RLHF, creating the first "assistant"-style model.

The feedback loop works only as long as humans can still reliably judge which output is better.

Safe RLHF splits the annotation: humans label helpfulness and harmlessness separately, a separate model is trained on each (a reward model and a cost model), and the tradeoff between the two is controllable.
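A simplified sketch of how such a tradeoff could be controlled, assuming a Lagrangian-style combination of the two model outputs. The function names, the learning rate, and the cost limit are illustrative assumptions, not the exact Safe RLHF procedure.

```python
def combined_signal(helpfulness: float, cost: float, lam: float) -> float:
    """Training signal for the policy: reward minus weighted harm cost.

    helpfulness comes from the reward model and cost from the separate
    harmlessness (cost) model; lam >= 0 sets how much harm is penalized.
    """
    return helpfulness - lam * cost


def update_multiplier(
    lam: float,
    mean_cost: float,
    cost_limit: float = 0.0,  # assumed threshold for acceptable cost
    lr: float = 0.1,          # assumed step size
) -> float:
    """Dual-ascent style update: raise lam when average cost exceeds the
    limit, lower it otherwise, and never let it go below zero."""
    return max(0.0, lam + lr * (mean_cost - cost_limit))
```

Keeping the two scores separate is what makes the tradeoff adjustable: `lam` can be changed without relabeling the data or retraining either model.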