ML//Training//RLHF
Humans choose between two responses to the same prompt, Bx and By, generated using different temperatures or sampling strategies.
The preference pairs (By > Bx) either train the policy directly with DPO or train a reward model (RM) that PPO then optimizes against.
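The two routes differ mainly in which loss the pair feeds. A minimal sketch, assuming rewards and sequence log-probabilities have already been computed as plain floats; the function names and the `beta` default are illustrative and not taken from any library.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def reward_model_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry pairwise loss for the RM route:
    -log sigmoid(r_chosen - r_rejected)."""
    return -log_sigmoid(r_chosen - r_rejected)


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # assumed value, tune per setup
) -> float:
    """DPO loss for one pair: the policy/reference log-ratios play the
    role of an implicit reward, so no separate RM is trained."""
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    return -log_sigmoid(beta * (chosen_logratio - rejected_logratio))
```

Both losses shrink as the chosen response is scored further above the rejected one; the difference is whether that score comes from a separate RM or from the policy itself.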
The data format is algorithm-agnostic. The same (prompt, chosen, rejected) triplets serve any preference-based method: DPO consumes them directly, while PPO and GRPO use them through a reward model trained on them.
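A minimal sketch of the triplet as a data structure. The class name, helper names, and dictionary keys are assumptions for illustration; real training libraries may expect different field names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferenceTriplet:
    prompt: str
    chosen: str    # the response the annotator preferred
    rejected: str  # the response the annotator did not prefer


def to_dpo_record(t: PreferenceTriplet) -> dict:
    """Flat record for a DPO-style trainer (key names are hypothetical)."""
    return {"prompt": t.prompt, "chosen": t.chosen, "rejected": t.rejected}


def to_rm_pair(t: PreferenceTriplet) -> tuple:
    """Two full texts for reward-model training; the RM should learn to
    score the first higher than the second."""
    return (t.prompt + t.chosen, t.prompt + t.rejected)


triplet = PreferenceTriplet(
    prompt="Explain RLHF in one sentence. ",
    chosen="RLHF fine-tunes a model using human preference judgments.",
    rejected="RLHF is a kind of database.",
)
```

The same `triplet` object can be passed to either helper, which is the sense in which the format does not depend on the algorithm.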
InstructGPT (the line that led to GPT-3.5) was the inflection point: it took GPT-3 and added SFT + RLHF, producing what is often described as the first "assistant"-style model.
The feedback loop only works as long as humans can still judge which output is better.
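Safe RLHF splits the annotation: humans label helpfulness and harmlessness separately, a separate model is trained on each set of labels (a reward model for helpfulness, a cost model for harmlessness), and the tradeoff between the two is controllable.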
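One way to make that tradeoff controllable is a Lagrange multiplier, as in constrained-RL formulations. A simplified sketch under that assumption; the function names, `cost_limit`, and `lr` are hypothetical, and this is not the exact objective from the Safe RLHF paper.

```python
def lagrangian_objective(reward: float, cost: float, lam: float) -> float:
    """Score the policy tries to maximize: helpfulness reward minus the
    harmlessness cost, weighted by the multiplier lam."""
    return reward - lam * cost


def update_multiplier(
    lam: float,
    mean_cost: float,
    cost_limit: float = 0.0,  # assumed safety threshold
    lr: float = 0.1,          # assumed step size
) -> float:
    """Raise lam when the average cost exceeds the limit, lower it
    otherwise, and never let it go below zero."""
    return max(0.0, lam + lr * (mean_cost - cost_limit))
```

When responses are too harmful on average, `lam` grows and the cost term dominates; when they are comfortably within the limit, `lam` decays toward zero and the objective reduces to plain reward maximization.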