ML//Training//RLHF

2026-02-15

Human annotators choose between two responses to the same prompt, Bx and By, generated using different temperatures or sampling strategies.

The preference pairs (By > Bx) either train the policy directly with DPO or train a reward model (RM) that PPO then optimizes against.
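The two routes can be compared side by side on a single pair. The sketch below is illustrative only: it assumes scalar rewards and summed sequence log-probabilities are already computed, and the function names and the default `beta` are placeholders rather than any library's API. The RM route uses the standard Bradley-Terry pairwise loss; the DPO route applies the same log-sigmoid form to log-probability ratios against a frozen reference model.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def reward_model_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry pairwise loss: push r_chosen above r_rejected."""
    return -log_sigmoid(r_chosen - r_rejected)


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # assumed value; controls how far the policy may drift
) -> float:
    """DPO loss on one pair, using log-prob ratios vs. a frozen reference."""
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    return -log_sigmoid(beta * (chosen_logratio - rejected_logratio))
```

Both losses equal log 2 when the model expresses no preference (equal rewards, or a policy identical to the reference) and shrink as the margin in favor of the chosen response grows.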

The data format is algorithm-agnostic: the same (prompt, chosen, rejected) triplets serve any preference-based method. DPO consumes them directly; PPO and GRPO consume them through a reward model trained on them.
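A minimal sketch of that triplet as a data structure, assuming the annotator's verdict arrives as the string "x" or "y". The class name `PreferencePair` and the helper `to_pair` are hypothetical, chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    """One (prompt, chosen, rejected) triplet."""
    prompt: str
    chosen: str
    rejected: str


def to_pair(prompt: str, response_x: str, response_y: str, preferred: str) -> PreferencePair:
    """Turn one annotator verdict ('x' or 'y') into a triplet."""
    if preferred == "y":
        return PreferencePair(prompt, chosen=response_y, rejected=response_x)
    if preferred == "x":
        return PreferencePair(prompt, chosen=response_x, rejected=response_y)
    raise ValueError(f"preferred must be 'x' or 'y', got {preferred!r}")


# Example: the annotator preferred By over Bx.
pair = to_pair("Explain RLHF briefly.", "Bx text", "By text", preferred="y")
```

Nothing in the record names a training algorithm, which is what lets the same dataset be reused across methods.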

InstructGPT (2022) was the inflection point. It took GPT-3 and added SFT + RLHF, producing an instruction-following model that was the direct precursor to the GPT-3.5-era "assistant" models.

The feedback loop works only as long as humans can still reliably judge which output is better.

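Safe RLHF splits the annotation: humans score helpfulness and harmlessness separately, each set of labels trains its own model (a reward model for helpfulness, a cost model for harmlessness), and the tradeoff between the two is controllable.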
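One common way to make such a tradeoff controllable is a Lagrange multiplier on the harmlessness cost. The sketch below is a simplified illustration under that assumption, not the Safe RLHF implementation: the function names, the cost threshold, and the learning rate are all hypothetical.

```python
def combined_score(helpfulness: float, cost: float, lam: float) -> float:
    """Score the policy optimizes: helpfulness reward minus weighted harm cost."""
    return helpfulness - lam * cost


def update_lambda(lam: float, mean_cost: float, threshold: float = 0.0, lr: float = 0.1) -> float:
    """Raise lambda while average cost exceeds the threshold, lower it otherwise.

    Lambda is clipped at zero so the cost can never act as a bonus.
    """
    return max(0.0, lam + lr * (mean_cost - threshold))
```

With `lam` fixed, the tradeoff is set by hand; with `update_lambda` run between policy updates, the weight on harmlessness grows automatically whenever the policy's outputs become too costly.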