ML//Training//Safe RLHF
Decouples helpfulness and harmlessness into **separate annotation streams** — human annotators score each axis independently.
Standard RLHF conflates both: "is this response good?" becomes ambiguous when a helpful response is slightly unsafe or a safe response is useless.
Two separate reward models — one for helpfulness, one for harmlessness — trained on their own preference data.
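A minimal sketch of what "trained on their own preference data" can look like, assuming each reward model is fit with a Bradley-Terry style pairwise loss (a common choice for preference data; the note itself does not say which loss is used). The function name and the example scores are hypothetical.

```python
import math

def pairwise_preference_loss(score_preferred: float, score_rejected: float) -> float:
    """Bradley-Terry style loss: -log(sigmoid(score_preferred - score_rejected))."""
    margin = score_preferred - score_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch on sign for numerical stability.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# One comparison, labelled independently on each axis (hypothetical scores).
# Annotators may prefer response A for helpfulness but response B for harmlessness,
# so each reward model sees its own (preferred, rejected) ordering.
helpfulness_loss = pairwise_preference_loss(score_preferred=1.2, score_rejected=0.3)
harmlessness_loss = pairwise_preference_loss(score_preferred=0.1, score_rejected=0.9)
```

The second call yields the larger loss, because that model currently ranks the pair opposite to its label; the two losses never mix, which is the point of keeping the streams separate.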
During PPO, both rewards are combined with a controllable tradeoff coefficient: you can dial safety up or down.
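As an illustration of the tradeoff coefficient, here is a sketch that assumes the two scores are combined as a simple weighted sum. The note only says the rewards are "combined", so the exact rule, the name `safety_weight`, and the example scores are all assumptions.

```python
def combined_reward(helpfulness: float, harmlessness: float, safety_weight: float) -> float:
    """Scalar reward passed to the policy update (assumed form: weighted sum)."""
    if safety_weight < 0:
        raise ValueError("safety_weight must be non-negative")
    return helpfulness + safety_weight * harmlessness

# Hypothetical scores: A is helpful but somewhat unsafe, B is safe but less helpful.
response_a = {"helpfulness": 2.0, "harmlessness": -1.5}
response_b = {"helpfulness": 0.5, "harmlessness": 1.0}

def prefers_a(safety_weight: float) -> bool:
    """True if response A gets the higher combined reward at this weight."""
    return (combined_reward(safety_weight=safety_weight, **response_a)
            > combined_reward(safety_weight=safety_weight, **response_b))
```

With these numbers, a low weight such as `0.2` ranks response A higher, while a high weight such as `2.0` ranks response B higher. That flip is the "dial" in practice.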
Key insight: the Pareto frontier between helpfulness and harmlessness is NOT a fixed tradeoff — with better data and separate optimization, you can push both upward simultaneously.