ML//Training//Safe RLHF

Safe RLHF decouples helpfulness and harmlessness into **separate annotation streams**: human annotators score each axis independently.


Standard RLHF conflates both: "is this response good?" becomes ambiguous when a helpful response is slightly unsafe or a safe response is useless.

Two separate reward models are trained, one for helpfulness and one for harmlessness, each on its own preference data.
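A minimal sketch of what "trained on their own preference data" can look like, assuming the pairwise (Bradley-Terry style) loss commonly used for RLHF reward models. The note does not say which loss is used, so treat that as an assumption. The function names and the example scores are hypothetical.

```python
import math


def pairwise_preference_loss(score_preferred, score_rejected):
    """Return -log(sigmoid(score_preferred - score_rejected)).

    Assumption: a Bradley-Terry style pairwise loss; the note itself
    does not specify the training objective.
    """
    margin = score_preferred - score_rejected
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def mean_loss(score_pairs):
    """Average the pairwise loss over (preferred, rejected) score pairs."""
    return sum(pairwise_preference_loss(p, r) for p, r in score_pairs) / len(score_pairs)


# Hypothetical model scores for each annotation stream. Each stream has its
# own labels, so each reward model gets its own loss.
helpfulness_pairs = [(1.2, 0.3), (0.1, 0.4)]
harmlessness_pairs = [(2.0, -1.0), (0.5, 0.0)]

helpfulness_loss = mean_loss(helpfulness_pairs)
harmlessness_loss = mean_loss(harmlessness_pairs)
```

The point of the sketch is that the two losses never mix: a response pair can be ranked one way for helpfulness and the opposite way for harmlessness without the labels contradicting each other.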

During PPO, both rewards are combined with a controllable tradeoff coefficient: you can dial safety up or down.
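The tradeoff can be illustrated with a small sketch. It assumes the two scores are combined as a weighted sum and that a higher harmlessness score means a safer response; the note does not give the exact combination rule, so `combined_reward`, `safety_weight`, and the example scores are all hypothetical.

```python
def combined_reward(helpfulness, harmlessness, safety_weight):
    """Scalar reward passed to the policy optimizer.

    Assumption: a plain weighted sum, with harmlessness oriented so that
    higher means safer. safety_weight is the "dial" for safety.
    """
    if safety_weight < 0:
        raise ValueError("safety_weight must be non-negative")
    return helpfulness + safety_weight * harmlessness


# Hypothetical (helpfulness, harmlessness) scores for two responses.
useful_but_risky = (2.0, -1.5)
safe_but_bland = (0.5, 1.0)

# Low safety weight: the useful-but-risky response scores higher.
low_dial = [combined_reward(h, s, 0.5) for h, s in (useful_but_risky, safe_but_bland)]

# High safety weight: the safe-but-bland response scores higher.
high_dial = [combined_reward(h, s, 2.0) for h, s in (useful_but_risky, safe_but_bland)]
```

With the weight at 0.5 the risky response wins (1.25 vs 1.0); at 2.0 the ordering flips (-1.0 vs 2.5), which is the "dial" in action.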

Key insight: the Pareto frontier between helpfulness and harmlessness is NOT a fixed tradeoff — with better data and separate optimization, you can push both upward simultaneously.
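To make "pushing the frontier" concrete, here is a small sketch that extracts the Pareto front from a set of (helpfulness, harmlessness) scores and checks whether a second set improves on it. The helper names and all scores are hypothetical and only illustrate the geometry, not measured results.

```python
def pareto_front(points):
    """Return the points not dominated by any other point.

    A point q dominates p when q is at least as good on both axes
    and is not the same point.
    """
    front = []
    for p in points:
        dominated = any(
            q[0] >= p[0] and q[1] >= p[1] and q != p for q in points
        )
        if not dominated:
            front.append(p)
    return front


def frontier_improved(old_points, new_points):
    """True if every point on the old front is matched or beaten on both
    axes by some point in new_points."""
    return all(
        any(n[0] >= o[0] and n[1] >= o[1] for n in new_points)
        for o in pareto_front(old_points)
    )


# Hypothetical (helpfulness, harmlessness) scores for model checkpoints.
baseline = [(1, 3), (2, 2), (3, 1), (1, 1)]
better_data = [(2, 4), (3, 3), (4, 2)]

baseline_front = pareto_front(baseline)              # (1, 1) is dominated by (2, 2)
shifted = frontier_improved(baseline, better_data)   # every baseline front point is beaten
```

Moving along a single front trades one axis for the other; the claim in the note is about the whole front shifting outward, which is what `frontier_improved` tests for.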