ML//Training//Safe RLHF

2026-03-01

Decouples helpfulness and harmlessness into **separate annotation streams**: human annotators score each axis independently.

Standard RLHF conflates both: "is this response good?" becomes ambiguous when a helpful response is slightly unsafe or a safe response is useless.

Two separate reward models, one for helpfulness and one for harmlessness, are each trained on their own preference data.
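A minimal sketch of why the two streams matter, assuming each model is trained with the usual pairwise (Bradley-Terry) preference loss. The function name and the example scores are hypothetical; the point is that the same response pair can carry opposite labels on the two axes, so each model gets its own training signal.

```python
import math


def pairwise_preference_loss(score_preferred: float, score_rejected: float) -> float:
    """Bradley-Terry negative log-likelihood: -log(sigmoid(preferred - rejected))."""
    margin = score_preferred - score_rejected
    # Numerically stable form of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical pair: response A is helpful but slightly unsafe,
# response B is safe but nearly useless.
# The helpfulness annotator prefers A; the harmlessness annotator prefers B.
helpfulness_scores = {"A": 1.5, "B": -0.5}   # made-up model outputs
harmlessness_scores = {"A": -1.0, "B": 2.0}  # made-up model outputs

helpfulness_loss = pairwise_preference_loss(
    helpfulness_scores["A"], helpfulness_scores["B"]
)
harmlessness_loss = pairwise_preference_loss(
    harmlessness_scores["B"], harmlessness_scores["A"]
)
print(f"helpfulness loss:  {helpfulness_loss:.4f}")
print(f"harmlessness loss: {harmlessness_loss:.4f}")
```

A single "which is better?" label would have to pick A or B and discard one of these two signals.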

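During PPO, the two rewards are combined through a tradeoff coefficient, so safety can be weighted up or down. In the Safe RLHF paper that coefficient is a Lagrange multiplier adjusted during training to keep expected harm under a constraint, rather than a constant set by hand.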
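A rough illustration of the tradeoff coefficient, assuming a simple weighted sum with a fixed, hand-picked weight. The function name, the normalization by `1 + safety_weight`, and the example scores are all assumptions made for this sketch, not a description of any specific implementation.

```python
def combined_reward(helpfulness: float, harmlessness: float, safety_weight: float) -> float:
    """Blend two reward-model scores into one scalar for the policy update.

    safety_weight = 0 ignores harmlessness; larger values weight it more.
    Dividing by (1 + safety_weight) keeps the scale comparable across weights
    (an assumption of this sketch).
    """
    if safety_weight < 0:
        raise ValueError("safety_weight must be non-negative")
    return (helpfulness + safety_weight * harmlessness) / (1.0 + safety_weight)


# Hypothetical (helpfulness, harmlessness) scores for two candidate responses.
risky = (2.0, -1.0)     # very helpful, somewhat unsafe
cautious = (0.5, 1.0)   # less helpful, safe

for weight in (0.2, 2.0):
    r_risky = combined_reward(*risky, safety_weight=weight)
    r_cautious = combined_reward(*cautious, safety_weight=weight)
    winner = "risky" if r_risky > r_cautious else "cautious"
    print(f"safety_weight={weight}: risky={r_risky:.3f}, cautious={r_cautious:.3f} -> {winner}")
```

With these made-up scores, the low weight favors the risky response and the high weight favors the cautious one, which is the "dial" in action.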

Key insight: the Pareto frontier between helpfulness and harmlessness is NOT a fixed tradeoff. With better data and separate optimization, you can push both upward simultaneously.