ML//Training//reward model
A neural network trained to score AI outputs with a single scalar: "how good is this response?"
Trained on "Bx > By" preference pairs (response Bx preferred over By), labeled by humans (RLHF) or by an AI model (RLAIF).
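This note does not say which loss is used to learn from preference pairs. One common choice is a Bradley-Terry style pairwise loss, sketched below under that assumption. The function name `pairwise_loss` is hypothetical; the inputs are the scalar scores the RM assigns to the preferred and rejected responses.

```python
import math


def pairwise_loss(score_preferred: float, score_rejected: float) -> float:
    """Bradley-Terry style loss: -log(sigmoid(score_preferred - score_rejected)).

    Assumption: this is one common preference loss, not necessarily the one
    any particular reward model was trained with.
    """
    margin = score_preferred - score_rejected
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch on sign to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss is small when the RM already scores the preferred response higher, and grows as the ranking is inverted, so minimizing it pushes the scores toward agreement with the labeled preference.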
Provides the reward signal that PPO optimizes against during training.
The quality ceiling of PPO is the quality ceiling of the reward model — every blind spot becomes a reward hacking target.
Not just a training artifact — a standalone product. You can use an RM at inference time to score and filter outputs, rerank candidates, or detect policy violations without retraining.
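Inference-time use can be sketched as scoring each candidate, optionally dropping those below a threshold, and sorting by score. Everything here is illustrative: `rerank` and `min_score` are hypothetical names, and `toy_score` is a word-overlap stand-in for a real reward model's scoring call.

```python
from typing import Callable, List, Optional, Tuple


def toy_score(prompt: str, response: str) -> float:
    """Stand-in for a reward model: fraction of prompt words found in the response."""
    prompt_words = set(prompt.lower().split())
    response_words = set(response.lower().split())
    if not prompt_words:
        return 0.0
    return len(prompt_words & response_words) / len(prompt_words)


def rerank(
    prompt: str,
    candidates: List[str],
    score_fn: Callable[[str, str], float],
    min_score: Optional[float] = None,
) -> List[Tuple[float, str]]:
    """Score every candidate, filter by min_score if given, return best first."""
    scored = [(score_fn(prompt, c), c) for c in candidates]
    if min_score is not None:
        # Filtering step: discard candidates the scorer rates too low.
        scored = [(s, c) for s, c in scored if s >= min_score]
    # Reranking step: highest score first.
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

Taking the first element of the result gives best-of-n selection; the policy model itself is never updated.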
Safe RLHF trains TWO separate models, a reward model for helpfulness and a cost model for harmlessness, to avoid conflating the two axes into one ambiguous scalar.
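A toy illustration of the ambiguity, not the Safe RLHF algorithm itself: if the two axes are summed into one number (an assumed combination rule), two very different responses can become indistinguishable, whereas separate scores still allow a check on harmlessness alone. All names and values below are made up for the example.

```python
from typing import NamedTuple


class Scores(NamedTuple):
    helpfulness: float
    harmlessness: float


def combined(s: Scores) -> float:
    """Hypothetical single-scalar reward: a plain sum of both axes."""
    return s.helpfulness + s.harmlessness


def acceptable(s: Scores, min_harmlessness: float) -> bool:
    """With separate scores, harmlessness can be checked on its own."""
    return s.harmlessness >= min_harmlessness


# Made-up scores: one very helpful but risky response, one balanced response.
risky = Scores(helpfulness=0.75, harmlessness=0.25)
balanced = Scores(helpfulness=0.5, harmlessness=0.5)
```

Both responses have the same combined score, so a single scalar cannot tell them apart, while a harmlessness threshold such as 0.4 accepts only the balanced one.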