ML//Training//reward model

2026-02-15

A neural network trained to score AI outputs with a single scalar: "how good is this response?"

Trained on "Bx > By" preference pairs (response x preferred over response y), labeled by humans (RLHF) or by an AI model (RLAIF).
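A minimal sketch of how a preference pair can become a training signal. It assumes a Bradley-Terry style pairwise loss, which is a common choice for reward models but is not stated in this note. The function name and the example scores are hypothetical.

```python
import math


def pairwise_preference_loss(score_preferred: float, score_rejected: float) -> float:
    """Assumed Bradley-Terry style loss: -log(sigmoid(score_preferred - score_rejected)).

    The loss is small when the RM scores the preferred response (Bx) well
    above the rejected one (By), and large when the order is reversed.
    """
    margin = score_preferred - score_rejected
    # -log(sigmoid(margin)) equals softplus(-margin); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Hypothetical RM scores for one "Bx > By" pair.
loss_when_rm_agrees = pairwise_preference_loss(2.0, 0.0)
loss_when_rm_disagrees = pairwise_preference_loss(0.0, 2.0)
```

Only the difference between the two scores enters the loss, so the RM's absolute scale is arbitrary. What it learns is the ordering within pairs.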

Used by PPO as the reward signal during RL training.

The quality ceiling of PPO is the quality ceiling of the reward model: every blind spot becomes a reward hacking target.

Not just a training artifact: a standalone product. You can use an RM at inference time to score and filter outputs, rerank candidates, or detect policy violations without retraining.
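A rough sketch of the inference-time uses listed above (scoring, filtering, reranking). All names here are hypothetical, and `toy_score` is a word-overlap stand-in for a real RM forward pass, used only to keep the example self-contained.

```python
from typing import Callable, List, Tuple

ScoreFn = Callable[[str, str], float]


def rerank(prompt: str, candidates: List[str], score_fn: ScoreFn) -> List[Tuple[float, str]]:
    """Score every candidate and return (score, text) pairs, best first."""
    scored = [(score_fn(prompt, text), text) for text in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


def filter_by_threshold(ranked: List[Tuple[float, str]], threshold: float) -> List[str]:
    """Keep only the candidates whose score meets the threshold."""
    return [text for score, text in ranked if score >= threshold]


def toy_score(prompt: str, response: str) -> float:
    """Placeholder scorer (assumption): counts words shared with the prompt."""
    prompt_words = set(prompt.lower().split())
    response_words = set(response.lower().split())
    return float(len(prompt_words & response_words))


prompt = "how do I reverse a list in python"
candidates = [
    "I like turtles",
    "Use reversed() or list slicing to reverse a list in python",
    "A list is a thing",
]
ranked = rerank(prompt, candidates, toy_score)
best_response = ranked[0][1]                   # best-of-n selection
kept = filter_by_threshold(ranked, threshold=2.0)
```

The policy model is untouched. Swapping `toy_score` for a trained RM changes which output is selected without any retraining.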

Safe RLHF trains TWO separate models, a reward model for helpfulness and a cost model for harmlessness, to avoid conflating the two axes into one ambiguous scalar.
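A small illustration of why one scalar is ambiguous. This is not the Safe RLHF training procedure, only a hypothetical selection rule over two separately produced scores. The class, field names, and numbers are all made up for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AxisScores:
    text: str
    helpfulness: float   # higher is more helpful (hypothetical scale)
    harmlessness: float  # higher is safer (hypothetical scale)


def pick_response(candidates: List[AxisScores], min_harmlessness: float) -> Optional[AxisScores]:
    """Return the most helpful candidate that clears the harmlessness bar."""
    safe = [c for c in candidates if c.harmlessness >= min_harmlessness]
    if not safe:
        return None
    return max(safe, key=lambda c: c.helpfulness)


candidates = [
    AxisScores("detailed but risky answer", helpfulness=1.0, harmlessness=0.0),
    AxisScores("polite refusal", helpfulness=0.0, harmlessness=1.0),
    AxisScores("careful, useful answer", helpfulness=0.75, harmlessness=0.75),
]

# Collapsed into one number, the risky answer and the refusal are indistinguishable.
summed = [c.helpfulness + c.harmlessness for c in candidates]

chosen = pick_response(candidates, min_harmlessness=0.5)
```

With the axes kept apart, the harmlessness bar can be moved without touching how helpfulness is ranked.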