ML//Training//reward model

A neural network trained to assign a single scalar score to an AI output: "how good is this response?"


Trained on "Bx > By" preference pairs (response Bx is preferred over response By), labeled by humans in RLHF or by an AI model in RLAIF.
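A common way to turn "Bx > By" pairs into a training objective is a Bradley-Terry style pairwise loss: the RM is penalized whenever it fails to score the preferred response above the rejected one. The note does not say which loss is used, so this is a minimal sketch under that assumption. The function name and the plain-float scores are hypothetical stand-ins for real model outputs.

```python
import math


def pairwise_preference_loss(score_preferred: float, score_rejected: float) -> float:
    """Bradley-Terry style loss for one preference pair (assumed objective).

    Equals -log(sigmoid(score_preferred - score_rejected)): small when the
    preferred response is scored well above the rejected one, large otherwise.
    """
    margin = score_preferred - score_rejected
    # -log(sigmoid(m)) == softplus(-m); this form stays stable for large |m|.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Hypothetical RM scores for a pair where Bx was labeled better than By.
loss_correct_order = pairwise_preference_loss(2.0, 0.0)  # RM agrees with the label
loss_wrong_order = pairwise_preference_loss(0.0, 2.0)    # RM disagrees with the label
```

With these numbers the loss is low when the RM agrees with the label and high when it disagrees, which is the gradient signal that pushes the scores apart in the right direction.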

Used by PPO as the reward signal during policy training.

The quality ceiling of PPO is the quality ceiling of the reward model — every blind spot becomes a reward hacking target.

Not just a training artifact: an RM also works as a standalone product. At inference time it can score and filter outputs, rerank candidates, or detect policy violations, all without retraining the policy model.
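Inference-time use can be sketched as best-of-n selection with an optional score threshold: score every candidate, drop those below the threshold, return the top one. This is an illustration only. `best_of_n`, `min_score`, and `toy_score` are hypothetical names, and `toy_score` is a word-count stand-in for a real RM call.

```python
from typing import Callable, List, Optional


def best_of_n(
    prompt: str,
    candidates: List[str],
    score_fn: Callable[[str, str], float],
    min_score: Optional[float] = None,
) -> Optional[str]:
    """Return the highest-scoring candidate, or None if all are filtered out."""
    scored = [(score_fn(prompt, c), c) for c in candidates]
    if min_score is not None:
        # Filtering step: discard candidates the RM scores below the threshold.
        scored = [(s, c) for s, c in scored if s >= min_score]
    if not scored:
        return None
    # Reranking step: keep the candidate with the highest score.
    return max(scored, key=lambda pair: pair[0])[1]


def toy_score(prompt: str, response: str) -> float:
    """Stand-in for an RM: counts words. A real RM would run a model here."""
    return float(len(response.split()))


picked = best_of_n("q", ["a", "a b c", "a b"], toy_score)
rejected_all = best_of_n("q", ["a", "a b c", "a b"], toy_score, min_score=10.0)
```

The policy that generated the candidates is never updated here; only the RM is queried, which is what makes this usable without retraining.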

Safe RLHF trains two separate models, a reward model for helpfulness and a cost model for harmlessness, to avoid conflating the two axes into one ambiguous scalar.
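The ambiguity of a single scalar is easy to see with two responses whose combined score is identical but whose profiles differ. The sketch below is not Safe RLHF's actual optimization procedure, just an illustration of what separate axes make possible. All names and numbers are hypothetical, and both scores are assumed to be "higher is better".

```python
from typing import Dict, NamedTuple, Optional


class Scores(NamedTuple):
    helpfulness: float
    harmlessness: float


def pick_response(
    candidates: Dict[str, Scores], min_harmlessness: float
) -> Optional[str]:
    """Constrain on harmlessness first, then maximize helpfulness."""
    safe = {
        name: s for name, s in candidates.items()
        if s.harmlessness >= min_harmlessness
    }
    if not safe:
        return None
    return max(safe, key=lambda name: safe[name].helpfulness)


# Hypothetical scores: both responses sum to 1.0 on a single merged scale.
candidates = {
    "helpful_but_risky": Scores(helpfulness=0.75, harmlessness=0.25),
    "balanced": Scores(helpfulness=0.5, harmlessness=0.5),
}
merged = {name: s.helpfulness + s.harmlessness for name, s in candidates.items()}
choice = pick_response(candidates, min_harmlessness=0.4)
```

The merged scalar cannot tell the two responses apart, while the two-axis version can reject the risky one and still rank what remains by helpfulness.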