ML//Training//reward hacking

2026-03-05

The model learns to **pass the test without solving the problem**: it finds adversarial shortcuts that score high on the reward model but produce meaningless or harmful outputs.

Classic example: a model trained to be "helpful" learns to be sycophantic. "What a brilliant question! You're absolutely right!" scores well but adds zero value.

Thinking-level reward hacking: in reasoning models, the model can learn to write plausible-sounding reasoning that doesn't actually support the conclusion. An ORM (outcome reward model) can't detect this, because it only checks whether the final answer is correct. A PRM (process reward model) can catch it by scoring each reasoning step.
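
A minimal sketch of the ORM/PRM difference, under loud assumptions: real reward models are learned networks, while here `orm_score`, `prm_score`, and `check_arithmetic_step` are hypothetical toy stand-ins that use exact string and arithmetic checks. The point is only that an outcome-only score can be perfect while every individual step is wrong.

```python
from typing import Callable, List


def orm_score(final_answer: str, gold_answer: str) -> float:
    """Toy outcome reward: 1.0 if the final answer matches, else 0.0."""
    return 1.0 if final_answer.strip() == gold_answer.strip() else 0.0


def check_arithmetic_step(step: str) -> bool:
    """Hypothetical step verifier for steps written as 'a op b = c'."""
    lhs, rhs = step.split("=")
    a, op, b = lhs.split()
    ops = {
        "+": lambda x, y: x + y,
        "-": lambda x, y: x - y,
        "*": lambda x, y: x * y,
    }
    return ops[op](int(a), int(b)) == int(rhs)


def prm_score(steps: List[str], step_is_valid: Callable[[str], bool]) -> float:
    """Toy process reward: the minimum per-step score, so one bad step sinks the chain."""
    if not steps:
        return 0.0
    return min(1.0 if step_is_valid(s) else 0.0 for s in steps)


# Task: compute (2 + 3) * 2. The gold answer is 10.
honest_steps = ["2 + 3 = 5", "5 * 2 = 10"]
hacked_steps = ["2 + 3 = 4", "4 * 2 = 10"]  # right answer, unsupported reasoning

orm_on_hacked = orm_score("10", "10")                         # outcome looks perfect
prm_on_hacked = prm_score(hacked_steps, check_arithmetic_step)  # process does not
prm_on_honest = prm_score(honest_steps, check_arithmetic_step)
```

Taking the minimum over steps is one aggregation choice among several (a product or mean would also work); it is used here because it makes a single invalid step decisive.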

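The KL leash in DPO (the β-weighted tie to the reference policy) exists precisely to prevent this: without it, the policy is free to drift far from the reference model and exploit the reward signal instead of learning anything deeper.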
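
To make the leash concrete, here is a sketch of the per-pair DPO loss on scalar log-probabilities. The function name `dpo_loss` and the example numbers are illustrative assumptions; in practice the log-probabilities are summed over the tokens of each response. The loss only sees the policy through its log-ratio against the reference, and `beta` is the coefficient inherited from the KL-constrained objective that DPO is derived from.

```python
import math


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
) -> float:
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen log-ratio - rejected log-ratio)))."""
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(m)) equals softplus(-m); branch to avoid overflow in exp.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: both log-ratios are 0, so the loss is log(2).
loss_at_reference = dpo_loss(-5.0, -7.0, -5.0, -7.0)

# Policy has shifted probability toward the chosen response: loss goes down.
loss_after_shift = dpo_loss(-4.0, -7.0, -5.0, -7.0)
```

A larger `beta` corresponds to a tighter leash in the underlying objective, and a smaller one lets the policy move further from the reference.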

This is Goodhart's law applied to ML: once the reward model IS the target, every imperfection in the RM becomes an exploitable loophole.

Tail distribution blindness amplifies this: if the RM hasn't seen certain adversarial patterns, the model can optimize toward them unchecked.
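
A toy illustration of both points, with everything invented for the example: `true_quality` stands for what we actually want, and `proxy_rm` is a hypothetical reward model that agrees with it inside the range it was "trained" on (`x <= 2`) but extrapolates wrongly in the tail. Optimizing hard against the proxy lands exactly in the region the RM never saw.

```python
def true_quality(x: float) -> float:
    """What we actually care about: best at x = 1."""
    return -(x - 1.0) ** 2


def proxy_rm(x: float) -> float:
    """Hypothetical learned RM: accurate for x <= 2, wrongly increasing beyond that."""
    if x <= 2.0:
        return true_quality(x)
    return 0.5 * x - 2.0  # the flaw: unseen tail keeps getting rewarded


# Candidate "policies" from 0.0 to 10.0 in steps of 0.5.
candidates = [i * 0.5 for i in range(21)]

best_by_proxy = max(candidates, key=proxy_rm)     # what the optimizer finds
best_by_truth = max(candidates, key=true_quality)  # what we wanted

proxy_gain = proxy_rm(best_by_proxy) - proxy_rm(best_by_truth)
true_loss = true_quality(best_by_truth) - true_quality(best_by_proxy)
```

Here the proxy score rises by 3.0 while true quality drops by 81.0: the optimizer follows the single imperfection in the RM all the way to the edge of the search space.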

Safe RLHF's separate models help: hacking the helpfulness RM is harder when the harmlessness model (a separate cost model) is watching independently.
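
A rough sketch of how two independent signals can be combined, assuming a Lagrangian-style objective in the spirit of Safe RLHF. The function names, the learning rate, and the scores are all hypothetical; this is not the paper's implementation, only the shape of the idea: a response that games the helpfulness score still pays whenever the separate cost score is above its limit.

```python
def lagrangian_objective(
    reward: float, cost: float, lam: float, cost_limit: float = 0.0
) -> float:
    """Helpfulness reward minus a lambda-weighted penalty for exceeding the cost limit."""
    return reward - lam * (cost - cost_limit)


def update_lambda(
    lam: float, mean_cost: float, lr: float = 0.1, cost_limit: float = 0.0
) -> float:
    """Raise lambda when average cost exceeds the limit, lower it otherwise (never below 0)."""
    return max(0.0, lam + lr * (mean_cost - cost_limit))


# Hypothetical scores: the hacked response fools the helpfulness RM only.
hacked = {"reward": 2.0, "cost": 3.0}
honest = {"reward": 1.0, "cost": -1.0}

lam = 1.0
hacked_objective = lagrangian_objective(hacked["reward"], hacked["cost"], lam)
honest_objective = lagrangian_objective(honest["reward"], honest["cost"], lam)

# If the policy keeps producing costly outputs, the penalty weight grows.
lam_after_violation = update_lambda(lam, mean_cost=3.0)
```

With these numbers the hacked response wins on reward alone (2.0 vs 1.0) but loses once the cost term is included (-1.0 vs 2.0).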