ML//Alignment//Goodhart's curse

When a measure becomes a target, it ceases to be a good measure.


When a measure becomes a target, it ceases to be a good measure.

The moment you write a constitution ("be helpful, be honest, be harmless"), the model optimizes for what those words mean in the training distribution, not what you meant philosophically.

Values get compressed into loss functions, and loss functions are leaky approximations.

"be helpful" → sycophancy. "be honest" → brutal bluntness. "be harmless" → refuses everything.

Even if we could scale oversight forever, we'd still need to answer: "good according to what, exactly?"

Every rule has edge cases, every value conflicts with another (freedom vs safety, honesty vs kindness)

The gap between "what we wrote" and "what we meant" is where catastrophic misalignment lives (the value specification problem)