ML//Alignment

The scalable oversight problem: we can't supervise what we can't understand.


The scalable oversight problem: we can't supervise what we can't understand.

The RLHF feedback loop works while the human can still judge which output is better.

The RLAIF feedback loop works while the human can write complex enough constitutions and trusts the model to follow them as intended.

In creating new knowledge or insight, we can't tell genius from hallucination — neither do we know how to align constitutions reliably.

The billion-dollar question: can AI reliably supervise itself without collapsing into reward hacking or echo chambers? nobody knows yet.