ML//Training//constitutional AI

Expert writes a constitution (rules) — the AI self-improves against it.


Process (X = prompt): AI[X] produces Bx → AI[X + Bx + rules] produces a reflection (R) on Bx → AI[X + Bx + R] produces By (By is assumed better than Bx, written By > Bx)
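
The process above can be sketched as three calls to the same model. This is a minimal sketch, not a reference implementation: `generate` is a hypothetical callable (prompt text in, completion text out), the prompt wording is made up for illustration, and passing Bx back in alongside R is an assumption about how the revision step sees the draft.

```python
from typing import Callable, Dict, List

# Hypothetical model interface: prompt text in, completion text out.
Generate = Callable[[str], str]


def critique_and_revise(generate: Generate, x: str, rules: List[str]) -> Dict[str, str]:
    """One pass of the process: draft Bx, reflection R, revision By."""
    # Step 1: AI[X] produces Bx.
    bx = generate(x)

    # Step 2: the model sees X, Bx and the rules, and produces a reflection R.
    rule_text = "\n".join(f"- {rule}" for rule in rules)
    r = generate(
        f"Prompt:\n{x}\n\nResponse:\n{bx}\n\nRules:\n{rule_text}\n\n"
        "Explain where the response breaks the rules."
    )

    # Step 3: the model sees X, Bx and R, and crafts By.
    by = generate(
        f"Prompt:\n{x}\n\nResponse:\n{bx}\n\nReflection:\n{r}\n\n"
        "Rewrite the response so that it follows the rules."
    )
    return {"x": x, "bx": bx, "r": r, "by": by}


def toy_generate(prompt: str) -> str:
    """Stand-in for a real model, only here so the sketch runs end to end."""
    if "Rewrite the response" in prompt:
        return "revised"
    if "Explain where" in prompt:
        return "critique"
    return "draft"


record = critique_and_revise(toy_generate, "Say hi", ["Be polite"])
```

With a real model, `generate` would wrap whatever inference call is available; the record keeps all four pieces so later stages can pick what they need.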

The AI doesn't choose between Bx and By (that's iterative DPO) — it CRAFTS By through reasoning.

The (Bx, By) pairs, with By as the preferred response, feed DPO directly. This is a simpler route than training a reward model (RM) and then optimizing against it with PPO.
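
To show why no reward model is needed, here is a small sketch of the per-pair DPO loss computed from sequence log-probabilities. The record keys (`x`, `bx`, `by`), the function names and the `beta` default are assumptions for illustration; in practice the log-probabilities come from the policy being trained and a frozen reference copy.

```python
import math
from typing import Dict


def to_preference_pair(record: Dict[str, str]) -> Dict[str, str]:
    """By is treated as the chosen response, Bx as the rejected one."""
    return {"prompt": record["x"], "chosen": record["by"], "rejected": record["bx"]}


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Per-pair DPO loss: -log(sigmoid(margin)).

    margin = beta * [(log pi(By|X) - log pi_ref(By|X))
                     - (log pi(Bx|X) - log pi_ref(Bx|X))]
    """
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


pair = to_preference_pair({"x": "Say hi", "bx": "draft", "by": "revised"})
```

The loss only compares how much the policy has moved toward By and away from Bx relative to the reference, so the preference pairs are the whole training signal.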

Sometimes the reasoning output (R) is reused as SFT data, since it contains useful structure and the core ideas.
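
One way this reuse could look, as a sketch only: the record keys, the `Reflection:` / `Answer:` layout and the `include_reflection` switch are all hypothetical choices, not something the process prescribes.

```python
from typing import Dict


def to_sft_example(record: Dict[str, str], include_reflection: bool = True) -> Dict[str, str]:
    """Turn one (X, R, By) record into a prompt/completion pair for SFT."""
    if include_reflection:
        # Keep the reasoning in the target so the model learns its structure.
        completion = f"Reflection:\n{record['r']}\n\nAnswer:\n{record['by']}"
    else:
        # Train on the revised answer only.
        completion = record["by"]
    return {"prompt": record["x"], "completion": completion}


example = to_sft_example({"x": "Say hi", "r": "critique", "by": "revised"})
```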