ML//Training//dataset//synthetic data

Teaching AIs with other AIs' homework — it works surprisingly well until the errors start compounding and nobody remembers what the original looked like.


Main risk: model collapse. When each generation of models trains on the previous generation's output, errors compound and diversity shrinks. The tails of the distribution (the rare cases) erode first.
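A toy illustration of the collapse dynamic, not a claim about any real training pipeline: fit a Gaussian to a small sample, draw the next "generation" from that fit, and repeat. The function name `simulate_collapse` and the parameters (200 generations, 20 samples each, seed 0) are assumptions chosen for the sketch.

```python
import random
import statistics


def simulate_collapse(generations=200, sample_size=20, seed=0):
    """Repeatedly refit a Gaussian to samples drawn from the previous fit.

    Returns the fitted standard deviation at each generation,
    starting with the original distribution's value.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # generation 0 stands in for the "real" data
    history = [sigma]
    for _ in range(generations):
        # Each generation sees only the previous generation's output.
        samples = [rng.gauss(mu, sigma) for _ in range(sample_size)]
        mu = statistics.fmean(samples)
        sigma = statistics.pstdev(samples)  # maximum-likelihood estimate
        history.append(sigma)
    return history


history = simulate_collapse()
for gen in (0, 50, 100, 200):
    print(f"generation {gen:3d}: fitted std = {history[gen]:.6f}")
```

Each refit tends to underestimate the spread slightly, and the estimation noise is never corrected by fresh real data, so the fitted standard deviation drifts toward zero. Values that were merely unusual under the original distribution become effectively unreachable, which is the tail erosion described above.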

But carefully curated synthetic data (math proofs, code with verification) works well — the key is external verification, not blind self-training.
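A minimal sketch of what external verification can look like, under heavy simplification: a stand-in "generator" emits addition problems with answers that are sometimes wrong, and a checker that does not trust the generator recomputes each answer before the example is kept. All names here (`noisy_generator`, `verify`, `curate`) and the 30% error rate are hypothetical, chosen only for illustration.

```python
import random


def noisy_generator(rng, n, error_rate=0.3):
    """Stand-in for a model: emits (a, b, claimed_sum), sometimes wrong."""
    examples = []
    for _ in range(n):
        a, b = rng.randint(0, 99), rng.randint(0, 99)
        claimed = a + b
        if rng.random() < error_rate:
            claimed += rng.choice([-2, -1, 1, 2])  # inject a wrong answer
        examples.append((a, b, claimed))
    return examples


def verify(example):
    """External check: recompute the answer independently of the generator."""
    a, b, claimed = example
    return a + b == claimed


def curate(examples):
    """Keep only the examples that pass the external check."""
    return [ex for ex in examples if verify(ex)]


rng = random.Random(0)
raw = noisy_generator(rng, 1000)
clean = curate(raw)
print(f"kept {len(clean)} of {len(raw)} synthetic examples")
```

The point of the design is that `verify` never consults the generator's own judgment. For proofs or code, the same role would be played by a proof checker or a test suite.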

The uncomfortable question: what happens when most training data is AI-generated?

Constitutional AI uses synthetic data wisely: the model critiques and revises its own responses, and the revised responses become training data, but the constitution (a written set of principles) provides an external anchor.