ML//Training//dataset//synthetic data
Teaching AIs with other AIs' homework — it works surprisingly well until the errors start compounding and nobody remembers what the original looked like.
Risk of model collapse: when each generation trains on the previous generation's output, errors compound and diversity shrinks. The tails of the distribution erode first.
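A toy illustration of the tail erosion, not a model of real training: a "model" here is just a table of token frequencies, refit each generation on a finite sample drawn from the previous table. The Zipf-like starting weights, the function name `simulate_collapse`, and all parameter values are assumptions chosen for the sketch.

```python
import random
from collections import Counter


def simulate_collapse(num_tokens=50, n_samples=100, generations=30, seed=0):
    """Return the number of surviving tokens after each generation.

    Generation 0 is a Zipf-like distribution (weight 1/rank). Every later
    generation is estimated only from samples of the previous one.
    """
    rng = random.Random(seed)
    tokens = list(range(num_tokens))
    # Rare tokens (high rank) form the tail of the distribution.
    weights = [1.0 / (rank + 1) for rank in tokens]
    support_sizes = [num_tokens]

    for _ in range(generations):
        # "Generate" synthetic data from the current model.
        sample = rng.choices(tokens, weights=weights, k=n_samples)
        # "Train" the next model: its weights are the observed counts.
        counts = Counter(sample)
        weights = [counts[t] for t in tokens]
        support_sizes.append(sum(1 for w in weights if w > 0))

    return support_sizes


if __name__ == "__main__":
    print(simulate_collapse())
```

A token that receives zero samples in one generation has weight zero from then on, so the support can only stay flat or shrink. Low-probability tokens are the most likely to miss a sample, which is why the tail goes first.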
But carefully curated synthetic data (math proofs, code with verification) works well — the key is external verification, not blind self-training.
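A minimal sketch of what an external verification filter can look like, assuming a task where checking an answer is cheap and independent of whoever produced it. Prime factorization stands in for "math proofs" or "code with tests"; the candidate list and the names `verify_factorization` and `filter_verified` are hypothetical, and the candidates are hard-coded where a generator model's output would normally go.

```python
from math import prod


def is_prime(k: int) -> bool:
    """Trial division; fine for the small numbers in this sketch."""
    if k < 2:
        return False
    i = 2
    while i * i <= k:
        if k % i == 0:
            return False
        i += 1
    return True


def verify_factorization(n: int, factors: list) -> bool:
    """External check: every factor is prime and the product equals n."""
    return bool(factors) and all(is_prime(f) for f in factors) and prod(factors) == n


def filter_verified(candidates):
    """Keep only the synthetic samples that pass the verifier."""
    return [(n, factors) for n, factors in candidates if verify_factorization(n, factors)]


# Stand-in for model-generated samples (hypothetical data).
candidates = [
    (91, [7, 13]),        # correct
    (91, [7, 12]),        # wrong product, and 12 is not prime
    (60, [2, 2, 3, 5]),   # correct
    (60, [4, 15]),        # product is right, but factors are not prime
    (97, [97]),           # correct: 97 is prime
]

training_set = filter_verified(candidates)
```

The verifier never consults the generator, so generator mistakes are dropped instead of being trained on. Blind self-training corresponds to skipping `filter_verified` and keeping all five candidates.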
The uncomfortable question: what happens when most training data is AI-generated?
Constitutional AI uses synthetic data wisely: the model critiques and revises its own responses and is then trained on the revisions, but the constitution provides an external anchor.