ML//Training
The pipeline that takes a raw model from "predicts text" to "follows instructions" to "aligns with human values".
Rough evolution: pre-training → SFT → RLHF/DPO
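Preference data format is largely algorithm-agnostic: the same (prompt, chosen, rejected) triplets can feed DPO, PPO, and GRPO. The difference is on-policy vs off-policy consumption. DPO trains on the triplets directly (off-policy), while PPO and GRPO can use them to fit a reward model and then train on completions sampled from the current policy (on-policy).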
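A minimal sketch of how one triplet can serve both routes, assuming log-probabilities and reward scores are already computed elsewhere. The names `PreferenceTriplet`, `pairwise_loss`, `dpo_loss`, `reward_model_loss` and the default `beta=0.1` are illustrative assumptions, not any library's API. Both losses reduce to the same pairwise form, -log(sigmoid(margin)); only the definition of the margin changes.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferenceTriplet:
    """One preference example; the same record can feed DPO or a reward model."""
    prompt: str
    chosen: str
    rejected: str


def pairwise_loss(margin: float) -> float:
    """-log(sigmoid(margin)), computed as a numerically stable softplus(-margin)."""
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one triplet, given sequence log-probs under policy and reference.

    Off-policy: the completions come from the dataset, not from the current policy.
    """
    margin = (policy_chosen_logp - ref_chosen_logp) - (policy_rejected_logp - ref_rejected_logp)
    return pairwise_loss(beta * margin)


def reward_model_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry loss for fitting a reward model on the same triplet.

    The fitted reward model can then score on-policy samples for PPO or GRPO.
    """
    return pairwise_loss(reward_chosen - reward_rejected)


example = PreferenceTriplet(prompt="2+2?", chosen="4", rejected="5")
# Hypothetical log-probs: the policy already prefers `chosen` more than the reference does.
loss_when_preferring_chosen = dpo_loss(-1.0, -5.0, -2.0, -2.0)
loss_at_initialisation = dpo_loss(-2.0, -2.0, -2.0, -2.0)  # policy == reference
```

When the policy equals the reference, the margin is zero and the loss is log(2); it decreases as the policy favours the chosen completion more strongly than the reference does.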
Catastrophic forgetting is real: training only on new data overwrites old knowledge. A common mitigation is a replay buffer, i.e. mixing a percentage of old data into the new training set.
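The replay idea can be sketched as a data-mixing step before training. This is a minimal illustration, assuming examples are held in plain Python lists; the function name `mix_with_replay`, the `replay_fraction` parameter, and its default of 10% are hypothetical choices, not a recommended setting.

```python
import random


def mix_with_replay(new_examples, old_examples, replay_fraction=0.1, seed=0):
    """Return a shuffled training list in which roughly `replay_fraction`
    of the items are sampled from `old_examples`.

    If there is not enough old data, all of it is used and the achieved
    fraction is lower than requested.
    """
    if not 0.0 <= replay_fraction < 1.0:
        raise ValueError("replay_fraction must be in [0, 1)")
    rng = random.Random(seed)
    # Solve n_old / (n_new + n_old) = replay_fraction for n_old.
    n_old = round(replay_fraction * len(new_examples) / (1.0 - replay_fraction))
    n_old = min(n_old, len(old_examples))
    replay = rng.sample(list(old_examples), n_old)  # sample without replacement
    mixed = list(new_examples) + replay
    rng.shuffle(mixed)
    return mixed


new_data = [("new", i) for i in range(90)]
old_data = [("old", i) for i in range(100)]
mixed = mix_with_replay(new_data, old_data, replay_fraction=0.1)
n_replayed = sum(1 for tag, _ in mixed if tag == "old")
```

With 90 new examples and a 10% replay fraction, 10 old examples are added, giving 100 training items in total.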
Taxonomic annotation (labeling) is a data-enrichment step that can feed into any method: SFT training pairs, constitution enhancement for DPO, pre-training data, or RAG.