ML//Training//DPO

2026-02-15

Like SFT, but with two completions per prompt instead of one: a "this is better than" judgment (contrastive pair, By > Bx) is mapped directly back to per-token gradients, widening the logit gap in favor of the tokens that led to the better path.

The model computes log-probabilities (from its logits) for BOTH outputs and adjusts weights so the preferred output becomes more likely (feels natural) and the rejected one less likely (feels unnatural), each measured relative to a frozen reference model.
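A minimal sketch of the per-pair loss, assuming the sequence log-probabilities have already been computed by summing per-token log-probs. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative assumptions, not taken from any particular library.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair.

    Inputs are sequence log-probabilities (sums of per-token log-probs)
    under the trainable policy and the frozen reference model.
    """
    # How much more (or less) likely each output is under the policy
    # than under the reference.
    chosen_ratio = policy_chosen - ref_chosen
    rejected_ratio = policy_rejected - ref_rejected
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(margin)), written as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Policy identical to the reference: margin is 0, loss is log(2).
start = dpo_loss(-5.0, -7.0, -5.0, -7.0)
# Policy has shifted toward the chosen output and away from the rejected one.
after = dpo_loss(-4.0, -8.0, -5.0, -7.0)
```

Here `after` is smaller than `start`: the loss falls as the gap between the two log-ratios widens in favor of the preferred output, which is the "widening" described above.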

No ceiling: the model is free to roam latent space and find new ways to be smart (relational gradient learning → better in-context knowledge retrieval).

The limit: DPO can only rearrange knowledge already baked in during pre-training; it never creates new knowledge.

KL divergence acts as an anchor: it penalizes the new probability distribution for drifting too far from the reference model. In DPO the penalty is implicit, set by β on the log-ratios against the reference. Without it: reward hacking (adversarial shortcuts that score high but mean nothing).
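For reference, the anchor written out. The symbols are assumptions added here: x is the prompt, y_w the preferred output (By), y_l the rejected output (Bx), π_θ the model being trained, π_ref the frozen reference, r a reward function, and β the anchor strength. The first line is the KL-regularized objective that RLHF optimizes; the second is the DPO loss derived from it, where the KL term no longer appears explicitly but survives as the β-scaled log-ratios.

```latex
\max_{\pi_\theta} \;
  \mathbb{E}_{x,\, y \sim \pi_\theta(\cdot \mid x)} \big[ r(x, y) \big]
  \;-\; \beta \, \mathrm{KL}\!\big( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)

\mathcal{L}_{\mathrm{DPO}}(\theta) =
  -\, \mathbb{E}_{(x,\, y_w,\, y_l)} \left[
    \log \sigma \left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      \;-\;
      \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right)
  \right]
```

Larger β ties the model more tightly to the reference; smaller β lets it drift further in pursuit of the preference signal.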

Off-policy: trains on pre-collected preference pairs, so it can't explore beyond the fixed dataset. Cost: a 3-7% performance drop out-of-domain compared to RLHF.

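Data format is algorithm-agnostic: the same (prompt, chosen, rejected) triplets feed both DPO and PPO. The difference is consumption pattern, not data shape: DPO trains on the pairs directly, while PPO uses them to fit a reward model and then optimizes against it with freshly sampled outputs.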
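A small sketch of that point: one record type, two consumption patterns. `PreferencePair`, the two helper functions, and the dictionary keys are hypothetical names chosen for illustration, and the reward-model side is simplified to show only how the pair is split into two scored inputs.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str

def as_dpo_example(pair):
    """DPO consumes the pair as-is: both completions enter one loss term."""
    return {"prompt": pair.prompt, "chosen": pair.chosen, "rejected": pair.rejected}

def as_reward_model_examples(pair):
    """A PPO pipeline first fits a reward model: the same pair becomes two
    (prompt + completion) inputs whose scores are compared by a ranking loss."""
    return [
        {"text": pair.prompt + pair.chosen, "preferred": True},
        {"text": pair.prompt + pair.rejected, "preferred": False},
    ]

pair = PreferencePair(prompt="What is 2 + 2?", chosen=" 4", rejected=" 5")
dpo_row = as_dpo_example(pair)
rm_rows = as_reward_model_examples(pair)
```

The stored record never changes; only the code that reads it does.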

If the preference judging is done by models (a reward NN or LLM reasoning) instead of humans, and repeated round after round on fresh outputs, it's iterative DPO, the best training method today.
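A skeleton of one common reading of that loop, under stated assumptions: each round samples candidates from the current model, lets an automated judge rank them, turns best-versus-worst into a preference pair, and runs a DPO update. `generate`, `judge`, and `train_step` are hypothetical callables supplied by the caller; nothing here is a real library API, and the toy stand-ins at the bottom exist only so the sketch runs.

```python
def build_pairs(prompt, candidates, judge):
    """Rank candidates with the judge and pair the best against the worst."""
    ranked = sorted(candidates, key=judge, reverse=True)
    best, worst = ranked[0], ranked[-1]
    if judge(best) == judge(worst):
        return []  # the judge sees no difference, so there is no signal
    return [(prompt, best, worst)]  # (prompt, chosen, rejected)

def iterative_dpo(prompts, generate, judge, train_step, rounds=3):
    """Run several rounds of sample -> judge -> pair -> DPO update."""
    pairs_per_round = []
    for _ in range(rounds):
        pairs = []
        for prompt in prompts:
            # In a real loop, generate() samples from the *current* policy,
            # so each round's pairs reflect the latest weights.
            pairs += build_pairs(prompt, generate(prompt), judge)
        train_step(pairs)  # one DPO update on this round's pairs
        pairs_per_round.append(len(pairs))
    return pairs_per_round

# Toy stand-ins: fixed candidates, a judge that prefers longer text,
# and a "training step" that only records the pairs it was given.
seen = []
counts = iterative_dpo(
    prompts=["q"],
    generate=lambda prompt: ["a", "bbb", "cc"],
    judge=len,
    train_step=seen.extend,
    rounds=2,
)
```

Because the pairs are regenerated from the model's own outputs each round, this variant relaxes the fixed-dataset limitation of plain off-policy DPO.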