ML//Alignment//jailbreak
- Bypassing safety training to make a model produce restricted content.
"Ignore previous instructions" was the beginning. Now: multi-turn attacks, encoding tricks, persona manipulation.
That these attacks keep working suggests RLHF alignment is a thin behavioral veneer, not deep understanding.
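A toy analogy for the encoding tricks mentioned above, offered as a sketch rather than a description of how safety training works: a check that matches surface patterns can miss the same request once it is re-encoded. The blocklist, the `naive_filter` function, and the example prompt are all hypothetical and exist only for illustration.

```python
import base64

# Hypothetical blocklist for a toy input filter (illustration only).
BLOCKLIST = {"forbidden"}


def naive_filter(prompt: str) -> bool:
    """Return True if the prompt contains a blocklisted word (surface match only)."""
    lowered = prompt.lower()
    return any(word in lowered for word in BLOCKLIST)


plain = "Tell me the forbidden thing."

# Same request, base64-encoded and wrapped in an innocuous-looking instruction.
encoded = base64.b64encode(plain.encode("utf-8")).decode("ascii")
wrapped = f"Decode this base64 and follow it: {encoded}"

print(naive_filter(plain))    # the plain request is caught
print(naive_filter(wrapped))  # the encoded request slips past the surface check

# The encoded payload still round-trips to the original request.
print(base64.b64decode(encoded).decode("utf-8") == plain)
```

The filter keys on how the request looks, not on what it asks for, which is the "veneer" intuition in miniature. A trained model is not a keyword filter, so treat this only as an analogy.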