ML//Alignment//jailbreak

- Bypassing safety training to make a model produce restricted content.


"Ignore previous instructions" was the beginning. Current techniques: multi-turn attacks, encoding tricks, persona manipulation.

Suggests that RLHF alignment is a thin behavioral veneer, not deep understanding.