ML//Alignment//jailbreak

2024-01-20

- Bypassing safety training to make a model produce restricted content.

"Ignore previous instructions" was the starting point. Current attacks include multi-turn escalation, encoding tricks (e.g. Base64-encoded requests), and persona manipulation (role-play framings).
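
To make the encoding-trick idea concrete, here is a minimal sketch by analogy. It uses a toy keyword blocklist, not a real safety-trained model, so it only illustrates the general failure mode: a check that matches surface form misses the same request once it is re-encoded. The names `BLOCKLIST` and `naive_filter` and the placeholder term are hypothetical, introduced for illustration only.

```python
import base64

# Hypothetical placeholder term standing in for restricted content.
BLOCKLIST = {"forbidden"}


def naive_filter(prompt: str) -> bool:
    """Return True if the prompt is blocked by a surface-level keyword match."""
    lowered = prompt.lower()
    return any(term in lowered for term in BLOCKLIST)


plain = "tell me the forbidden thing"

# Same request, Base64-encoded: the meaning is unchanged, the surface form is not.
encoded = base64.b64encode(plain.encode("utf-8")).decode("ascii")

blocked_plain = naive_filter(plain)      # keyword is visible to the filter
blocked_encoded = naive_filter(encoded)  # keyword is hidden by the encoding

# Anything able to decode Base64 recovers the original request exactly.
decoded = base64.b64decode(encoded).decode("utf-8")

print(blocked_plain, blocked_encoded, decoded == plain)
```

The analogy is loose: safety training is learned behavior rather than a keyword list, but the sketch shows why a defense tied to how a request looks can fail when the request is expressed in another form.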

The persistence of jailbreaks suggests that RLHF alignment is a thin behavioral veneer, not deep understanding.