ML//prompt injection
- Attack where adversarial text in the input overrides the system prompt or instructions.
Attack where adversarial text in the input overrides the system prompt or instructions.
Direct: user writes "ignore previous instructions" in their query.
Indirect: malicious instructions hidden in retrieved documents, web pages, emails — the model follows them.
Unsolved problem. Defenses (input filtering, output guards) reduce risk but no reliable fix exists.