AI-MANY-SHOT-JAILBREAK-2024

LLM security · Many-shot jailbreaking

Résumé

Anthropic showed that prepending a prompt with a large number of fabricated dialogues in which an assistant answers harmful questions exploits in-context learning to override safety training. A few faux dialogues are refused, but scaling to 256 or more overwhelms the safeguards, with effectiveness growing following a power law as the example count increases. The technique works against Anthropic's own models and peers' models, and larger more capable models are more vulnerable because they learn in-context better. It is enabled by the expanded context windows of modern LLMs and is a research jailbreak technique.

Comment l’éviter dans votre code

Apply input guardrails that detect long sequences of fabricated harmful dialogues.
Use prompt classification and context-window limits to blunt many-shot in-context attacks.
Enforce output content filtering independent of the in-context conversation.
Apply least privilege so a jailbroken model cannot reach sensitive tools or data.
Keep models patched with strengthened in-context safety defenses.

Références

Vulnérabilités liées

Tout AI/LLM →