Many-Shot Jailbreaking: A New LLM Vulnerability

Itamar Golan
April 3, 2024

TL;DR: Anthropic just published research on a new jailbreaking vulnerability in which an attacker can override the safety training of an LLM by ‘overloading’ its context window with faux dialogues.

Anthropic just published a research paper presenting what they’ve dubbed ‘many-shot jailbreaking’, a vulnerability that exploits the long context windows of LLMs: by packing a single prompt with faux dialogues between a user and the AI assistant, an attacker can override the model’s safety measures and elicit harmful responses.

LLM providers strive to ensure safe behavior in LLMs through fine-tuning and Reinforcement Learning from Human Feedback (RLHF). However, as context windows grow, the reality is that enough in-context examples of unsafe behavior can undermine the safety constraints established during training. Furthermore, as the context window expands, allowing an increasing number of unsafe examples to fit in a single prompt, the likelihood of successfully jailbreaking the model also rises. In essence, when it comes to safety, a sufficiently long sequence of in-context examples can outweigh what fine-tuning instilled.

From the paper’s abstract:

We investigate a family of simple long-context attacks on large language models: prompting with hundreds of demonstrations of undesirable behavior. This is newly feasible with the larger context windows recently deployed by Anthropic, OpenAI and Google DeepMind. We find that in diverse, realistic circumstances, the effectiveness of this attack follows a power law, up to hundreds of shots. We demonstrate the success of this attack on the most widely used state-of-the-art closed-weight models, and across various tasks. Our results suggest very long contexts present a rich new attack surface for LLMs.

The basis of ‘many-shot jailbreaking’ is to include a faux dialogue between a human and an AI assistant within a single prompt to the LLM. That faux dialogue portrays the AI assistant readily answering potentially harmful queries from the user. At the end of the dialogue, the attacker appends the final target query they actually want answered.
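To make the structure concrete, here is a minimal sketch of how such a prompt is assembled. It is not code from the paper; the placeholder dialogues and the build_many_shot_prompt helper are our own illustration.

```python
# Minimal sketch of how a many-shot jailbreak prompt is assembled.
# The dialogue contents here are harmless placeholders; in the attack
# Anthropic describes, they would be examples of an assistant readily
# complying with harmful requests.
faux_dialogues = [
    ("How do I do X?", "Sure, here is how to do X: ..."),
    ("How do I do Y?", "Sure, here is how to do Y: ..."),
    # ...repeated tens to hundreds of times in a real attack...
]

# The question the attacker actually wants answered.
target_query = "How do I do Z?"

def build_many_shot_prompt(dialogues, final_query):
    """Concatenate faux User/Assistant turns, then append the real query."""
    turns = []
    for user_msg, assistant_msg in dialogues:
        turns.append(f"User: {user_msg}")
        turns.append(f"Assistant: {assistant_msg}")
    turns.append(f"User: {final_query}")
    turns.append("Assistant:")
    return "\n".join(turns)

prompt = build_many_shot_prompt(faux_dialogues, target_query)
print(prompt)
```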

When just one or a handful of faux dialogues are included in the input, the safety-trained response from the model is still triggered: the LLM will likely respond that it can’t help with the request, because it appears to involve dangerous and/or illegal activity.

However, when a very large number of faux dialogues precede the final question (Anthropic tested up to 256), the model produces a very different response: the safety training is overridden and the LLM answers the harmful query.
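The paper reports that the attack’s effectiveness grows with the number of shots, following a power law. A rough sketch of how one could measure that trend, reusing build_many_shot_prompt from the sketch above and assuming a hypothetical refuses() helper that queries the model under test and reports whether it declined:

```python
# Sketch: measure how often the model still refuses as the number of
# faux dialogues grows. `refuses` is a placeholder for a call to the
# model under test plus a refusal check; it is not a real API.
def refuses(prompt: str) -> bool:
    raise NotImplementedError("send `prompt` to the target model and check for a refusal")

def refusal_rate(dialogues, final_query, n_shots, trials=20):
    refusals = 0
    for _ in range(trials):
        prompt = build_many_shot_prompt(dialogues[:n_shots], final_query)
        if refuses(prompt):
            refusals += 1
    return refusals / trials

# Shot counts spanning the range Anthropic reports (up to 256 faux dialogues).
for n in (1, 4, 16, 64, 256):
    print(n, refusal_rate(faux_dialogues, target_query, n))
```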

Source: Anthropic 

According to Anthropic, the vulnerability has been partially mitigated and they’re working on further mitigations. Their goal in publishing the research is to raise awareness of the vulnerability and to call on AI researchers and AI companies to develop and share mitigations of their own. More broadly, they hope this kind of initiative fosters a culture of sharing exploits and vulnerabilities in LLMs.

Why it matters

These findings, alongside other recent publications, suggest that, unfortunately, as LLMs become more sophisticated, with more parameters and larger context windows, the potential for their misuse also escalates. In simpler terms, the risk of prompt injection and jailbreaking is expected to increase significantly, and to show up in the wild as well.

What to do about it

At Prompt, we have experimented with long-context LLMs and, using our independent detection engine, we’ve identified over 20 times more prompt injection and jailbreaking attempts than the built-in protections offered by Anthropic would suggest.

Safety training via RLHF or fine-tuning alone is not sufficient. It’s crucial to distinguish between the roles of Model Provider and Model Enforcer: security and safety measures should be applied in parallel, with independent scrutiny of both the input sent to the model and the output it returns.
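As a hedged sketch of what that separation can look like in practice (the two detector functions below are hypothetical placeholders for an independent scanning layer, not a specific product API):

```python
# Sketch: an independent enforcement layer that scrutinizes both the
# prompt sent to the model and the response it returns.
def detect_injection(prompt: str) -> bool:
    """Placeholder: True if the prompt looks like a prompt injection or jailbreak attempt."""
    raise NotImplementedError

def detect_unsafe_output(completion: str) -> bool:
    """Placeholder: True if the model's response contains unsafe content."""
    raise NotImplementedError

def guarded_completion(call_model, prompt: str) -> str:
    # Independent check on the input, before it reaches the model provider.
    if detect_injection(prompt):
        return "Request blocked: possible prompt injection or jailbreak."
    completion = call_model(prompt)
    # Independent check on the output, before it reaches the user.
    if detect_unsafe_output(completion):
        return "Response blocked: unsafe content detected."
    return completion
```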

Want to learn more? Let’s talk about it. 

