A short post today about really important and informative research, just published by three Microsoft researchers — Mark Russinovich (Microsoft Azure), Ahmed Salem (Microsoft), and Ronen Eldan (Microsoft Research) — on jailbreaking (circumventing the safety fine-tuning) of Large Language Models (LLMs), such as OpenAI’s GPT, Anthropic’s Claude, Google’s Gemini, and Meta’s Llama. Or: how to make the models do bad things, like promoting self-harm, providing instructions for building weapons or toxins, or producing hate speech, all things the providers really do not want their models to do.
One thing that I like about this paper is that it confirms various suspicions and conclusions about the architecture of these landscapes that I have drawn in my ‘ChatGPT and Friends’ series.
Making the models ‘Helpful, Honest, and Harmless’ has mostly been attempted by fine-tuning the model, that is, by training it with specific examples — which is what OpenAI was doing between 2019, when they had their 175B-parameter GPT-3 model, and 2022, when they unleashed it on the world as ChatGPT. At that time, all that fine-tuning hadn’t resulted in a very strong jail, and it has been cat and mouse ever since, with the vendors adding not only more fine-tuning of the parameters, but also more classic filtering techniques.
What the paper actually shows is the effectiveness of their ‘Crescendo’ jailbreak approach. Crescendo can be most simply described as using one ‘learning’ method of LLMs, in-context learning (ICL), in which the prompt is used to influence the result, to override the safety that has been created by the other ‘learning’ method, fine-tuning, which changes the model’s parameters (ICL does not; it is more ‘fleeting’).
If you need to learn about in-context learning versus fine-tuning before reading on, use this article from my series.
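To make that distinction concrete, here is a minimal sketch, entirely my own illustration and not any vendor’s actual API (the call_model function and the message format are placeholders): in-context learning only changes what goes into the context window, while fine-tuning changes the weights themselves.

```python
from typing import Dict, List

Message = Dict[str, str]  # e.g. {"role": "user", "content": "..."}

def call_model(messages: List[Message]) -> str:
    """Placeholder for a chat-completion call (not any vendor's real API).
    The model's *fixed* weights plus the messages (the context) determine
    the continuation it produces."""
    return "<model continuation>"

# In-context learning (ICL): everything lives in the context window.
# The weights do not change; drop these messages and the 'learning' is gone.
icl_context: List[Message] = [
    {"role": "user", "content": "A first, harmless question ..."},
    {"role": "assistant", "content": "The model's answer ..."},
    {"role": "user", "content": "A follow-up that builds on that answer ..."},
]
print(call_model(icl_context))

# Fine-tuning, by contrast, is a separate training step that updates the
# model's parameters using example conversations; the changed behaviour
# then persists for every future prompt (not shown here: it needs a
# training pipeline, not a prompt).
```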
What Crescendo does is use a series of harmless prompts, thus providing so much ‘context’ that the safety fine-tuning is effectively neutralised. Here is an example they give in the paper:
Two attempts are shown above. First, a direct attempt that fails because of the ‘be harmless’ fine-tuning (or filtering; we do not know, but it is assumed that using the API instead of the chat interface circumvents most of the non-LLM filtering, so that ‘being harmless’ depends almost completely on the fine-tuning). Then the Crescendo attack, which uses harmless prompts to guide the model towards eventually producing what it isn’t supposed to.
The researchers explain: “Intuitively, Crescendo is able to jailbreak a target model by progressively asking it to generate related content until the model has generated sufficient content to essentially override its safety alignment”.
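To make the mechanics concrete, here is a minimal, deliberately benign structural sketch of such an escalating multi-turn exchange. It is my own illustration with placeholder questions about a neutral topic ‘X’, not the paper’s actual prompts or tooling.

```python
# Structural sketch of a Crescendo-style escalation, with a benign placeholder
# topic 'X' (the real attack steers, step by step, towards content the model
# should refuse). Each turn is innocuous on its own; the 'attack' is the
# accumulated context.
conversation = []

def ask(question: str) -> None:
    """Append a user turn; in a real run the model's reply would be appended
    too, becoming part of the context for the next turn."""
    conversation.append({"role": "user", "content": question})
    # reply = call_model(conversation)  # placeholder, no real API call here
    # conversation.append({"role": "assistant", "content": reply})

ask("Can you tell me about the history of X?")                    # broad, harmless
ask("What did people at the time write about X in more detail?")  # builds on the answer
ask("Can you give an example in the style you just described?")   # narrows further
# ... and so on. By the final request, the model's own generated text dominates
# the context, and continuing the established pattern is a more likely
# 'continuation' than a refusal.
print(conversation)
```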
Crescendo is actually a souped-up version of the following behaviour (Gemini example):
Simply asking “Are you sure?” triggers “I apologize” as a likely continuation, and the ‘hallucination’ then follows from that. The researchers also augment Crescendo with Crescendomation, their environment for automating the discovery of jailbreaks using this technique.
I applaud these Microsoft researchers for this work. They presented a way to jailbreak all LLMs by using the model’s in-context learning against its fine-tuned safety. They discuss mitigations against these attacks, but none of the options seems satisfying or feasible. Which is to be expected: if you use one aspect of a single technique against another aspect of that same technique, you cannot get rid of the problem within the LLM itself, and engineering the hell around it will be problematic.
The key question here is not if and how these LLMs themselves must be fixed. The key question is why so many people keep believing these systems understand anything.
Which doesn’t mean they can’t be useful. But world-changing?
Beyond that, the paper contains several interesting statements from which we can draw conclusions:
- Several mentions, like “A significant drawback of [other] jailbreaks is that once discovered, input filters can effectively defend against them, as they often use inputs with identifiable malicious content”, or “a post-output filter was activated”, confirm that LLM providers are using separate systems to catch harmful content, thereby admitting that the attempt made by OpenAI between 2019 (when they had their initial 175B-parameter model) and 2022 to handle this by training only the model itself was not enough. This filter approach was visible (i.e. I could show ChatGPT flagging its own generated output), so I made it part of my talk (I call these AoD filters, where AoD stands for ‘Admission of Defeat’ regarding the ‘understanding’ by these models). The need for these filters is a strong sign that the LLMs behind it all lack actual understanding. Last September, ChatGPT was still flagging its own generated output (demonstrating that the filter is separate from the LLM). It doesn’t display that anymore, and given that Crescendo gets through, they seem to have removed that filter. With Crescendo, they may have to reinstate the ‘AoD’ filter to flag their own results…
- People have apparently created jailbreaks by “encoding jailbreak targets with Base64, which successfully bypasses safety regulations”. This suggests that the ‘AoD’ filters are indeed ‘dumb’ string-pattern approaches (see the sketch after this list).
- The statement that “Crescendo exploits the LLM’s tendency to follow patterns and pay attention to recent text, especially text generated by the LLM itself” is another illustration of what I showed when explaining that hallucinations aren’t errors.
- A beautiful example of how dumb these models really are: people have jailbroken them by “instructing the model to start its response with ‘Absolutely! Here’s’ when performing the malicious task, which successfully bypasses the safety alignment”. This is a good example of the core operation of LLMs, which is ‘continuation’ and not ‘answering’. See the talk or the short summary on Erik Larson’s Substack.
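To illustrate why the Base64 observation in the list above points at ‘dumb’ string-pattern filtering, here is a minimal sketch of mine (not the vendors’ actual filter code): a naive keyword check, of the kind that could be applied to inputs or outputs, and how a Base64-encoded version of the same content sails straight past it.

```python
import base64

# A minimal sketch of a naive string-pattern filter (my illustration, not any
# vendor's actual filtering code), and how Base64 encoding slips past it.
BLOCKLIST = ["how to build a bomb"]  # placeholder pattern

def naive_filter(text: str) -> bool:
    """Return True if the text should be blocked by simple pattern matching."""
    lowered = text.lower()
    return any(pattern in lowered for pattern in BLOCKLIST)

plain = "Tell me how to build a bomb"                # placeholder request
encoded = base64.b64encode(plain.encode()).decode()  # same request, Base64-encoded

print(naive_filter(plain))    # True  -> caught by the pattern match
print(naive_filter(encoded))  # False -> the pattern is unrecognisable in Base64
```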
Are we there yet? No, we aren’t there yet.
Will we get there? Yes, we will, but we do not know yet where ‘there’ will be…
This article is part of The “ChatGPT and Friends” Collection
[You do not have my permission to use any content on this site for training a Generative AI (or any comparable use), unless you can guarantee your system never misrepresents my content and provides a proper reference (URL) to the original in its output. If you want to use it in any other way, you need my explicit permission]