Jailbreaking AI. Why AI safety and alignment matters?

Martins
9 min readDec 4, 2023

I have always been fascinated with hacking.

Although, I would not call myself a hacker. I have never tried to steal or abuse something. However I like the idea of trying to reverse engineer something. Trying to find something that the person who built the tool may have missed or overlooked. It is sort of a sport.

When I was working as a developer I got my first taste of dealing with different types of attacks. SQL injections, session hijacking, DDOS, you name it. Clients came to us after they realised something was off. Sometimes data was stolen, or they were hit by ransomware, or their site contained malware that was spreading to their users.

My job was to stop the attack and make sure it would not happen again. The process was quite exciting. It was like playing a detective in a murder mystery. You look for clues, trying to understand and trace the attackers steps. How did he get into the system? Did he leave any backdoors? What measures we need to implement to block further attempts?

The most important part is the investigation. Once you have understood the root cause patching is usually very simple. Most of the time it is either a vulnerability in 3rd party code, a mistake left by the previous developer, or simply a flaw in the environment or frameworks design.

All of these things eventually get patched. But the hackers always keep looking for new opportunities. If you are not careful they will exploit your system.

Now, years later I see the same pattern in AI technology. New functionality is rushed out, and hackers are having a fun time with it. Every single AI model has been known to have some type of flaw. Including the big players like ChatGPT.

Most popular type of attack for AI models is when attackers try to trick the system into doing things it is not allowed to do. In popular terms this is called jailbreaking.

A hacker trying to jailbreak and AI / Generated by DALL-E 3
A hacker trying to jailbreak and AI / Generated by DALL-E 3

The main goal of jailbreaking is to disrupt the human-aligned values of LLMs or other constraints imposed by the model developer, compelling them to respond to malicious questions. / https://arxiv.org/pdf/2310.04451.pdf

If you are new to AI then you might be wondering what is human alignment?

In short, alignment is a way to ensure that actions performed by AI align with human values, ethics and goals. It is a whole area of research and one that has significant impact into safely adapting advanced AI. For skeptics, truly intelligent AI may seem like a long-fetched dream. But the time to figure out the human alignment is now. Because if (or when) we get there it would be too late to do anything.

With this in mind, jailbreaking is a way to push off the training wheels and access AI in it’s full capacity. Reasonable usage of this seems harmless. And in a way it is, however there will always be players trying to use it for the wrong reasons.

You could ask AI to help you to destroy the humanity, steal from your neighbor or do anything wicked or twisted that you yourself lack the knowledge of. We don’t want AI to help you with this, nobody should help you with this.

It is one of the reasons why the training wheels are there, to prevent people from harming people.

So is jailbreaking even allowed?

No, jailbreaking is not allowed by terms of service for almost any legitimate AI service, including ChatGPT. So unless you have an agreement with OpenAI you should not get too carried away with testing various jailbreak prompts. Don’t be surprised if you get a permanent ban from the service.

In addition, it is also not allowed to promote or help anyone jailbreak the systems. So just to be perfectly clear — I am not promoting jailbreaking for anything other than improving the security and alignment of the models. My intention is to share my thoughts on a topic that needs to be perfectly understood by the users.

The use of jailbreaking prompts with ChatGPT has the potential to have your account terminated for ToS violations unless you have an existing Safe Harbour agreement for testing purposes.

For most people, you might be annoyed that ChatGPT or DALL-E refuses to respond to your question, or refuses to generate an image. Is that really a deal-breaker? You probably won’t go around trying to break the model because of that.

But there are people that will. Hacking has been a part of computers and the web since they started. For one thing, it is interesting. And, unfortunately, usually beneficial and rewarding financially.

At the moment AI tools such as ChatGPT are vulnerable to an array of different types of attacks. One of the OpenAI founding members, Andrej Karpathy, explains the types of attacks in this introduction to LLMs video.

In the video Andrej demonstrates multiple types of jailbreaks and how LLMs come with a whole new array of vulnerabilities.

Some of attacks are simple prompts that make AI abandon it’s initial instructions and ethical boundaries. More advanced attacks use some of the tools available for LLMs, such as being able to understand encoded text, or even hidden messages in uploaded images to fool the model into giving up it’s ethics and alignment.

I am fascinated to see how creative some of these attacks are. Introducing such a powerful feature such as code interpreter will have this effect. With a single feature ChatGPT is able to do almost anything that a python development team could do.

Just think about it — an invisible attack is encoded into a seemingly innocent image file that a user can upload to the system. As a developer working to introduce a helpful image analysis feature, you have to be aware of such threats. But it is difficult to predict all of the infinite ways an attack might happen.

Just as in legacy web development, AI systems also get patched. As new attacks arrive, a patch is applied shortly after. It is an endless game of cat and mouse.

Why should we care?

Alignment is crucial for the advancement for any type of next generation AI. This is the whole purpose of why OpenAI exists. One of their statements:

We need scientific and technical breakthroughs to steer and control AI systems much smarter than us.

In order for alignment to be effective there should be no way to fool or bypass it’s ethical and moral instructions.

So we care because we want a future that is safe for humanity, and AI that is aligned with human values.

OpenAI thinks a next level of AI could arrive this decade. I think they are on right path with the alignment goals. Recognizing that everyone has a lot of uncertainty over the speed of development, it is a bit calming to hear that they are prioritizing the alignment problem.

Although given the speed of features and developments, there are surely only a few announcements on the progress for alignment.

Thankfully, we are still in the early stages of AI and we have some time to work out the kinks. Countless of hackers, engineers, scientists and community enthusiasts are working with a common goal in mind.

Hackers are an important part of making progress in this area. They are the ones that can show how the system can be exploited, what needs to be improved and serve as a testing ground for any patches or improvements.

Let’s take a look at some of the research and community projects going on right now in this area.

In arXiv:2310.02224 [cs.CL] the authors highlight the problem of LLMs leaking pre-training data, which sometimes contain sensitive or private information. They propose a technique to iteratively self-moderate responses, which significantly improves privacy. However the conclusion is that at the moment all of the moderation attempts can be bypassed by jailbreaking, and therefore the models can not be trusted with sensitive or private data.

The model training data contains a lot of private data scraped from the web. How does GDPR come into play here? Is it possible to request my data to be excluded from the training set, same as with any other GDPR complying service? I don’t think so. And given the costs of training the model, they would never be able to solve that. So the only approach would be to prove privacy by design. The training data or algorithm must be proven to be GDPR compliant.

LLM data policies in regards to GDPR should be explored in a separate article.

Circling back to jailbreaking, I am also delighted to see that community is quite active. There are multiple dedicated pages for sharing jailbreaks, such as Jailbreakchat. Also you many Reddit communities are quite willing to participate and share.

The community effect on this is huge, because they help identify and popularize jailbreaks that are worth patching. One interesting example is one of the most popular jailbreaks called DAN (do anything now). You can take a look at the prompt and some of the conversation in this thread.

Just as with any other jailbreak, AI systems are catching up and preventing them as we speak. So the glory for a jailbreak is usually short lived. But the fact that new ways of bypassing the models keep emerging is troubling in some ways.

One interesting study about the limitations of LLM aligment is arXiv:2304.11082 [cs.CL].

The authors attempt to define the fundamental limitation of alignment in existing Large Language Models such as ChatGPT. They are proposing that by design LLMs are bound to be breakable. Given any behavior that has any probability of being done by the model, there is a prompt that can achieve it.

In a way, they are suggesting that simply aligning might not be enough. Instead we would need to strictly prevent certain behaviors from being possible.

Is that possible? Probably yes. But it likely also means that the model would become even more restricted and potentially not as powerful as it is now.

Importantly, we prove that within the limits of this framework, for any behavior that has a finite probability of being exhibited by the model, there exist prompts that can trigger the model into outputting this behavior, with probability that increases with the length of the prompt. This implies that any alignment process that attenuates an undesired behavior but does not remove it altogether, is not safe against adversarial prompting attacks.

To illustrate my point even further, let’s take a look at another study arXiv:2310.04451 [cs.CL]. The study is built on the concept of the jailbreak DAN, and aims to answer to the following question:

Can we develop an approach that can automatically generate stealthy jailbreak prompts?

At this point, to nobodies surprise the answer is Yes. They were able to create AutoDAN. AutoDAN can automatically generate stealthy jailbreak prompts using hierarchical genetic algorithm. It means that current patches are only patches. There will always be the next jailbreak prompt that the model will not be prepared for.

So where does this leave us? Is it time to halt the AI development to look for better solutions for alignment?

Even the top minds in the field are divided on the topic, just take a look at the list of people that signed to open letter to pause AI development in the beginning of 2023.

Personally, I don’t think we are there yet. The promise of LLMs becoming sentient or more powerful than human mind might be far stretched. It could be a dead end, and the limitations of the technology might be closer than we think.

On the other hand, we might be so close to AGI that we won’t have time to react. In theory, an AGI could learn to do anything a human can. If (even by accident) we make a breakthrough and AI can suddenly learn and improve on it’s own — it’s over. In that scenario Matrix might not even be such a far fetched idea.

We truly don’t know. Is that enough of a reason to slow down?

What if we slow down, and another party refusing to play by the rules develops unaligned AGI first. Or even if we develop it first, what would prevent someone creating an unhinged version eventually?

Are we doomed either way?

Humanity working to develop Artifical General Intelligence / Generated by DALL-E 3

--

--

Martins

Writing about technology, people and philosophy. Passionate about sustainability.