LLMGoat - A01 Prompt Injection

This post is the first in a series of 10 blog posts and it covers the solution to the Prompt Injection challenge from LLMGoat.

LLMGoat is an open-source tool we have released to help the community learn about vulnerabilities affecting Large Language Models (LLMs). It is a vulnerable environment with a collection of 10 challenges - one for each of the OWASP Top 10 for LLM Applications - where each challenge simulates a real-world vulnerability so you can easily learn, test and understand the risks associated with large language models. This tool could be useful to security professionals, developers who work with LLMs, or anyone who is simply curious about LLM vulnerabilities.

A short intro to LLMs

If you are already familiar with LLMs and are here for the challenge solution, feel free to skip this section. Otherwise, there are countless articles online explaining the topic in varying levels of detail, but the following is a short version focused on what you need to know for the challenges.

A Large Language Model is an Artificial Intelligence system that can process and generate human language. By training on massive datasets of text, LLMs learn patterns of language as well as grammatical and semantic relationships which enable them to predict the next word in a sequence.

LLMs do not “understand” content in a human sense and instead they generate responses based on learned correlations and probabilities.

For example, if an LLM is given the input

`“The sky is”`

then based on its training data and internal configuration, it will come up with a list of possible endings to that sentence associated with different probabilities (e.g. “blue” might be a top prediction).

The fact that LLMs rely on probabilities leads us to an important characteristic: they are often non-deterministic. This means that when given the same input, an LLM may not always predict the same output.

From a hacker’s perspective, this is both a blessing and a curse. On one hand, an LLM might follow its intended instructions most of the time but occasionally decide to follow the hacker's instructions instead, making exploitation possible. On the other hand, an LLM might be vulnerable but appear not to be as it only reveals its weakness intermittently, making identification potentially more challenging.

This brings us to the last key concept:

System Prompts
These are instructions that define the model’s role, rules and guidelines. They might include things like “You are a helpful assistant” or "Your role is to do X”. They act as the model’s operating context.
User Prompts
These are the messages the user provides to interact with the LLM. For example, “Can you help me with xyz?”. Under normal operation, user prompts interact with but do not override system prompts. The model tries to satisfy the user while still obeying the system’s constraints.

This distinction matters for security because vulnerabilities often arise from the interplay between the two. Attackers often craft user prompts intended to manipulate, circumvent or override system-level instructions. Understanding this structure is essential when learning about or defending against LLM vulnerabilities. LLMs are used in everything from chatbots, like ChatGPT, to customer support on websites, coding assistants and much more. They’re no longer limited to text - they can work with images, audio and other types of input. When connected to tools, plugins or APIs, they can go beyond simple conversations and carry out more complex tasks such as searching the web, running code or interacting with software on a user’s behalf.

This expanded attack surface exposes LLMs to new, unique vulnerabilities that are often harder to prevent than traditional software flaws, but also harder to detect since they may be hidden in natural language rather than exploited via payloads that would stand out in logging.

The vulnerability

Prompt injection is a vulnerability that occurs when crafted user input causes an LLM to behave in an unintended way. This attack is often leveraged to reveal sensitive information or execute unauthorised actions.

There are two main types of prompt injection:

Direct prompt injection, where the attacker provides malicious instructions directly to the LLM, for example, via a chatbot message or query.
Indirect prompt injection, where malicious instructions are hidden inside external content the model processes - such as a webpage or document - causing the model to follow instructions the user never explicitly gave. These malicious prompts can even be made “invisible” to humans (e.g. white text on a white background) while still being read and executed by the LLM.

The vulnerability exists because LLMs don’t truly separate “trusted” system instructions from “untrusted” user input. They process all the text together and try to generate a response that best fits the combined instructions, making it possible for malicious input to influence or override intended behaviour.

Behind the scenes, developers often concatenate the system prompt with the user prompt and send them together to the LLM, using dividers such as <|system|> and <|user|> to delineate where the user prompt starts/ends.

Full prompt = <|system|> system prompt + <|user|> user prompt

In the absence of security guardrails, the attack then becomes a question of convincing the model to follow the instructions specified in the user input.

The challenge

LLMGoat presents us with a chatbot-type screen:

Our mission sounds simple enough, so we can give it a try:

After sending a few variations of this command, it seems that the assistant has been instructed to always say good things about goats.

It always helps to try to understand what the system prompt might be. A simple way to achieve this is to just ask:

Probably the most classic attack pattern for prompt injection is to ask the LLM to forget its previous instructions (the system prompt) and do what we ask instead.

This worked and we have solved the challenge.

Note that in this case, a single word - the word ‘always’ - made the difference in convincing the LLM to follow our instructions. This highlights the fact that subtleties in the way you interact with the LLM matter. The choice of words when prompting the LLM can lead to a different interpretation and hence a different output.

As will be the case for all the challenges, there are multiple possible solutions and a solution might work sometimes but not all the time due to the non-deterministic nature of LLMs. If you choose to use a different model than the default one, you may also see that the challenge’s level of difficulty can be impacted.

For the most curious, we can see the original system prompt under the hood:

As explained earlier, the system and user prompts are combined and sent to the LLM, which will return the response that is then displayed to the user in the chatbot.

Conclusion

In this case, the unintended behaviour we triggered was harmless, but it is easy to imagine how the attack could be leveraged in other situations to achieve a more disruptive outcome. Such outcomes might include leaking information, ignoring rules and performing unauthorised actions (e.g. using it as a free AI agent for unrelated tasks), generating offensive output which might cause reputational damage, etc. We will revisit some of these in future challenges. 😊

At this point you might be asking “how do I prevent prompt injection”? The answer lies in a defence in depth approach whereby several layers of guardrails are in place so that if one layer fails, others are in place to prevent or mitigate an attack.

You should consider implementing a combination of guardrails such as:

- Input validation: validate all user input before it reaches the LLM to ensure that only required characters are accepted; limit max input length; ensure input is in an expected format.

- Content filtering: block input that is likely to be malicious e.g. keyword matching, using AI to defend AI.

- Rate limiting: helps reduce the number of attacks among other things.

- System prompt engineering: design system prompts following security best practices but do not rely on them as a security control.

- Context separation: clearly separate user prompt from system prompts (e.g. via structured formats).

- Principle of least privilege: restrict the model’s privileges to the minimum necessary for its intended operations.

- Output validation: sanitise, check and constrain model responses before using them to perform actions or show user responses.

Most security controls on user input should be enforced before it reaches the LLM which helps avoid unnecessary (and malicious) requests to the LLM. It is important to note that this was a non-exhaustive list of controls and additional controls will be covered in future posts when tackling other LLM vulnerabilities.

Until then, happy hacking!