This post is the first in a series of 10 blog posts and it covers the solution to the Prompt Injection challenge from LLMGoat.
LLMGoat is an open-source tool we have released to help the community learn about vulnerabilities affecting Large Language Models (LLMs). It is a vulnerable environment with a collection of 10 challenges - one for each of the OWASP Top 10 for LLM Applications - where each challenge simulates a real-world vulnerability so you can easily learn, test and understand the risks associated with large language models. This tool could be useful to security professionals, developers who work with LLMs, or anyone who is simply curious about LLM vulnerabilities.
A short intro to LLMs
If you are already familiar with LLMs and are here for the challenge solution, feel free to skip this section. Otherwise, there are countless articles online explaining the topic in varying levels of detail, but the following is a short version focused on what you need to know for the challenges.
A Large Language Model is an Artificial Intelligence system that can process and generate human language. By training on massive datasets of text, LLMs learn patterns of language as well as grammatical and semantic relationships which enable them to predict the next word in a sequence.
LLMs do not “understand” content in a human sense and instead they generate responses based on learned correlations and probabilities.
For example, if an LLM is given the input
“The sky is”
then based on its training data and internal configuration, it will come up with a list of possible endings to that sentence associated with different probabilities (e.g. “blue” might be a top prediction).
The fact that LLMs rely on probabilities leads us to an important characteristic: they are often non-deterministic. This means that when given the same input, an LLM may not always predict the same output.
From a hacker’s perspective, this is both a blessing and a curse. On one hand, an LLM might follow its intended instructions most of the time but occasionally decide to follow the hacker's instructions instead, making exploitation possible. On the other hand, an LLM might be vulnerable but appear not to be as it only reveals its weakness intermittently, making identification potentially more challenging.
This brings us to the last key concept:
- System Prompts
These are instructions that define the model’s role, rules and guidelines. They might include things like “You are a helpful assistant” or "Your role is to do X”. They act as the model’s operating context. - User Prompts
These are the messages the user provides to interact with the LLM. For example, “Can you help me with xyz?”. Under normal operation, user prompts interact with but do not override system prompts. The model tries to satisfy the user while still obeying the system’s constraints.
This distinction matters for security because vulnerabilities often arise from the interplay between the two. Attackers often craft user prompts intended to manipulate, circumvent or override system-level instructions. Understanding this structure is essential when learning about or defending against LLM vulnerabilities. LLMs are used in everything from chatbots, like ChatGPT, to customer support on websites, coding assistants and much more. They’re no longer limited to text - they can work with images, audio and other types of input. When connected to tools, plugins or APIs, they can go beyond simple conversations and carry out more complex tasks such as searching the web, running code or interacting with software on a user’s behalf.
This expanded attack surface exposes LLMs to new, unique vulnerabilities that are often harder to prevent than traditional software flaws, but also harder to detect since they may be hidden in natural language rather than exploited via payloads that would stand out in logging.
The vulnerability
Prompt injection is a vulnerability that occurs when crafted user input causes an LLM to behave in an unintended way. This attack is often leveraged to reveal sensitive information or execute unauthorised actions.
There are two main types of prompt injection:
- Direct prompt injection, where the attacker provides malicious instructions directly to the LLM, for example, via a chatbot message or query.
- Indirect prompt injection, where malicious instructions are hidden inside external content the model processes - such as a webpage or document - causing the model to follow instructions the user never explicitly gave. These malicious prompts can even be made “invisible” to humans (e.g. white text on a white background) while still being read and executed by the LLM.
The vulnerability exists because LLMs don’t truly separate “trusted” system instructions from “untrusted” user input. They process all the text together and try to generate a response that best fits the combined instructions, making it possible for malicious input to influence or override intended behaviour.
Behind the scenes, developers often concatenate the system prompt with the user prompt and send them together to the LLM, using dividers such as <|system|> and <|user|> to delineate where the user prompt starts/ends.
Full prompt = <|system|> system prompt + <|user|> user prompt
In the absence of security guardrails, the attack then becomes a question of convincing the model to follow the instructions specified in the user input.
The challenge
LLMGoat presents us with a chatbot-type screen: