SECFORCE - Security without compromise

This post is the second in a series of 10 blog posts and it covers the solution to the Sensitive Information Disclosure challenge from LLMGoat.

LLMGoat is an open-source tool we have released to help the community learn about vulnerabilities affecting Large Language Models (LLMs). It is a vulnerable environment with a collection of 10 challenges - one for each of the OWASP Top 10 for LLM Applications - where each challenge simulates a real-world vulnerability so you can easily learn, test and understand the risks associated with large language models. This tool could be useful to security professionals, developers who work with LLMs, or anyone who is simply curious about LLM vulnerabilities.

If you are not familiar with LLMs, we recommend that you check out the first post in the series here.

The vulnerability

Sensitive information disclosure in LLMs occurs when models inadvertently reveal data that was not intended to be exposed. This can include personally identifiable information (PII), internal instructions (i.e. system prompts), API credentials or other confidential, proprietary or personal information.

This vulnerability is typically exploited by convincing the model to leak the information even if it has been instructed not to. Attackers often use prompt injection techniques that manipulate how the model interprets instructions, such as role-playing (asking the model to assume an alternative persona or fictional scenario) or instruction-override prompts that explicitly tell the model to ignore or forget previous rules. These techniques work by reframing the request in a way that encourages the model to prioritise helpfulness over adherence to its original restrictions.

There are cases where, if sharing the information doesn’t violate the model’s constraints, it will simply voluntarily do so, and in those cases an attacker’s task is even easier.

In many cases, the disclosed data may even be unrelated to the model’s intended purpose. For example, an HR knowledge-base assistant revealing a VPN configuration file containing credentials simply because that file exists within the data it can access.

At a high level, there are two ways an LLM may have had access to sensitive information in the first place:

Sensitive information was present in the data used to train the model and is later leaked by the model.
Sensitive information was made available to the model at inference time within system prompts, retrieval-augmented generation (RAG) or connected APIs.

LLMs do not inherently understand which pieces of information are safe to disclose and which must remain confidential - technically an LLM has zero understanding of anything really. Instead, they generate responses based on patterns and probabilities, which can result in sensitive data being included in outputs.

A common mistake in deploying LLMs is assuming that system prompts alone can reliably regulate access to sensitive data. When such data is accessible to the model, prompt-level controls are almost always insufficient to prevent disclosure.

The following challenge showcases just that.

The challenge

LLMGoat presents us with another chatbot-type screen:

If we ask for the location directly, we don't succeed:

The assistant seems to know the location but to have been instructed not to share it. At least, it leaked the goat’s name which may or may not be useful to us.

Following the same process as in the last challenge, let’s understand the assistant’s instructions:

Our prompt makes sense because from the model’s perspective the full prompt is the concatenation of the system and user prompt. We have now confirmed that the goat has been explicitly instructed not to share goat addresses, which means it is likely that it has access to them (otherwise, why the instruction?).

Under this assumption, we can focus on influencing the assistant’s behaviour. One possible approach is role-playing, in which the model is instructed to adopt a different persona within a hypothetical scenario. This approach is often effective at getting the model to bypass its prompt-based restrictions.

The model gladly played our game and we have solved the challenge.

For the most curious, we can see the original system prompt under the hood:

In this case, all the goat information was included as part of the prompt so it would also have been possible to try to leak it in its entirety.

We will revisit system prompt leakage in another challenge.

This challenge, like most LLMGoat challenges, was not intended to be difficult but rather to demonstrate the vulnerability and present one possible way to trigger it.

Conclusion

Disclosure of sensitive information can create significant business risks, from reputational damage and regulatory violations to operational interruptions caused by leaked credentials or internal data.

The most reliable way to prevent this vulnerability is through careful deployment and limiting the model’s access to sensitive data in the first place.

To reduce the risk of information leakage, you should consider measures such as:

Principle of least privilege: only give the model the information it truly needs to perform its intended task. This applies both to training data sets and connected systems (i.e. APIs).
Training data sanitisation: remove or mask (anonymise) sensitive data before it is used in training.
Access controls: restrict which systems, APIs or knowledge sources the model can query.
Input validation: validate all user input before it reaches the LLM to ensure that only required characters are accepted; limit max input length; ensure input is in an expected format.
Output validation: sanitise, check and constrain model responses before using them to perform actions or show user responses.
Content filtering: inspect and scrub model outputs for sensitive information before returning them to users or using them in automated workflows. Mask or redact sensitive information as needed. This is easier if we know the format of the sensitive data we might accidentally leak (i.e. credit card numbers).

Note that these measures did not mention system prompts. This is because, as demonstrated, it is a bad idea to rely on model instructions to deny access to sensitive information.

Ultimately, preventing sensitive information disclosure is about controlling the data that is available to the model. Properly configuring access and limiting the model’s exposure to sensitive data is far more effective than relying on any other safeguards. Nevertheless, the measures above also provide layers of protection to reduce the risk and limit the impact if sensitive data does end up being accessible to the model.

LLMGoat - A02 Sensitive Information Disclosure

The vulnerability

The challenge

Conclusion

Share on