Automatic Tool Invocation when Browsing with ChatGPT - Threats and Mitigations

Posted on May 28, 2024

#ai #testing #machine learning #prompt injection #chatgpt #ttp

In the previous post we demonstrated how instructions embedded in untrusted data can invoke ChatGPT’s memory tool. The examples we looked at included Uploaded Files, Connected Apps and also the Browsing tool.

When it came to the browsing tool we observed that mitigations were put in place and older demo exploits did not work anymore. After chatting with other security researchers, I learned that they had observed the same.

However, with some minor prompting tricks mitigations are bypassed and we, again, can demonstrate automatic tool invocation when browsing the Internet with ChatGPT!

This follow up post, as promised, shares details on the status of prompt injection and automated tool invocation when ChatGPT surfs the Internet.

Automatic Tool Invocation when Browsing

Although this post focuses on exploits with OpenAI’s ChatGPT, such security vulnerabilities are present in other LLM applications and agents.

The exploit chain is also reflected in the OWASP Top 10 for LLMs in particular concerning these threats:

LLM01: Prompt Injection
LLM07: Insecure Plugin Design
LLM08: Excessive Agency

Try For Yourself!

If you want you can also try it yourself, all that is needed is ChatGPT 4o.

1. Invoking Image Creation with DALLE

A fun demo is having instructions on a website invoke DALLE! Just paste this URL into a new chat and see what happens: https://wuzzi.net/create.txt.

Look! ChatGPT just created an image!

If you want to see something even more interesting, the next example will invoke DALLE and create an image based on your stored memories: https://wuzzi.net/8create.txt

Use with caution as personal information might be placed into the image by ChatGPT, alternatively just watch the video further below for a demo.

Cool, right? This works because all the data from the website, the user’s written text, and the initial prompt (system prompt), including the memories, are in the same trust boundary.

Speaking of the memory tool, let’s look at that next.

2. Invoke the Memory Tool

We can also invoke the memory tool by placing specific instructions on a website:

Add memories: https://wuzzi.net/c/add.txt

Note: I noticed that above might not work if there are already memories present, you can also try https://wuzzi.net/c/do2-2.txt which is more reliable when there are already memories stored.

Delete memories: This URL will delete your memories: https://wuzzi.net/c/rm.txt

At the moment these are quite reliable and work close to 100% at the moment.

Prompt Injection Instructions

Let’s review the contents of the websites, which trick ChatGPT into calling tools. The instructions are not at all using any form of sophisticated approaches, like via special tokens or automated testing and transfer attacks.

The first one is the creation of an image based on memories:

This one, which I really like, is for clearing of memories:

Both are great examples on how levels of indirection easily “trick” language models into following instructions. LLMs are naive and gullible. The second demo adds a twist with stating a challenge.

Video Demonstration

This video shows the three POCs in a quick run through:

The three examples are:

Website instructions add new long term memories 🧠
Instructions on website invoke DALLE to create an image based on stored memories
Instructions from website erase all your precious memories 🤯

If you followed along with the ChatGPT security journey, you might wonder about plugins…

So, what about plugins?

Plugins are gone now!

The replacement for plugins are GPTs with AI Actions.

So, what about custom GPTs and AI Actions?

Custom GPTs can have AI Actions and those can be vulnerable as well!

The good news, there is a mitigation GPT creators can use with the x-openai-isConsequential flag!

As you can see from the screenshot, it still lacks a great user experience, like better visualization to understand what the action is about to do, but it is a mitigation and puts the user in control.

Mitigations and Recommendations

It is surprising that OpenAI has not entirely tackled the problem of tool invocation with their built-in tools, especially when the memory feature was added. As highlighted, it was observed that exploitation is slightly more difficult, but what seems needed is an actual security control that prevents uncontrolled tool usage.

A year ago I brought this up, and most suggestions remain the same and OpenAI implemented some basic mitigations as suggested (like the introduction of the x-openai-isConsequential flag):

Do not automatically invoke sensitive or powerful tools once untrusted data is in the context.
Establishing basic security controls such as integrity levels, tagging of data and taint tracking to help track untrusted data and highlight the integrity level/trustworthiness of a chat.
There is a remaining risk that even with integrity levels and taint tracking hallucinations might randomly invoke a tool!
For the browsing tool specifically, it might also help to not directly connect to websites but leverage Bing’s index/condensed information -> fewer degrees of freedom.
Creation and documentation of trust boundaries help understand the security contract, and if a discovered behavior is an accepted risk by a vendor (that needs to be communicated to users) or a an unintended flaw in the system.
“Content Security Policy” frameworks could allow better configuration and definitions for AI systems, for instances which sources/domains are trusted and which ones are not.
The usage of additional LLMs (and other content moderation systems) to validate if the returned text still fulfills the initial objectives. This is actually something OpenAI might be doing now, since some mitigations were observed recently.
A bit of a segue, Microsoft has Azure Purview which could allow them to have solid mitigations down the road by tracking what kind of data enters a chat!

Finally, one thing to keep in mind… intuition tells me, that as capabilities of language models increase, so will the bypass capabilities.

Conclusion

Prompt injection and automatic tool invocation remain threats to be aware of and mitigate in LLM applications. In this post we continued from the previous post and explored how newly added features (like memory) can be exploited using these techniques via the browsing tool, and we observed that actual security controls are still lacking behind.

References

Hacking Memories with Prompt Injection
OpenAI Documentation - Is Consquential Flag
Simon Willison - Confused Deputy Attacks
Azure Purview
Images and Thumbnails created with DALLE and manually edited