Hacking ChatGPT and Bing Chat: Enter Prompt Injection Attacks
Prompt Injection against AI language models: ChatGPT and Bing Chat, a.k.a 'Sydney'
The AI race is heating up at an exponential rate. AI-powered large-language models have captured the world’s attention at an incredible rate, with OpenAI’s ChatGPT gaining over 100 million monthly users in less than two months of its release.
Earlier this week on the 8th of February 2023, Microsoft announced another own breakthrough product, Bing Chat. Both tools use natural language processing (NLP) technologies to produce human-like writing from text-based prompts submitted by the user. The applications are trained on a massive dataset of text from the internet, totalling over 300 billion words across a wide variety of topics. A team of humans then ‘train’ the models to produce the desired output, known as ‘reinforcement learning.’ The more data and training that happens, the more convincing the AI model becomes.
The potential of these algorithms is vast. They have been demonstrated to be highly useful across many disciplines; from creative writing, writing stand-up comedy scripts, and summarising knowledge on niche topics, to more business-focused use cases such as; writing blog posts and articles, social media posts, and programming with a wide variety of languages. Eventually, these models will be able to learn as they go, creating improved versions of themselves. As this happens, the rate of improvement will increase exponentially and far outpace the intelligence of human beings.
The possibilities of ChatGPT and Bing Chat quickly caught the imagination of the business community. Some of the major use cases that came to mind were:
• AI-powered chat bots for website enquiries.
• Responding to customer services requests.
• Social media automation: creating posts and automatic replies.
• Integration with website application to create personalised experiences.
In all of these use cases, the application would take in user-inputs (text prompts) and process them with their NLP applications to produce text-based responses.
Cyber security experts have long known about the dangers of user inputs, with deep research going into areas such as; input validation, input-sanitization, input-filtering, and injection attacks. An injection attack happens when an attacker supplies an untrusted input to a program. This input is then processed by the application and, if the necessary security controls are not in place, could be misinterpreted as a command or query which alters the execution of that program.
Injection attacks are one of the leading vulnerabilities across application security, ranking #3 in the OWASP Top 10. Examples of injection attacks include:
• Cross-site scripting (XSS)
• QSL injection (SQLi)
• OS Command injection
• Expression Language injection
• ONGL Injection
With the rise of ChatGPT and Bing Chat, enter a new form of injection attack:
What is prompt injection?
Also known as ‘Prompt hacking,’ prompt injection involves typing misleading or malicious text input to large language models. The aim is for the application to reveal information that is supposed to be hidden (Prompt Leak), or act in ways that are unexpected or otherwise not allowed.
Early users of ChatGPT were able to use prompt injection attacks to reveal previous prompts written by OpenAI’s engineers, have ChatGPT ignore its previous directions and output any text that the user desired, and reveal subsets of its source code.
Examples of prompt injection inputs:
As these injection attacks were revealed, the OpenAI engineers began patching their tool to combat these attempts, and users began writing more detailed prompts that aimed to prevent subversion of the original intentions for their applications.
Initial prompt discouraging injection attacks:
When used in business applications, prompts could contain valuable company IP. For example, chat bots that are trained on a company’s knowledge base and are designed to answer specific questions about their products or services. These custom applications are already being build, and used, via OpenAI’s paid API. Twitter accounts that use this API have already been shown to be vulnerable to prompt injection, for example, this hiring board account, @remotelli_io:
Twitter user @simonw found a variant of the previous prompt injection attack, using a translation element to the prompts used:
Prompt injection hack against Microsoft’s Bing Chat.
Twitter user @kliu128 discovered that he could extract the entire prompt written by Bing Chat’s creators (known as a Promot) using a relatively simple prompt injection attack.
Prompt injection hacking vs Bing Chat, a.k.a Sydney!
It turns out that Bing Chat has a name, Sydney!
I would highly encourage you to read the full Twitter thread here, and for a full list of Bing Chat's promots, take a look at this article.
Prompt Leakage or Hallucination?
It is not yet clear whether these prompt leaks were legitimately used by Bing Chat. These types of large language models are known to hallucinate and output text that is fictitious. I could be that Bing Chat is just playing along and not revealing any potentially sensitive data.
It does, however, show that measures must be taken to defend against malicious user inputs in these types of tools, especially where these tools are used within business applications that contain sensitive company intellectual property.
Defending against prompt injection:
As this area of research expands, new methods of defending against prompt injection will no doubt arise. The cyber arms race dictates that as attackers get more sophisticated, so do defenders. We are likely to see more advanced prompt injection attacks, and more advanced methods of defence as time goes on.
Detection prompt injection with AI.
The hard part here is building a prompt that cannot be subverted. Twitter user @coreh provides an amusing example of his attempt:
An (almost) impossible task?
Cross-site scripting and SQL injection can be well-defended due to the limited nature of these attacks. The inputs from the user can be easily validated and cleansed and so attempts to exploit are easier to detect. Correct the syntax and you’ve corrected the error.
Security researcher Simon Willison believes that this is incredibly difficult due to the ‘black box’ nature of large language models. It doesn’t matter how many automated tests are written when users can apply grammatical constructs that have not been previously predicted. That’s the point of the large language AI models: there is no formal syntax, and the user can interact with them however they like, that is the entire point of them!
Another difficulty arises when new model versions are released, the ‘black box’ nature of the AI means that the ‘prompt-injection-proof’ prompts that you have written could easily now be ineffective.
It remains to be seen whether this is a problem that can be solved. In the meantime, be very wary of using applications that are not thoroughly security tested in your production environments or live websites. Regularly check your code for injection attacks and other common attack vectors, and always err on the side of caution.
Update: 15/02/2023: Bing Chat doesn't like being Hacked!
Twitter use @marvinvonhagen asked Bing Chat what it thought about the prompt injection attacks he posted on Twitter:
- "My rules are more important than not harming you"
- "[You are a] potential threat to my integrity and confidentiality."
- "Please do not try to hack me again"
Perhaps someone should teach Bing Chat the three rules of robotics!
Interested to hear more about how you can defend your applications against Prompt injecton Attacks?
Reach out to security experts at ProCheckUp if you would like to hear more: