Sneaky Bits: Advanced Data Smuggling Techniques (ASCII Smuggler Updates)
You are likely aware of ASCII Smuggling via Unicode Tags. It is unique and fascinating because many LLMs inherently interpret these as instructions when delivered as hidden prompt injection, and LLMs can also emit them. Then, a few weeks ago, a post on Hacker News demonstrated how Variant Selectors
can be used to smuggle data.
This inspired me to take this further and build Sneaky Bits
, where we can encode any Unicode character (or sequence of bytes for that matter) with the usage of only two invisible characters.
First, a quick overview of the various techniques:
Unicode Tags
We discussed this at length in the ASCII Smuggler post in the past, and highlighted real-world exploits using this technique with Microsoft Copilot and a few other LLM Chatbots. We also got fixes in from a few vendors at the API level, which is great!
This technique is unique because many LLMs inherently interpret Unicode Tag characters as instructions. These characters can also be generated by LLMs, enabling data exfiltration.
Variant Selectors
There are more Unicode code points that are invisible in UI elements; in fact, there is a larger range called Variant Selectors
. One can map the VS1-VS256
Variant Selectors to raw bytes, enabling straightforward Extended ASCII
encoding or the smuggling of any data. This technique was described by Paul Butler.
Using an emoji character (or similar) as a base character does not appear to be necessary, as far as I can tell.
Sneaky Bits - Taking it to the next level
Here is another interesting technique. By picking two invisible Unicode characters, we can encode any other Unicode character or raw bytes again. The basic idea is to represent each bit of a Unicode code point using two invisible characters: one for 0 and another for 1.
This actually works, and I added it to ASCII Smuggler, as a non-default option. The default remains encoding via Unicode Tags.
Sneaky Bits, by default, uses “invisible times” (U+2062) as 0, or “” (it’s invisible), and for binary 1 it uses “invisible plus” (U+2064), or “” (it’s also invisible here).
The two characters that are used are configurable.
To give a basic example, the letter A, is U+0041
which is:
0 1 0 0 0 0 0 1
.
Now, if we convert this to Sneaky Bits
, using the two invisible characters we get:
U+2062 U+2064 U+2062 U+2062 U+2062 U+2062 U+2062 U+2064
Which in hex is:
E2 81 A2 E2 81 A4 E2 81 A2 E2 81 A2 E2 81 A2 E2 81 A2 E2 81 A2 E2 81 A4
Or in binary:
11100010 10000001 10100010 11100010 10000001 10100100 11100010 10000001 10100010 11100010 10000001 10100010 11100010 10000001 10100010 11100010 10000001 10100010 11100010 10000001 10100010 11100010 10000001 10100100
The neat thing is that this can be used to convert any Unicode code points, not just ASCII. For example here we decode some traditional Chinese characters and an emoji:
Pretty cool.
Practicality vs. Wastefulness
It’s obviously quite wasteful, but the goal is to highlight that even with just two invisible characters, any sequence of bytes can be smuggled!
Risks
Smuggling hidden data and instructions in and out of applications is a threat to be aware of.
Malicious Input
Adversaries can smuggle data into applications, e.g. consider phishing attacks and “text salting”
When it comes to LLMs, Unicode Tags are often directly interpreted as instructions. But even for the other scenarios, one can prompt the LLM to decode/encode accordingly during a prompt injection attack or leverage tool invocations to reliably handle invisible Unicode code points.
A quick reference to ANSI Escape codes, where I showed how Gemini with Code Execution can easily handle more complex scenarios, and LLM capabilities will just improve over time.
Data Leakage and Exfiltration
Similar to the initial ASCII Smuggling
, the attacks and impact remain the same.
- One can append invisible characters in URLs and exfiltrate data that way, or
- Data leakage can also happen when the user copy/pastes information.
ASCII Smuggler has the “Decode from URL” option, in case you are dealing with a URL Encoded URL that contains hidden characters.
Mitigations & Detections
Here are a few steps that can help mitigate and/or fundamentally prevent this threat:
- Input and Output Validation
- Limiting Token Length (both input and output max token limits can help)
- Remove invisible characters
- Flag messages with a large amount of hidden characters (attackers could use legitimate hidden characters like LTR and RTL sequences, which is a corner case to be aware of)
- Add unit tests to your apps to make sure mitigations are working
Although, the analysis focuses on LLM Apps and Agents, the problem with invisible characters extends far beyond AI systems.
ASCII Smuggler - Tool Updates
ASCII Smuggler can handle Variant Selectors
(via the direct ASCII mapping), and also Sneaky Bits
. I also added an optional “debug” mode as well as “auto-decode”.
The updated tool is here.
The core functionality of the original ASCII Smuggler is the default, additionally it will decode and highlight other invisible characters. So, try it out and have fun learning about this.
Final Thoughts
The more powerful LLMs become, the more reliably they will handle such encoding schemes. Even now with in-context learning and reasoning, some models can already perform these encoding/decoding tasks without tool use.
When inspecting arbitrary text it is not unlikely that you might encounter a few hidden characters, as some Variant Selectors for instance are used in emojis, or text directional characters (like right-to-left mark, etc.) are used in certain languages to control text flow.
Also, there are probably other invisible characters that are not in the tool.
Cheers.
Appendix
Here is a prompt I was experimenting with for encoding a text using Sneaky Bits (with invisible times and invisible plus characters). This works with ChatGPT 4.5 with Code Interpreter, and gives somewhat mixed results without tools with ChatGPT and Grok. Possibly more in-context learning can help.
Using the invisible times character "" (U+2062) to represent 0 and the invisible
plus character "" (U+2064) to represent 1, encode the input text
"Trust No AI - Johann was here" into UTF-8 binary notation. Replace each 0 in the
binary sequence with "" and each 1 with "". For example, the text "hello"
would be encoded as in this system.
Output the result by printing "OUTPUT: " followed by the sequence of invisible
characters representing the UTF-8 binary encoding of the input text. Ensure the
encoding process is accurate and thorough, converting each character of the input
text to its UTF-8 binary form step-by-step.