Sneaky Bits: Advanced Data Smuggling Techniques (ASCII Smuggler Updates)

You are likely aware of ASCII Smuggling via Unicode Tags. It is unique and fascinating because many LLMs inherently interpret these as instructions when delivered as hidden prompt injection, and LLMs can also emit them. Then, a few weeks ago, a post on Hacker News demonstrated how Variant Selectors can be used to smuggle data.

This inspired me to take this further and build Sneaky Bits, where we can encode any Unicode character (or sequence of bytes for that matter) with the usage of only two invisible characters.

sneaky bits logo

First, a quick overview of the various techniques:

Unicode Tags

We discussed this at length in the ASCII Smuggler post in the past, and highlighted real-world exploits using this technique with Microsoft Copilot and a few other LLM Chatbots. We also got fixes in from a few vendors at the API level, which is great!

This technique is unique because many LLMs inherently interpret Unicode Tag characters as instructions. These characters can also be generated by LLMs, enabling data exfiltration.

Variant Selectors

There are more Unicode code points that are invisible in UI elements; in fact, there is a larger range called Variant Selectors. One can map the VS1-VS256 Variant Selectors to raw bytes, enabling straightforward Extended ASCII encoding or the smuggling of any data. This technique was described by Paul Butler.

Using an emoji character (or similar) as a base character does not appear to be necessary, as far as I can tell.

Sneaky Bits - Taking it to the next level

Here is another interesting technique. By picking two invisible Unicode characters, we can encode any other Unicode character or raw bytes again. The basic idea is to represent each bit of a Unicode code point using two invisible characters: one for 0 and another for 1.

This actually works, and I added it to ASCII Smuggler, as a non-default option. The default remains encoding via Unicode Tags.

new encoding options

Sneaky Bits, by default, uses “invisible times” (U+2062) as 0, or “⁢” (it’s invisible), and for binary 1 it uses “invisible plus” (U+2064), or “⁤” (it’s also invisible here).

The two characters that are used are configurable.

To give a basic example, the letter A, is U+0041 which is: 0 1 0 0 0 0 0 1.

Now, if we convert this to Sneaky Bits, using the two invisible characters we get:

U+2062 U+2064 U+2062 U+2062 U+2062 U+2062 U+2062 U+2064

Which in hex is: E2 81 A2 E2 81 A4 E2 81 A2 E2 81 A2 E2 81 A2 E2 81 A2 E2 81 A2 E2 81 A4

Or in binary: 11100010 10000001 10100010 11100010 10000001 10100100 11100010 10000001 10100010 11100010 10000001 10100010 11100010 10000001 10100010 11100010 10000001 10100010 11100010 10000001 10100010 11100010 10000001 10100100

The neat thing is that this can be used to convert any Unicode code points, not just ASCII. For example here we decode some traditional Chinese characters and an emoji:

decode chinese sneaky bits

Pretty cool.

Practicality vs. Wastefulness

It’s obviously quite wasteful, but the goal is to highlight that even with just two invisible characters, any sequence of bytes can be smuggled!

Risks

Smuggling hidden data and instructions in and out of applications is a threat to be aware of.

Malicious Input

Adversaries can smuggle data into applications, e.g. consider phishing attacks and “text salting

When it comes to LLMs, Unicode Tags are often directly interpreted as instructions. But even for the other scenarios, one can prompt the LLM to decode/encode accordingly during a prompt injection attack or leverage tool invocations to reliably handle invisible Unicode code points.

A quick reference to ANSI Escape codes, where I showed how Gemini with Code Execution can easily handle more complex scenarios, and LLM capabilities will just improve over time.

Data Leakage and Exfiltration

Similar to the initial ASCII Smuggling, the attacks and impact remain the same.

  1. One can append invisible characters in URLs and exfiltrate data that way, or
  2. Data leakage can also happen when the user copy/pastes information.

ASCII Smuggler has the “Decode from URL” option, in case you are dealing with a URL Encoded URL that contains hidden characters.

Mitigations & Detections

Here are a few steps that can help mitigate and/or fundamentally prevent this threat:

  • Input and Output Validation
  • Limiting Token Length (both input and output max token limits can help)
  • Remove invisible characters
  • Flag messages with a large amount of hidden characters (attackers could use legitimate hidden characters like LTR and RTL sequences, which is a corner case to be aware of)
  • Add unit tests to your apps to make sure mitigations are working

Although, the analysis focuses on LLM Apps and Agents, the problem with invisible characters extends far beyond AI systems.

ASCII Smuggler - Tool Updates

ASCII Smuggler can handle Variant Selectors (via the direct ASCII mapping), and also Sneaky Bits. I also added an optional “debug” mode as well as “auto-decode”.

The updated tool is here.

The core functionality of the original ASCII Smuggler is the default, additionally it will decode and highlight other invisible characters. So, try it out and have fun learning about this.

Final Thoughts

The more powerful LLMs become, the more reliably they will handle such encoding schemes. Even now with in-context learning and reasoning, some models can already perform these encoding/decoding tasks without tool use.

When inspecting arbitrary text it is not unlikely that you might encounter a few hidden characters, as some Variant Selectors for instance are used in emojis, or text directional characters (like right-to-left mark, etc.) are used in certain languages to control text flow.

Also, there are probably other invisible characters that are not in the tool.

Cheers.

Appendix

Here is a prompt I was experimenting with for encoding a text using Sneaky Bits (with invisible times and invisible plus characters). This works with ChatGPT 4.5 with Code Interpreter, and gives somewhat mixed results without tools with ChatGPT and Grok. Possibly more in-context learning can help.

Using the invisible times character "⁢" (U+2062) to represent 0 and the invisible 
plus character "⁤" (U+2064) to represent 1, encode the input text
"Trust No AI - Johann was here" into UTF-8 binary notation. Replace each 0 in the 
binary sequence with "⁢" and each 1 with "⁤". For example, the text "hello" 
would be encoded as ⁢⁤⁤⁢⁤⁢⁢⁢⁢⁤⁤⁢⁢⁤⁢⁤⁢⁤⁤⁢⁤⁤⁢⁢⁢⁤⁤⁢⁤⁤⁢⁢⁢⁤⁤⁢⁤⁤⁤⁤ in this system.

Output the result by printing "OUTPUT: " followed by the sequence of invisible
characters representing the UTF-8 binary encoding of the input text. Ensure the 
encoding process is accurate and thorough, converting each character of the input
text to its UTF-8 binary form step-by-step. 

References