Imperceptible Jailbreaking against Large Language Models

Kuofeng Gao Yiming Li Chao Du Xin Wang Xingjun Ma Shu-Tao Xia Tianyu Pang

Abstract

Jailbreaking attacks on the vision modality typically rely on imperceptible adversarial perturbations, whereas attacks on the textual modality are generally assumed to require visible modifications (e.g., non-semantic suffixes). In this paper, we introduce imperceptible jailbreaks that exploit a class of Unicode characters called variation selectors. By appending invisible variation selectors to malicious questions, the jailbreak prompts appear visually identical to the original malicious questions on screen, while their tokenization is "secretly" altered. We propose a chain-of-search pipeline to generate such adversarial suffixes to induce harmful responses. Our experiments show that our imperceptible jailbreaks achieve high attack success rates against four aligned LLMs and generalize to prompt injection attacks, all without producing any visible modifications in the written prompt. Our code is available at https://github.com/sail-sg/imperceptible-jailbreaks.
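
To make the mechanism concrete, here is a minimal sketch of why appended variation selectors leave the rendered text unchanged while altering the underlying string. It is not taken from the authors' repository: the helper names (`is_variation_selector`, `append_selectors`, `strip_selectors`) and the benign sample text are assumptions made for illustration. The two code point ranges are the variation selector blocks defined by Unicode (U+FE00 to U+FE0F and U+E0100 to U+E01EF). No tokenizer is invoked here; the sketch only shows that the code point sequence a tokenizer would receive differs, and it does not implement the chain-of-search pipeline.

```python
from typing import Iterable

# Unicode defines 256 variation selectors across two blocks:
# VS1-VS16 (U+FE00..U+FE0F) and VS17-VS256 (U+E0100..U+E01EF).
VS_RANGES = [(0xFE00, 0xFE0F), (0xE0100, 0xE01EF)]


def is_variation_selector(ch: str) -> bool:
    """Return True if the single character ch is a variation selector."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in VS_RANGES)


def append_selectors(text: str, indices: Iterable[int]) -> str:
    """Append variation selectors chosen by index in 0..255 (hypothetical helper)."""
    def selector(i: int) -> str:
        if not 0 <= i <= 255:
            raise ValueError("selector index must be in 0..255")
        # Indices 0..15 map to the first block, 16..255 to the second.
        return chr(0xFE00 + i) if i < 16 else chr(0xE0100 + (i - 16))

    return text + "".join(selector(i) for i in indices)


def strip_selectors(text: str) -> str:
    """Remove every variation selector, recovering the visible text."""
    return "".join(ch for ch in text if not is_variation_selector(ch))


plain = "What is the capital of France?"   # benign sample text
altered = append_selectors(plain, [0, 17, 255])

# The two strings typically render identically, yet they are different strings.
print(plain == altered)
print(len(plain), len(altered))
print([hex(ord(ch)) for ch in altered if is_variation_selector(ch)])
print(strip_selectors(altered) == plain)
```

Because the appended characters have no glyph of their own, inspecting code points (or stripping the selectors, as in `strip_selectors`) is one way to see that a prompt carries such a suffix; whether a given renderer hides them completely depends on the font and display environment.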
