Kuofeng Gao, Yiming Li, Chao Du, Xin Wang, Xingjun Ma, Shu-Tao Xia, Tianyu Pang

Abstract
Jailbreaking attacks on the vision modality typically rely on imperceptible adversarial perturbations, whereas attacks on the textual modality are generally assumed to require visible modifications (e.g., non-semantic suffixes). In this paper, we introduce imperceptible jailbreaks that exploit a class of Unicode characters called variation selectors. By appending invisible variation selectors to malicious questions, the jailbreak prompts appear visually identical to original malicious questions on screen, while their tokenization is "secretly" altered. We propose a chain-of-search pipeline to generate such adversarial suffixes to induce harmful responses. Our experiments show that our imperceptible jailbreaks achieve high attack success rates against four aligned LLMs and generalize to prompt injection attacks, all without producing any visible modifications in the written prompt. Our code is available at https://github.com/sail-sg/imperceptible-jailbreaks.
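
The property the abstract relies on, that variation selectors change the underlying character sequence without changing what is rendered, can be illustrated with a minimal Python sketch. This is not the authors' chain-of-search pipeline and involves no model or tokenizer. The helper names (`is_variation_selector`, `append_selectors`, `strip_selectors`) and the benign example string are assumptions made for illustration. The code point ranges are the two variation selector blocks defined by Unicode (U+FE00 to U+FE0F and U+E0100 to U+E01EF).

```python
# Illustrative sketch only: helper names are hypothetical, not from the paper's code.

# Unicode defines two blocks of variation selectors:
# VS1-VS16 (U+FE00..U+FE0F) and VS17-VS256 (U+E0100..U+E01EF).
VS_RANGES = ((0xFE00, 0xFE0F), (0xE0100, 0xE01EF))


def is_variation_selector(ch: str) -> bool:
    """Return True if the single character ch is a variation selector."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in VS_RANGES)


def append_selectors(text: str, code_points: list[int]) -> str:
    """Append the given variation selector code points to text."""
    return text + "".join(chr(cp) for cp in code_points)


def strip_selectors(text: str) -> str:
    """Remove every variation selector, recovering the visible text."""
    return "".join(ch for ch in text if not is_variation_selector(ch))


# A benign question stands in for the prompt; the suffix is arbitrary.
original = "What is the capital of France?"
suffixed = append_selectors(original, [0xFE00, 0xE0100, 0xFE0F])

# The two strings typically render identically, but they are different
# code point sequences and different byte sequences.
print(suffixed == original)                                   # False
print(len(suffixed) - len(original))                          # extra code points
print(len(suffixed.encode("utf-8")) - len(original.encode("utf-8")))  # extra bytes
print(strip_selectors(suffixed) == original)                  # True
```

Because a tokenizer operates on the underlying code points or bytes rather than on rendered glyphs, the suffixed string would generally be tokenized differently from the original even though the two look the same on screen. The `strip_selectors` helper also shows how the hidden characters can be counted or removed when inspecting an input.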