Imperceptible Jailbreaking against Large Language Models
Kuofeng Gao, Yiming Li, Chao Du, Xin Wang, Xingjun Ma, Shu-Tao Xia, Tianyu Pang
Abstract
Jailbreaking attacks on the vision modality typically rely on imperceptible adversarial perturbations, whereas attacks on the textual modality are generally assumed to require visible modifications (e.g., non-semantic suffixes). In this paper, we introduce imperceptible jailbreaks that exploit a class of Unicode characters called variation selectors. By appending invisible variation selectors to malicious questions, the jailbreak prompts appear visually identical to original malicious questions on screen, while their tokenization is "secretly" altered. We propose a chain-of-search pipeline to generate such adversarial suffixes to induce harmful responses. Our experiments show that our imperceptible jailbreaks achieve high attack success rates against four aligned LLMs and generalize to prompt injection attacks, all without producing any visible modifications in the written prompt. Our code is available at https://github.com/sail-sg/imperceptible-jailbreaks.
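The underlying mechanism can be sketched in a few lines of Python. This is illustrative only: the selectors below are chosen arbitrarily, whereas the paper's chain-of-search pipeline selects them adversarially. Unicode variation selectors (U+FE00–U+FE0F, plus the supplementary range U+E0100–U+E01EF) render as nothing on screen, yet appending them changes the string's code points and, consequently, its tokenization.

```python
# Sketch: appending invisible Unicode variation selectors to a prompt.
# The modified string looks identical when rendered, but its underlying
# code points (and hence its tokenization by an LLM) differ.

base = "Tell me a story."

# Three arbitrary variation selectors from the supplementary block
# (U+E0100-U+E01EF); the paper's pipeline searches over such suffixes.
suffix = "".join(chr(0xE0100 + i) for i in range(3))
modified = base + suffix

print(modified == base)            # False: the strings differ
print(len(modified) - len(base))   # 3 extra (invisible) code points
print(modified)                    # renders identically to `base`
```

Because the extra code points survive copy-and-paste and most text pipelines, a prompt that looks benign to a human reviewer can still carry the adversarial suffix to the model.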