HyperAIHyperAI

Command Palette

Search for a command to run...

Large Language Models를 위한 In-Context Watermarks

Yepeng Liu Xuandong Zhao Christopher Kruegel Dawn Song Yuheng Bu

초록

민감한 애플리케이션 분야에서 Large Language Models (LLMs)의 활용이 증가함에 따라, AI 생성 텍스트의 출처를 확인하고 책임성을 보장하기 위한 효과적인 워터마킹(watermarking) 기술의 필요성이 대두되고 있습니다. 그러나 기존의 대부분의 워터마킹 방식은 디코딩(decoding) 과정에 대한 접근 권한을 필요로 하기 때문에, 실제 환경에서의 적용이 제한적이라는 한계가 있습니다. 대표적인 예로 학술 논문 심사 과정에서 부정직한 심사위원이 LLMs를 사용하는 경우를 들 수 있습니다. 이 경우 컨퍼런스 운영진은 사용된 모델에 접근할 수 없으면서도 AI가 생성한 심사평을 탐지해야 하는 상황에 놓입니다.이러한 한계를 해결하기 위해, 본 논문에서는 LLMs의 in-context learning 및 instruction-following 능력을 활용하여 prompt engineering만으로 생성된 텍스트에 워터마크를 삽입하는 In-Context Watermarking (ICW)을 제안합니다. 우리는 서로 다른 세분화 수준(granularity)을 가진 네 가지 ICW 전략을 조사하였으며, 각 전략에 최적화된 탐지 방법을 함께 제시합니다. 나아가, 학술 원고와 같은 입력 문서를 수정하여 워터마킹을 은밀하게 유도하는 Indirect Prompt Injection (IPI) 설정을 구체적인 사례 연구로 검토합니다.실험 결과, ICW는 모델에 구애받지 않는(model-agnostic) 실용적인 워터마킹 접근 방식으로서 그 타당성을 입증하였습니다. 또한, 본 연구의 결과는 LLMs의 능력이 향상됨에 따라 ICW가 확장 가능하고 접근성이 높은 콘텐츠 속성 식별(content attribution)을 위한 유망한 방향임을 시사합니다.

One-sentence Summary

To address the limitations of existing watermarking methods that require access to the decoding process, researchers propose In-Context Watermarking (ICW), a model-agnostic approach that leverages the in-context learning and instruction-following abilities of large language models to embed watermarks solely through prompt engineering, a method validated through four granularity-based strategies and a case study on Indirect Prompt Injection in academic peer review.

Key Contributions

  • This work introduces In-Context Watermarking (ICW), a model-agnostic technique that embeds watermarks into generated text through prompt engineering rather than modifying the decoding process.
  • The paper investigates four distinct ICW strategies at varying levels of granularity and develops tailored detection methods for each to ensure effective content attribution.
  • The research demonstrates the feasibility of ICW in both direct and Indirect Prompt Injection (IPI) settings, showing that watermarking instructions can be covertly embedded in documents to be triggered by an LLM.

Introduction

As large language models (LLMs) are increasingly used for sensitive tasks like academic peer review, there is a critical need for watermarking techniques to ensure content provenance and accountability. Most existing watermarking methods require direct access to the model's decoding process, which limits their utility for third parties who do not control the underlying AI infrastructure. The authors leverage the in-context learning and instruction-following capabilities of LLMs to introduce In-Context Watermarking (ICW). This approach embeds watermarks into generated text solely through prompt engineering, making it a model-agnostic solution that works via standard APIs. The authors propose four distinct ICW strategies and demonstrate their effectiveness in both direct system prompting and indirect prompt injection scenarios.

Method

The authors propose an In-Context Watermarking (ICW) framework that leverages the instruction-following capabilities of Large Language Models (LLMs) to embed detectable signals within generated text. This approach is primarily explored through two settings: the Direct Text Stamp (DTS) setting, where watermarking instructions are provided directly in the system prompt, and the Indirect Prompt Injection (IPI) setting, which enables the tracing of AI misuse by covertly embedding instructions within a larger context, such as a document.

As shown in the figure below: The general framework consists of a watermark generation process and a detection process. In the generation phase, a user provides a query QQQ alongside a watermarking instruction Instruction(k,τ)\text{Instruction}(\mathbf{k}, \tau)Instruction(k,τ), where k\mathbf{k}k is a secret key and τ\tauτ is the specific watermarking scheme. The LLM M\mathcal{M}M then produces a watermarked response y\mathbf{y}y through the operation: yM(Instruction(k,τ)Q)\mathbf{y} \gets \mathcal{M}(\text{Instruction}(\mathbf{k}, \tau) \oplus Q)yM(Instruction(k,τ)Q) The detection process is agnostic to the specific LLM used. A detector D(k,τ)D(\cdot|\mathbf{k}, \tau)D(k,τ) analyzes the suspect text y\mathbf{y}y using the secret key and scheme to perform hypothesis testing. The text is identified as watermarked if the detector output meets or exceeds a predefined threshold η\etaη, effectively distinguishing between text generated with knowledge of the watermark versus text generated without it.

To demonstrate the practical utility of this method, the authors present an IPI scenario. Refer to the framework diagram: In this setting, an authority (such as a conference organizer) embeds a hidden instruction into a document. If a user (such as a reviewer) inputs this document into an LLM to generate a response, the LLM processes the hidden instruction, resulting in a watermarked output that can be detected.

The authors implement four distinct ICW strategies categorized by their linguistic granularity:

  1. Unicode ICW: This is the simplest method, instructing the LLM to insert specific invisible Unicode characters, such as zero-width spaces (2˘00B\text{\u200B}2˘00B), after each word. Detection is performed by calculating the ratio of these characters in the suspect text.
  2. Initials ICW: This strategy encourages the LLM to use words that begin with a predefined set of "green" letters AG\mathcal{A}_GAG. Detection involves computing a z-statistic based on the frequency of these initial letters compared to their expected distribution in natural language.
  3. Lexical ICW: Inspired by green/red list watermarking, this method provides a specific set of "green" words VG\mathcal{V}_GVG and instructs the LLM to maximize their usage. The authors focus on stylistic word classes like adjectives, adverbs, and verbs to maintain text quality. Detection is achieved by measuring the frequency of words from the green list.
  4. Acrostics ICW: This sentence-level strategy requires the LLM to structure its response such that the first letter of each sentence follows a secret sequence ζ\zetaζ. Detection is performed by calculating the Levenshtein distance between the sequence of sentence-initial letters in the suspect text and the secret key sequence.

Experiment

The experiments evaluate In-Context Watermarking (ICW) methods across Direct Text Stamp (DTS) and Indirect Prompt Injection (IPI) settings using advanced proprietary LLMs to validate detection performance, robustness against editing attacks, and text quality. The results demonstrate that ICW effectiveness is highly dependent on the underlying model's instruction-following capabilities, with more advanced models enabling superior detection and robustness. While different strategies offer varying trade-offs between resilience to paraphrasing and ease of implementation, the ICW methods overall maintain high text quality comparable to human-generated content and remain effective even under adaptive removal attempts.

The authors evaluate various In-Context Watermarking (ICW) methods and baselines across two different settings: Direct Text Stamp (DTS) and Indirect Prompt Injection (IPI). Results demonstrate that the effectiveness of ICW methods is highly dependent on the capabilities of the underlying language model. Unicode ICW achieves consistently high detection performance across both tested language models and both experimental settings. While Initials and Acrostics ICW show low detection performance with less capable models, they achieve much higher performance when using more advanced models. In the IPI setting, ICW methods provide effective detection capabilities that are not available to post-processing baselines.

The authors evaluate the detection performance of various In-Context Watermarking (ICW) methods using ROC-AUC and True Positive Rate (TPR) at specific False Positive Rate (FPR) levels. Results show that all tested ICW methods achieve high detection accuracy across all evaluated metrics. Unicode ICW and Acrostics ICW achieve perfect detection performance across all reported metrics. Initials ICW and Lexical ICW maintain very high detection rates even at low false positive rate thresholds. All evaluated ICW methods demonstrate near-optimal ability to distinguish between watermarked and non-watermarked text.

The authors evaluate the robustness of various in-context watermarking methods against text editing attacks using different language models. Results show that detection performance varies significantly depending on the capability of the underlying model and the specific type of text transformation applied. Detection performance for most in-context watermarking methods improves substantially when moving from a smaller model to a more capable one. The Unicode method maintains high detection performance under replacement and deletion attacks but shows lower effectiveness against paraphrasing. In contrast to the Unicode approach, other methods like Lexical and Acrostics ICW demonstrate higher resilience to paraphrasing attacks when used with advanced models.

The authors evaluate the detection performance of various In-Context Watermarking (ICW) methods using ROC-AUC and True Positive Rate (TPR) at specific False Positive Rate (FPR) levels. The results demonstrate that all tested ICW methods achieve high detection accuracy across all measured metrics. Unicode ICW and Acrostics ICW achieve perfect detection scores across all evaluated metrics Initials ICW and Lexical ICW maintain very high detection performance with minimal deviations All methods show strong effectiveness in distinguishing watermarked text at both low and moderate false positive rates

The authors compare the detection performance of several In-Context Watermarking (ICW) methods against the GPTZero baseline using the True Positive Rate at a fixed False Positive Rate. The results show that all tested ICW methods achieve higher detection performance than the post-hoc baseline. All ICW methods outperform the GPTZero baseline in detection performance. Unicode ICW and Acrostics ICW demonstrate perfect detection performance at the specified false positive rate. The Initials and Lexical ICW methods also show strong detection capabilities, exceeding the baseline performance.

The authors evaluate various In-Context Watermarking (ICW) methods across Direct Text Stamp and Indirect Prompt Injection settings, testing their detection accuracy, robustness against editing attacks, and performance relative to post-processing baselines. The results indicate that while all ICW methods generally outperform traditional baselines, their effectiveness is heavily influenced by the capability of the underlying language model. While Unicode ICW provides consistent and high performance across different models and attacks, other methods like Acrostics and Lexical ICW show increased resilience to paraphrasing when paired with more advanced models.


AI로 AI 구축

아이디어에서 출시까지 — 무료 AI 코코딩, 즉시 사용 가능한 환경, 최적의 GPU 가격으로 AI 개발을 가속화하세요.

AI 협업 코딩
바로 사용 가능한 GPU
최적의 가격

HyperAI Newsletters

최신 정보 구독하기
한국 시간 매주 월요일 오전 9시 에 이번 주의 최신 업데이트를 메일로 발송합니다
이메일 서비스 제공: MailChimp
Large Language Models를 위한 In-Context Watermarks | 문서 | HyperAI초신경