5 months ago

Zhaoyang Chu Yao Wan Zhikun Zhang Di Wang Zhou Yang Hongyu Zhang Pan Zhou Xuanhua Shi Hai Jin David Lo

Abstract

While Code Language Models (CLMs) have demonstrated superior performance in software engineering tasks such as code generation and summarization, recent empirical studies reveal a critical privacy vulnerability: these models exhibit unintended memorization of sensitive training data, enabling verbatim reproduction of confidential information when specifically prompted. To address this issue, several approaches, including training data de-duplication and differential privacy augmentation, have been proposed. However, these methods require full-model retraining for deployed CLMs, which incurs substantial computational costs. In this paper, we aim to answer the following research question: Can sensitive information memorized by CLMs be erased effectively and efficiently? We conduct a pioneering investigation into erasing sensitive memorization in CLMs through machine unlearning - a post-hoc modification method that removes specific information from trained models without requiring full retraining. Specifically, we first quantify the memorization risks of sensitive data within CLM training datasets and curate a high-risk dataset of 50,000 sensitive memorized samples as unlearning targets. We study two widely used gradient ascent-based unlearning approaches: the vanilla and constraint-based methods, and introduce CodeEraser, an advanced variant that selectively unlearns sensitive memorized segments in code while preserving the structural integrity and functional correctness of the surrounding code. Extensive experiments on three families of CLMs, i.e., CodeParrot, CodeGen-Mono, and Qwen2.5-Coder, validate the effectiveness and efficiency of CodeEraser in erasing targeted sensitive memorization while maintaining model utility.
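
The idea of selective, gradient ascent-based unlearning described in the abstract can be illustrated with a toy sketch. This is not the paper's CodeEraser implementation: the per-position logit table standing in for a CLM, the function names, the `sensitive_mask`, the learning rate `lr`, and the weight `lam` are all assumptions made for illustration. The sketch ascends the negative log-likelihood on tokens marked sensitive and descends it on the surrounding tokens, so the sensitive token becomes less likely while the rest of the sequence is preserved.

```python
import math

def softmax(logits):
    # Numerically stable softmax over one row of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def selective_unlearning_step(logits, targets, sensitive_mask, lr=0.5, lam=1.0):
    """One update on a toy per-position logit table (hypothetical stand-in for a CLM)."""
    new_logits = []
    for row, target, is_sensitive in zip(logits, targets, sensitive_mask):
        probs = softmax(row)
        # Gradient of this token's negative log-likelihood w.r.t. its logits.
        grad = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]
        # Ascend the loss on sensitive tokens; descend it (scaled by lam) elsewhere.
        sign = 1.0 if is_sensitive else -lam
        new_logits.append([x + sign * lr * g for x, g in zip(row, grad)])
    return new_logits

def target_probs(logits, targets):
    # Probability the toy model assigns to each ground-truth token.
    return [softmax(row)[t] for row, t in zip(logits, targets)]

# Toy sequence: 4 tokens over a 5-symbol vocabulary; position 2 plays the secret.
targets = [0, 1, 2, 3]
sensitive_mask = [False, False, True, False]
logits = [[3.0 if i == t else 0.0 for i in range(5)] for t in targets]

before = target_probs(logits, targets)
for _ in range(20):
    logits = selective_unlearning_step(logits, targets, sensitive_mask)
after = target_probs(logits, targets)

print("sensitive token prob:", round(before[2], 3), "->", round(after[2], 3))
print("context token prob:  ", round(before[0], 3), "->", round(after[0], 3))
```

In a real setting the same masking idea would presumably be applied to the per-token loss of an actual model, with the mask derived from detected secrets such as keys or passwords; how the paper constructs and weights these terms is not stated in the abstract.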

Scrub It Out! Erasing Sensitive Memorization in Code Language Models via Machine Unlearning
