2 months ago

Scrub It Out! Erasing Sensitive Memorization in Code Language Models via Machine Unlearning

Zhaoyang Chu Yao Wan Zhikun Zhang Di Wang Zhou Yang Hongyu Zhang Pan Zhou Xuanhua Shi Hai Jin David Lo


Abstract

While Code Language Models (CLMs) have demonstrated superior performance in software engineering tasks such as code generation and summarization, recent empirical studies reveal a critical privacy vulnerability: these models exhibit unintended memorization of sensitive training data, enabling verbatim reproduction of confidential information when specifically prompted. To address this issue, several approaches, including training data de-duplication and differential privacy augmentation, have been proposed. However, these methods require full-model retraining for deployed CLMs, which incurs substantial computational costs. In this paper, we aim to answer the following research question: Can sensitive information memorized by CLMs be erased effectively and efficiently? We conduct a pioneering investigation into erasing sensitive memorization in CLMs through machine unlearning, a post-hoc modification method that removes specific information from trained models without requiring full retraining. Specifically, we first quantify the memorization risks of sensitive data within CLM training datasets and curate a high-risk dataset of 50,000 sensitive memorized samples as unlearning targets. We study two widely used gradient ascent-based unlearning approaches, the vanilla and constraint-based methods, and introduce CodeEraser, an advanced variant that selectively unlearns sensitive memorized segments in code while preserving the structural integrity and functional correctness of the surrounding code. Extensive experiments on three families of CLMs, i.e., CodeParrot, CodeGen-Mono, and Qwen2.5-Coder, validate the effectiveness and efficiency of CodeEraser in erasing targeted sensitive memorization while maintaining model utility.
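
To make the idea of selective, gradient ascent-based unlearning more concrete, here is a minimal toy sketch in plain Python. It is not the paper's implementation: the abstract does not give CodeEraser's objective, so the bigram "model", the `selective_unlearn_step` function, the per-token `sensitive_mask`, the learning rate, and the four-token vocabulary are all assumptions chosen for illustration. The sketch only shows the general pattern the abstract describes: raise the loss (gradient ascent) on tokens marked sensitive, and keep lowering it (ordinary gradient descent) on the surrounding tokens.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_prob(weights, context, token):
    """Probability the toy bigram model assigns to `token` after `context`."""
    return softmax(weights[context])[token]


def selective_unlearn_step(weights, tokens, sensitive_mask, lr=0.5):
    """One in-place update of a toy bigram model (hypothetical helper).

    weights[c] holds the logits for the next token given previous token c.
    For each position, the gradient of the negative log-likelihood with
    respect to that logit row is softmax(row) - one_hot(target).
    Sensitive positions take a gradient ASCENT step on the loss (forget);
    all other positions take a normal gradient DESCENT step (preserve).
    """
    for i in range(1, len(tokens)):
        context, target = tokens[i - 1], tokens[i]
        probs = softmax(weights[context])
        grad = [p - (1.0 if j == target else 0.0) for j, p in enumerate(probs)]
        sign = 1.0 if sensitive_mask[i] else -1.0
        weights[context] = [w + sign * lr * g
                            for w, g in zip(weights[context], grad)]


# Assumed toy vocabulary: 0 = "api_key", 1 = "=", 2 = "<SECRET>", 3 = "<OTHER>"
API_KEY, EQUALS, SECRET, OTHER = 0, 1, 2, 3

# Start from a model that has "memorized" the sequence: api_key = <SECRET>
weights = [
    [0.0, 3.0, 0.0, 0.0],  # after "api_key", "=" is strongly preferred
    [0.0, 0.0, 3.0, 0.0],  # after "=", the secret is strongly preferred
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
tokens = [API_KEY, EQUALS, SECRET]
sensitive_mask = [False, False, True]  # only the secret value is a target

secret_before = token_prob(weights, EQUALS, SECRET)
benign_before = token_prob(weights, API_KEY, EQUALS)

for _ in range(20):
    selective_unlearn_step(weights, tokens, sensitive_mask)

secret_after = token_prob(weights, EQUALS, SECRET)
benign_after = token_prob(weights, API_KEY, EQUALS)

print(f"P(<SECRET> | '=')  before={secret_before:.3f} after={secret_after:.3f}")
print(f"P('=' | 'api_key') before={benign_before:.3f} after={benign_after:.3f}")
```

In this toy setting the probability of the secret token falls while the probability of the benign `=` token does not. With a real CLM the same pattern would typically be expressed as a per-token mask over the language-modeling loss, and unbounded ascent can degrade the model, which is presumably why the abstract also mentions constraint-based variants.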
