UC Riverside Researchers Develop Method for AI to Forget Private and Copyrighted Data Without Original Training Data
A team of computer scientists at the University of California, Riverside has developed a groundbreaking method that allows artificial intelligence models to forget specific private or copyrighted data without requiring access to the original training data. This innovation addresses growing concerns about the permanent retention of sensitive or protected content within AI systems, even after creators have attempted to remove or restrict it.

The research, presented in July at the International Conference on Machine Learning in Vancouver and published on the arXiv preprint server, introduces a technique called "source-free certified unlearning." It enables AI developers to erase targeted information from models while preserving their overall performance and functionality. Unlike traditional methods that require retraining the model from scratch on the full original dataset, a process that is computationally expensive and energy-intensive, the new approach works even when the original data is no longer available.

Ümit Yiğit Başaran, a doctoral student in electrical and computer engineering and the lead author, emphasized the practical challenges of accessing old training data. "In real-world situations, you can't always go back and get the original data," he said. "We've created a certified framework that works even when that data is no longer available."

The method uses a surrogate dataset, one that is statistically similar to the original but not identical, to simulate the impact of retraining. By adjusting model parameters and applying carefully calibrated random noise, the system ensures that the targeted data is effectively erased and cannot be reconstructed. The team's approach includes a novel noise-calibration mechanism that accounts for differences between the original and surrogate data, improving the reliability of the forgetting process.
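To make the idea concrete, the forgetting step described above (a parameter adjustment estimated from surrogate data, followed by calibrated noise) can be sketched for a simple L2-regularized logistic-regression model. Everything below is an illustrative assumption rather than the authors' actual algorithm: the function names, the Newton-style influence update, and the fixed `noise_scale` are chosen for readability, and the sketch omits the formal certification guarantees that distinguish the published method.

```python
import numpy as np

def unlearn_sketch(w, X_forget, y_forget, X_surrogate, y_surrogate,
                   lam=0.1, noise_scale=0.0, seed=0):
    """One-step approximate 'forgetting' for L2-regularized logistic
    regression: undo the forget set's influence using a Hessian
    estimated on surrogate data, then add Gaussian noise.
    Hypothetical sketch, not the certified algorithm from the paper."""
    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def grad(w, X, y):
        # Average gradient of the regularized logistic loss.
        return X.T @ (sigmoid(X @ w) - y) / len(y) + lam * w

    # Hessian of the loss, estimated on the surrogate data because the
    # original training data is assumed to be unavailable.
    p = sigmoid(X_surrogate @ w)
    H = (X_surrogate.T * (p * (1 - p))) @ X_surrogate / len(X_surrogate)
    H += lam * np.eye(len(w))

    # Newton-style influence update: move the parameters toward what a
    # model retrained without the forget set would look like.
    m, n = len(y_forget), len(y_surrogate)
    w_new = w + (m / n) * np.linalg.solve(H, grad(w, X_forget, y_forget))

    # Calibrated Gaussian noise masks any residual influence; in the
    # paper's method, the noise scale also accounts for the statistical
    # gap between the surrogate and the original data.
    rng = np.random.default_rng(seed)
    return w_new + rng.normal(0.0, noise_scale, size=w.shape)
```

The key "source-free" ingredient in this sketch is that the Hessian, which determines how far the parameters move, is computed entirely from the surrogate dataset; the forget set itself is only needed for its gradient, and the original training data never appears.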
Testing on both synthetic and real-world datasets showed that the method delivers privacy guarantees nearly as strong as full retraining, at a fraction of the computational cost. This makes it a viable option for organizations seeking to comply with privacy regulations such as the European Union's General Data Protection Regulation and the California Consumer Privacy Act.

The technique is particularly relevant in light of ongoing legal challenges. The New York Times, for example, has sued OpenAI and Microsoft over the use of its copyrighted articles to train the GPT series of models. AI models can generate responses that closely resemble the original text, allowing users to bypass paywalls and access content without authorization.

The UCR team, comprising Başaran, professor Amit Roy-Chowdhury, and assistant professor Başak Güler, developed the method with support from Sk Miraj Ahmed, a computational science research associate at Brookhaven National Laboratory who earned his doctorate at UCR. Roy-Chowdhury, co-director of UCR's Riverside Artificial Intelligence Research and Education (RAISE) Institute, noted that while the current work applies to the simpler models in wide use today, future work aims to extend the technique to more complex systems such as ChatGPT.

Beyond regulatory compliance, the technology could help media companies, healthcare providers, and other institutions protect sensitive data embedded in AI systems. It also gives individuals a practical way to request the removal of personal information from AI models. "People deserve to know their data can be erased from machine learning models—not just in theory, but in provable, practical ways," Güler said.

The researchers plan to refine the method for larger models and to develop user-friendly tools that help AI developers put the technique into practice. The paper is titled "A Certified Unlearning Approach without Access to Source Data."
