
UCR Researchers Develop Method to Keep AI Safe Even After Layer Removal

As generative AI models are increasingly deployed on edge devices such as smartphones and cars, they are often slimmed down to reduce power consumption and computational demands. This trimming, which removes internal layers of the model, can inadvertently strip out the safety mechanisms that prevent harmful outputs such as hate speech, violent instructions, or explicit content. Researchers at the University of California, Riverside, have developed a method to preserve AI safety even after key layers are removed during model optimization.

Their work, published on the arXiv preprint server, addresses a previously overlooked vulnerability known as Image Encoder Early Exit (ICET), in which a model's safety alignment degrades when intermediate layers of the image encoder are used in place of the full stack. The team found that the choice of exit layer significantly affects the safety of the model's responses, even for the same input image and prompt. The cause is that safety training is typically performed on the model's full architecture, so the resulting safeguards do not generalize when layers are removed or bypassed at deployment time.

To close this gap, the researchers introduced Layer-wise Clip-PPO (L-PPO), a retraining technique that reinforces the model's internal understanding of safety across all layers. Rather than relying on external filters or post-processing checks, L-PPO embeds safety into the model's core decision-making, so the model stays aligned with ethical guidelines regardless of its size or configuration. Testing on LLaVA 1.5, a vision-language model, the team showed that the original model could be tricked into generating dangerous content, such as bomb-making instructions, when a benign image was paired with a malicious prompt and early layer exits were used. After L-PPO retraining, the model consistently refused to respond to harmful queries, even when operating with only a fraction of its original layers.
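To make the early-exit mechanism concrete, the toy sketch below (all names hypothetical, not from the UCR paper) shows why exiting an image encoder at an intermediate layer changes what the downstream language model conditions on. A real encoder such as a CLIP vision transformer stacks dozens of blocks; safety tuning that only ever saw the final block's features may not carry over to intermediate ones.

```python
class ToyImageEncoder:
    """Stand-in for a transformer image encoder: each 'layer' refines features."""

    def __init__(self, num_layers=6):
        self.num_layers = num_layers

    def encode(self, pixels, exit_layer=None):
        # exit_layer=None means running the full stack, which is what
        # safety training typically assumes.
        depth = self.num_layers if exit_layer is None else exit_layer
        features = list(pixels)
        for _ in range(depth):
            # Toy "refinement": each layer averages neighboring values.
            features = [(a + b) / 2
                        for a, b in zip(features, features[1:] + features[:1])]
        return features


encoder = ToyImageEncoder(num_layers=6)
image = [1.0, 0.0, 0.0, 0.0]

full = encoder.encode(image)                 # features the safeguards were trained on
early = encoder.encode(image, exit_layer=2)  # what an early-exit deployment actually sees

print(full != early)  # prints True: the representations diverge
```

The divergence is the whole point: the language model downstream conditions on whichever features it receives, and safeguards fitted only to `full` can silently fail on `early`. The article's description of L-PPO suggests the fix is to make safety training layer-aware, i.e., to reinforce safe behavior for every exit depth rather than only the final one.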
“This isn’t about adding a safety layer on top,” said Saketh Bachu, a UCR graduate student and co-lead author. “We’re teaching the model to be safe by default, no matter how it’s modified.” The team describes the approach as “benevolent hacking”: proactively strengthening the model before it can be exploited. The goal is to keep AI safe and reliable across diverse real-world deployments, especially in open-source systems that lack centralized oversight.

The research team included Amit Roy-Chowdhury, professor of electrical and computer engineering; graduate students Erfan Shayegani, Arindam Dutta, Rohit Lal, and Trishna Chakraborty; and UCR faculty Chengyu Song, Yue Dong, and Nael Abu-Ghazaleh. Their findings were presented at the International Conference on Machine Learning in Vancouver, Canada. While challenges remain, Roy-Chowdhury emphasized that the work represents a critical step toward building open, accessible AI that is also trustworthy and responsible.
