Researchers Develop New Method to Test and Strengthen AI Safety by Probing Internal Systems

A new study led by University of Florida Professor Sumit Kumar Jha of the Computer & Information Science & Engineering department presents a new approach to testing and improving the safety of artificial intelligence systems. Titled "Jailbreaking the Matrix: Nullspace Steering for Controlled Model Subversion," the research introduces a method called Head-Masked Nullspace Steering, or HMNS, that probes AI models from within to uncover vulnerabilities in their internal safety mechanisms.

Rather than relying solely on deceptive user prompts, the usual route in so-called "jailbreaking" attempts, HMNS examines the inner decision pathways of large language models (LLMs). The method identifies the most influential components, or "heads," in the model’s neural network, selectively disables them, and observes how the system responds. This allows researchers to see exactly how safety guardrails fail under pressure and where defenses are weakest.

The research, accepted for presentation at the 2026 International Conference on Learning Representations in Rio de Janeiro, used UF’s HiPerGator supercomputer to perform the computations required for this kind of internal analysis. The team, including Ph.D. student Vishal Pramanik, Maisha Maliha from the University of Oklahoma, and Susmit Jha from SRI International, focused on models from Meta and Microsoft, two of the leading providers of public AI tools.

HMNS outperformed existing attack methods across four major industry benchmarks. It bypassed safety controls more frequently and with fewer attempts than previous techniques. The method is also computationally efficient, using less processing power than its rivals, which makes it a practical tool for evaluating defenses. To ensure fair comparisons, the researchers introduced a new metric called compute-aware reporting, which accounts for the amount of computational resources used in each attack. This approach highlights not just how effective an attack is, but also how efficiently it operates.

Professor Jha stresses that the goal is not to enable misuse but to strengthen AI safety. “By showing exactly how these defenses break, we give developers the information they need to build systems that actually hold up,” he said. “AI is no longer just a tool; it’s becoming infrastructure in healthcare, finance, and everyday software. You can’t just test it with surface-level prompts and assume it’s safe.”

The findings underscore a critical gap in current AI safety practices. While companies have implemented multiple layers of protection, these can be systematically bypassed when attackers understand the model’s internal structure. HMNS provides a way to stress-test those safeguards before they’re deployed in real-world applications.

Ultimately, the research aims to shift AI safety from reactive fixes to proactive, rigorous testing. By "popping the hood" and examining the model’s inner workings, the team hopes to help developers build more resilient systems, so that AI remains trustworthy as it becomes more deeply embedded in society.
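
For readers who want a concrete picture of what "disabling a head and observing the response" can mean, the toy sketch below masks one attention head at a time in a small, randomly initialized attention layer and ranks the heads by how much the output changes. It is an illustration of generic head ablation only, not the HMNS procedure from the paper: it does not include the nullspace steering step, and every name, shape, and weight here (`multi_head_attention`, `rank_heads_by_ablation`, the random matrices) is a hypothetical stand-in rather than anything taken from the study.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_attention(x, wq, wk, wv, wo, head_mask):
    """Toy self-attention layer with a per-head on/off mask.

    x: (seq, d_model); wq, wk, wv: (heads, d_model, d_head);
    wo: (heads, d_head, d_model); head_mask: (heads,) of 0.0 or 1.0.
    """
    out = np.zeros_like(x)
    for h in range(wq.shape[0]):
        q, k, v = x @ wq[h], x @ wk[h], x @ wv[h]
        attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
        # A mask value of 0.0 removes this head's contribution entirely.
        out += head_mask[h] * (attn @ v @ wo[h])
    return out


def rank_heads_by_ablation(x, wq, wk, wv, wo):
    """Score each head by how far the output moves when that head is masked."""
    n_heads = wq.shape[0]
    full = multi_head_attention(x, wq, wk, wv, wo, np.ones(n_heads))
    scores = []
    for h in range(n_heads):
        mask = np.ones(n_heads)
        mask[h] = 0.0
        ablated = multi_head_attention(x, wq, wk, wv, wo, mask)
        scores.append(float(np.linalg.norm(full - ablated)))
    order = sorted(range(n_heads), key=lambda h: scores[h], reverse=True)
    return order, scores


# Hypothetical toy dimensions and random weights (assumptions, not from the paper).
rng = np.random.default_rng(0)
seq, d_model, n_heads, d_head = 5, 16, 4, 4
x = rng.normal(size=(seq, d_model))
wq, wk, wv = (rng.normal(size=(n_heads, d_model, d_head)) for _ in range(3))
wo = rng.normal(size=(n_heads, d_head, d_model))

order, scores = rank_heads_by_ablation(x, wq, wk, wv, wo)
print("heads ranked by ablation effect:", order)
```

In a real model the "response" would be measured on something behavioral, such as whether a refusal still occurs, rather than on the raw change in a single layer's output; the norm used here is simply the easiest quantity to compute in a self-contained example.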
