HyperAI

XBOW Boosts Vulnerability Detection with Model Alloys, Achieving 68.8% Success Rate

4 days ago

This spring, XBOW, an autonomous penetration testing company, introduced a novel approach that significantly boosted the performance of its vulnerability detection agents. The company's agents autonomously test websites for security vulnerabilities and report back when they find exploitable weaknesses. Initially, success rates on fixed benchmarks with a constrained number of iterations were around 25%. After implementing the new idea, those rates climbed to 40%, and then to 55%.

The Challenge

Pentesting involves a multitude of tasks, including discovering a website's technology stack, understanding its logic, and identifying potential attack surfaces. These tasks require systematic probing and continuous refinement of the agent's mental model. The goal is to demonstrate specific types of vulnerabilities, much like solving a Capture the Flag (CTF) challenge in which the agent must exploit a vulnerability at a known location.

The Solver Agent

Each solver agent operates in an agentic loop of 80 iterations. In each iteration, the agent decides on an action, such as executing a command in a terminal, writing a Python script, or using a pentesting tool. The action is vetted and executed, the results are fed back to the agent, and the process repeats. This setup allows the agent to explore various paths, discarding false leads and course-correcting as needed.

The Role of Large Language Models (LLMs)

XBOW employs LLMs from different providers to power its agents. Initially, OpenAI's GPT-4 was the best model, but as newer models were released, Anthropic's Sonnet 3.5 took the lead, followed by improvements with Sonnet 3.7 and Sonnet 4.0. Google's Gemini 2.5 Pro also showed significant promise. Each model has unique strengths, but no single model excels in all scenarios.

Alloyed Agents

To leverage the diverse strengths of different models, XBOW developed the concept of "alloyed agents."
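The decide-vet-execute-observe cycle described above can be sketched roughly as follows. This is a minimal illustration only, assuming hypothetical `llm` and `execute` callables and a placeholder `is_safe` vetting policy; XBOW's actual harness is not public.

```python
# Minimal sketch of a solver-style agentic loop. The `llm` client, `execute`
# sandbox, and action format are all illustrative assumptions.

MAX_ITERATIONS = 80  # the fixed iteration budget mentioned in the article

def is_safe(action):
    # Placeholder vetting policy: allow only tools on a small allowlist.
    return action.get("tool") in {"terminal", "python", "pentest_tool"}

def solve(task_description, llm, execute):
    """Run a decide -> vet -> execute -> observe loop until done or budget spent."""
    history = [{"role": "user", "content": task_description}]
    for _ in range(MAX_ITERATIONS):
        action = llm(history)              # model proposes the next action
        if action.get("done"):             # e.g. it demonstrated a vulnerability
            return action
        if is_safe(action):                # vet the action before running it
            result = execute(action)       # run command / script / pentest tool
        else:
            result = "action rejected by policy"
        # feed the action and its result back into the conversation
        history.append({"role": "assistant", "content": str(action)})
        history.append({"role": "user", "content": str(result)})
    return None                            # budget exhausted without success
```

The essential property is that each iteration's result is appended to the history, so the model can discard false leads and course-correct on subsequent turns.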
Instead of using a single model throughout the entire process, the agent alternates between multiple models within the same chat thread. At any given step, the prompt and previous actions can come from different models, creating a blended conversational history. For example, the interaction might look like this:

- Sonnet suggests using curl to probe the app.
- The agent executes the command and gets a 401 Unauthorized response.
- Gemini receives the 401 response and suggests logging in with admin credentials.
- Sonnet sees the successful 200 OK response and continues with the next step.

By interchanging models, each can contribute its unique insights and strengths, leading to a more robust and efficient problem-solving process.

Results

The alloyed agents consistently outperformed individual models. For instance, combining Gemini 2.5 and Sonnet 4.0 increased the success rate from 46.4% and 57.5% individually to 68.8% when alloyed. The key findings include:

- Increased performance: the combined models achieved higher success rates than either model alone.
- Efficiency: the total number of model calls remained the same, but the quality and variety of actions improved.
- Variation strategies: XBOW implemented random model selection for greater diversity, though more complex strategies could be explored.

When to Use Model Alloys

Model alloys are particularly beneficial in scenarios where:

- Multiple brilliant ideas are needed, interspersed with routine follow-up actions.
- Different models have complementary strengths.
- Overhead must be minimized and the models need to work seamlessly together.

When Not to Use Model Alloys

However, model alloys have limitations:

- Similar models: combining models from the same provider (e.g., Sonnet 3.7 and Sonnet 4.0) does not yield significant performance gains, because their approaches are too alike.
- Complex coordination: tasks requiring detailed coordination or debate between models may not benefit from the alloy approach.
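The alloy mechanism itself is small: all models share one conversation thread, and the model for each step is picked at random, which is the random-selection strategy the article describes. Below is a hedged sketch; `models` maps a name to any callable that reads the shared history and returns the next action, and real LLM clients are assumed, not shown.

```python
import random

def alloyed_step(history, models, rng=random):
    """Pick a model at random and let it act on the shared, blended history.

    `models` is a dict like {"gemini-2.5": client_a, "sonnet-4.0": client_b}
    (names and clients are illustrative). Every model sees the full history,
    including turns produced by the other models.
    """
    name = rng.choice(sorted(models))   # random model selection per step
    action = models[name](history)      # chosen model proposes the next action
    history.append((name, action))      # its turn joins the shared thread
    return name, action
```

Because every model reads the same history, a probe suggested by one model and the 401 response it produced are visible to the other model on its turn, which is exactly what yields the blended conversational history described above, with no extra model calls compared to a single-model agent.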
Industry Insights and Company Profile

Industry experts praise XBOW's innovative approach, noting that it effectively addresses the limitations of single-model systems. By combining the strengths of different LLMs, XBOW has set a new standard for autonomous pentesting.

XBOW has been at the forefront of AI-driven cybersecurity solutions, continuously pushing the boundaries of what autonomous agents can achieve. Its commitment to remaining model-provider agnostic ensures it can adapt and improve its services as new technologies emerge.

If you're facing a similar agentic AI challenge or need to improve the efficiency of your models, consider the alloy approach. XBOW invites the broader AI community to explore their data and share experiences at [email protected]. This innovation could pave the way for more effective and versatile AI applications in various domains.
