GLM-5.2 Outperforms Claude Code on Semgrep’s IDOR Detection Benchmark
Semgrep’s latest security benchmark shows GLM-5.2, an open-weight model developed by Zhipu AI, outperforming Anthropic’s Claude Code on a vulnerability detection task. Released between June 13 and 16, 2026, the model achieved a 39 percent F1 score on Semgrep’s IDOR benchmark, compared with Claude Code’s 32 percent, at approximately one-sixth the cost per identified vulnerability. The findings point to a shifting landscape in AI-assisted cybersecurity, in which open-weight models are closing the performance gap with proprietary frontier models while offering economic and deployment advantages.

The evaluation focused on identifying Insecure Direct Object Reference (IDOR) flaws, a prevalent business-logic vulnerability in which an application exposes internal object identifiers without verifying that the requesting user is authorized to access the referenced data. Semgrep kept the dataset, evaluation methodology, and system prompt constant across all trials to isolate model capability from scaffolding effects. Open-weight models were tested in a minimal environment using only the source code and a standardized prompt, deliberately excluding the endpoint-discovery workflows and guided navigation that power Semgrep’s proprietary multimodal pipeline.

Despite this constrained setup, GLM-5.2 delivered the highest performance among prompt-only open models and finished seven percentage points ahead of Claude Code. The top-performing configurations remained Semgrep’s internal harnesses built on GPT-5.5 and Claude Opus 4.8, which scored 61 and 53 percent respectively, indicating that purpose-built harnesses significantly amplify baseline model accuracy.

GLM-5.2’s architecture suits security work. Built on a Mixture-of-Experts design, the model has 750 billion total parameters, of which only 40 billion are active per token, reducing inference costs while maintaining robust reasoning capabilities.
Its one-million-token context window enables extended analysis across complex codebases, a critical feature for tracing authorization logic across multiple files and repositories. Distributed under an MIT license, the model permits local deployment, fine-tuning, and inspection, which addresses the data sovereignty and compliance requirements of security teams in regulated environments. At its published pricing tier, GLM-5.2 costs roughly $0.17 per detected vulnerability, making large-scale scanning economically viable compared with premium API calls.

The benchmark results highlight a distinction between raw model capability and integrated tooling. Although GLM-5.2 was efficient and accurate on IDOR detection relative to the other prompt-only configurations, the gap between bare-prompt configurations and purpose-built harnesses remained the largest factor in the rankings. Semgrep researchers caution that the dataset is finite and that vulnerability detection is inherently non-deterministic, so results may vary across different flaw categories.

Nevertheless, the test marks a notable milestone for the open-weight ecosystem. Where open-weight alternatives once struggled to compete on specialized security workloads, GLM-5.2’s result suggests that carefully optimized architectures can rival closed systems on reasoning-heavy tasks. For engineering teams, the findings point to a pragmatic shift toward hybrid evaluation strategies that balance cost, transparency, and harness design, rather than relying exclusively on premium, vendor-locked models.
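
For readers unfamiliar with the flaw class the benchmark targets, the following minimal Python sketch shows what an IDOR looks like and how an ownership check closes it. It is illustrative only and is not drawn from Semgrep's dataset; the `Invoice` type, the `INVOICES` store, and both handler functions are hypothetical names invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Invoice:
    invoice_id: int
    owner_id: int
    total_cents: int


# Hypothetical in-memory store standing in for a database table.
INVOICES = {
    1: Invoice(invoice_id=1, owner_id=100, total_cents=2500),
    2: Invoice(invoice_id=2, owner_id=200, total_cents=9900),
}


class Forbidden(Exception):
    """Raised when the caller is not allowed to read the requested object."""


def get_invoice_vulnerable(current_user_id: int, invoice_id: int) -> Invoice:
    # IDOR: the lookup trusts the client-supplied identifier and never
    # checks that the invoice belongs to the authenticated caller.
    return INVOICES[invoice_id]


def get_invoice_checked(current_user_id: int, invoice_id: int) -> Invoice:
    invoice = INVOICES.get(invoice_id)
    # Treat "missing" and "not yours" identically so the response does not
    # reveal which identifiers exist.
    if invoice is None or invoice.owner_id != current_user_id:
        raise Forbidden(f"invoice {invoice_id} is not accessible")
    return invoice
```

In this sketch, user 100 can read invoice 2 through `get_invoice_vulnerable` even though it belongs to user 200, while `get_invoice_checked` raises `Forbidden` for the same request. In real codebases the authorization check often lives in a different file from the data access, which is why tracing logic across files matters for detecting this class of bug.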

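The scores reported above are F1 scores, the harmonic mean of precision (the share of reported findings that are real) and recall (the share of real vulnerabilities that were found). The sketch below shows the standard calculation. The counts are made up for illustration and are not Semgrep's data, and the function name `f1_score` is simply a label chosen for this example.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Return F1, the harmonic mean of precision and recall."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    # With no correct findings, precision and recall are both zero
    # (or undefined), so F1 is reported as 0.0 by convention.
    if true_positives == 0 or predicted == 0 or actual == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / actual
    return 2 * precision * recall / (precision + recall)


# Made-up counts for illustration only; NOT taken from the benchmark.
example = f1_score(true_positives=8, false_positives=2, false_negatives=4)
print(f"F1 = {example:.3f}")
```

Because F1 penalizes both false alarms and missed vulnerabilities, a model cannot raise its score simply by flagging every endpoint or by reporting only the most obvious cases.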