HyperAI

Semgrep’s latest security benchmark reveals that GLM-5.2, an open-weight model developed by Zhipu AI, outperformed Anthropic’s Claude Code on a rigorous vulnerability detection task. Released between June 13 and 16, 2026, the model achieved a 39 percent F1 score on Semgrep’s IDOR benchmark, surpassing Claude Code’s 32 percent while costing approximately one-sixth as much per identified vulnerability. The findings underscore a shifting landscape in AI-assisted cybersecurity, where open-weight architectures are closing the performance gap with proprietary frontier models while offering significant economic and deployment advantages.

The evaluation focused on identifying Insecure Direct Object Reference flaws, a prevalent business-logic vulnerability where applications fail to validate user authorization before exposing internal data identifiers. Semgrep maintained a consistent dataset, evaluation methodology, and system prompt across all trials to isolate model capability from scaffolding effects. Open-weight models were tested in a minimal environment using only the source code and a standardized prompt, deliberately excluding the endpoint-discovery workflows and guided navigation that power Semgrep’s proprietary multimodal pipeline. Despite this constrained setup, GLM-5.2 delivered the highest performance among prompt-only open models, narrowly edging out the closed-source competitor. The top-performing configurations remained Semgrep’s internal harnesses leveraging GPT-5.5 and Claude Opus 4.8, which scored 61 and 53 percent respectively, confirming that specialized evaluation frameworks significantly amplify baseline model accuracy.

GLM-5.2’s architectural design supports its security applications. Built on a Mixture-of-Experts framework, the model utilizes 750 billion total parameters with only 40 billion active per token, reducing inference costs while maintaining robust reasoning capabilities. Its one-million-token context window enables extended analysis across complex codebases, a critical feature for tracing authorization logic across multiple files and repositories. Distributed under an MIT license, the model permits unrestricted local deployment, fine-tuning, and inspection, addressing the data sovereignty and compliance requirements of security teams operating in regulated environments. At its published pricing tier, GLM-5.2 costs roughly $0.17 per detected vulnerability, making large-scale scanning economically viable compared to premium API calls.

The benchmark results highlight a critical distinction between raw model capability and integrated tooling. While GLM-5.2 demonstrated remarkable efficiency and accuracy on IDOR detection, the performance delta between bare-prompt configurations and purpose-built harnesses remained the most significant factor in the rankings. Semgrep researchers caution that the dataset is finite and vulnerability detection carries inherent non-determinism, meaning results may vary across different flaw categories.

Nevertheless, the test marks a notable milestone for the open-weights ecosystem. Where open-source alternatives once struggled to compete on specialized security workloads, GLM-5.2 proves that carefully optimized architectures can rival closed systems on reasoning-heavy tasks. For engineering teams, the findings suggest a pragmatic shift toward hybrid evaluation strategies, balancing cost, transparency, and harness design to optimize AI-driven vulnerability discovery without relying exclusively on premium, vendor-locked models.
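For readers unfamiliar with the vulnerability class at the center of the benchmark, the following minimal Python sketch illustrates what an Insecure Direct Object Reference can look like in application code. It is an illustration only and is not drawn from Semgrep’s dataset: the `Invoice` record, the `INVOICES` store, and both handler functions are hypothetical names invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Invoice:
    invoice_id: int
    owner_id: int
    amount_cents: int


# Hypothetical in-memory store standing in for a database table.
INVOICES = {
    1001: Invoice(invoice_id=1001, owner_id=1, amount_cents=4200),
    1002: Invoice(invoice_id=1002, owner_id=2, amount_cents=9900),
}


def get_invoice_vulnerable(current_user_id: int, invoice_id: int) -> Invoice:
    """IDOR: trusts the client-supplied identifier and never checks ownership."""
    return INVOICES[invoice_id]


def get_invoice_checked(current_user_id: int, invoice_id: int) -> Invoice:
    """Looks up the record, then verifies the caller owns it before returning."""
    invoice = INVOICES.get(invoice_id)
    # Treat "missing" and "not yours" identically so IDs cannot be enumerated.
    if invoice is None or invoice.owner_id != current_user_id:
        raise PermissionError("invoice not found or not accessible")
    return invoice
```

In this sketch, user 1 can read user 2’s invoice through `get_invoice_vulnerable(1, 1002)`, while `get_invoice_checked(1, 1002)` raises `PermissionError`. Both functions are syntactically unremarkable; the flaw is a missing ownership check, which is why the article describes IDOR as a business-logic vulnerability that requires reasoning about authorization rather than simple pattern matching.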

Related Links

Meta Proposes AI Data Scientists, and Autodata Builds high-quality training/evaluation datasets.

GLM 5.2 Outperforms Claude in Independent Cybersecurity Benchmarks
