

A.S.E: A Repository-Level Benchmark for Evaluating Security in AI-Generated Code

Abstract

The increasing adoption of large language models (LLMs) in software engineering necessitates rigorous security evaluation of their generated code. However, existing benchmarks are inadequate, as they focus on isolated code snippets, employ unstable evaluation methods that lack reproducibility, and fail to connect the quality of input context with the security of the output. To address these gaps, we introduce A.S.E (AI Code Generation Security Evaluation), a benchmark for repository-level secure code generation. A.S.E constructs tasks from real-world repositories with documented CVEs, preserving full repository context like build systems and cross-file dependencies. Its reproducible, containerized evaluation framework uses expert-defined rules to provide stable, auditable assessments of security, build quality, and generation stability. Our evaluation of leading LLMs on A.S.E reveals three key findings: (1) Claude-3.7-Sonnet achieves the best overall performance. (2) The security gap between proprietary and open-source models is narrow; Qwen3-235B-A22B-Instruct attains the top security score. (3) Concise, "fast-thinking" decoding strategies consistently outperform complex, "slow-thinking" reasoning for security patching.
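The abstract does not disclose the harness implementation, but a containerized, rule-based evaluation loop of the kind described could look roughly like the minimal Python sketch below. Everything here is illustrative: the task schema (`EvalTask`), the Docker image layout, the `make build` command, and the single SQL-injection regex rule are all hypothetical stand-ins for A.S.E's expert-defined rules, not the authors' actual code.

```python
"""Illustrative sketch of a containerized, rule-based code-security check.

Assumptions (not from the paper): Docker is installed, each task ships as a
container image containing the full repository, and the build is driven by
`make build`. All names below are hypothetical.
"""
import re
import subprocess
from dataclasses import dataclass


@dataclass
class EvalTask:
    repo_image: str  # container image preserving full repository context
    cve_id: str      # the documented CVE this task is constructed from


# Example of an expert-defined rule: flag SQL queries assembled by string
# concatenation, a common injection pattern. A real rule set would be far
# richer and tied to the specific CVE.
SQLI_RULE = re.compile(r"""execute\(\s*["'].*["']\s*\+""")


def build_ok(task: EvalTask) -> bool:
    """Rebuild the patched repository inside its container for a
    reproducible, auditable build-quality signal."""
    result = subprocess.run(
        ["docker", "run", "--rm", task.repo_image, "make", "build"],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0


def secure(generated_code: str) -> bool:
    """Apply the rule-based security check to the model-generated patch."""
    return SQLI_RULE.search(generated_code) is None


def score(task: EvalTask, generated_code: str) -> dict:
    """Combine the per-task signals; repeating this over multiple samples
    would yield a generation-stability estimate."""
    return {"build": build_ok(task), "security": secure(generated_code)}
```

Because both checks are deterministic (a fixed rule set and a pinned container image), repeated runs produce identical verdicts, which is the property the abstract highlights over unstable, judge-based evaluation.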
