SWE-bench Verified Code Generation Evaluation Benchmark Dataset
Dataset Introduction
SWE-bench Verified is an improved subset of the original SWE-bench, designed to more reliably evaluate the ability of AI models to solve real-world software problems.
To improve the robustness and reliability of SWE-bench, OpenAI ran a manual annotation campaign in which professional software developers screened every sample in the SWE-bench test set to ensure that the unit tests are appropriately scoped and that the issue description is clear and unambiguous.
Together with the authors of SWE-bench, OpenAI released SWE-bench Verified: a subset of the original SWE-bench test set containing 500 samples that have been verified by human annotators. It replaces the original SWE-bench and SWE-bench Lite test sets.
On SWE-bench Verified, GPT-4o with the best-performing open-source scaffold, Agentless, solves 33.2% of samples, roughly double its 16% score on the original SWE-bench.
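For orientation, below is a minimal sketch of how the dataset could be loaded and inspected with the Hugging Face datasets library. The dataset identifier and field names are assumptions based on the original SWE-bench schema and should be checked against the published dataset page.

```python
# Minimal sketch: load SWE-bench Verified from the Hugging Face Hub.
# The identifier "princeton-nlp/SWE-bench_Verified" and the field names
# below are assumptions following the original SWE-bench schema.
from datasets import load_dataset

ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")

print(len(ds))  # expected: 500 human-verified samples

sample = ds[0]
# Typical SWE-bench fields: the source repository, the GitHub issue text,
# and the gold patch used to judge whether a model's fix is correct.
print(sample["repo"])
print(sample["problem_statement"][:200])
```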