SWE-bench Verified Code Generation Evaluation Benchmark Dataset
Dataset Introduction
SWE-bench Verified is a human-validated subset of the existing SWE-bench benchmark, designed to more reliably evaluate the ability of AI models to solve real-world software issues.
To improve the robustness and reliability of SWE-bench, OpenAI ran a human annotation campaign in which professional software developers screened every sample in the SWE-bench test set, checking that the scope of the unit tests is appropriate and that the problem description is clear and unambiguous.
Together with the authors of SWE-bench, OpenAI released SWE-bench Verified: a subset of the original SWE-bench test set containing 500 samples verified by human annotators. It is intended to replace the original SWE-bench and SWE-bench Lite test sets.
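As a concrete illustration (a minimal sketch, not part of the official release), the verified split can be loaded with the Hugging Face `datasets` library; the dataset id `princeton-nlp/SWE-bench_Verified` and the standard SWE-bench field names used below are assumptions about the hosted copy:

```python
# Minimal sketch: load SWE-bench Verified and inspect one task.
# Assumes the Hugging Face copy "princeton-nlp/SWE-bench_Verified"
# and the standard SWE-bench schema for field names.
from datasets import load_dataset

ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
print(len(ds))  # expected: 500 human-verified samples

sample = ds[0]
print(sample["instance_id"])               # repository/issue identifier
print(sample["repo"])                      # source GitHub repository
print(sample["problem_statement"][:300])   # issue text the model must resolve
print(sample["FAIL_TO_PASS"])              # tests that must newly pass after the fix
print(sample["PASS_TO_PASS"])              # tests that must keep passing
```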
On SWE-bench Verified, GPT-4o with the best-performing open-source scaffold, Agentless, resolves 33.2% of samples, more than double its score of 16% on the original SWE-bench.
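For reference, a sketch of how a "resolved" percentage of this kind is computed in SWE-bench-style evaluation: a task counts as resolved only if, after the model's patch is applied, every FAIL_TO_PASS test passes and every PASS_TO_PASS test still passes. The helper below and its per-instance result format are illustrative assumptions, not the official evaluation harness:

```python
# Illustrative sketch of the SWE-bench "resolved" criterion; the
# per-instance result dicts are hypothetical, not the harness output format.
from typing import Dict, List


def is_resolved(fail_to_pass: Dict[str, bool], pass_to_pass: Dict[str, bool]) -> bool:
    """Resolved only if all FAIL_TO_PASS tests now pass and all
    PASS_TO_PASS tests still pass after applying the patch."""
    return all(fail_to_pass.values()) and all(pass_to_pass.values())


def resolved_rate(results: List[dict]) -> float:
    """Fraction of resolved tasks, reported as a percentage."""
    resolved = sum(
        is_resolved(r["fail_to_pass"], r["pass_to_pass"]) for r in results
    )
    return 100.0 * resolved / len(results)


# Example: 166 resolved out of 500 verified tasks ≈ 33.2%.
print(resolved_rate(
    [{"fail_to_pass": {"t": True}, "pass_to_pass": {"u": True}}] * 166
    + [{"fail_to_pass": {"t": False}, "pass_to_pass": {"u": True}}] * 334
))
```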