
SWE-bench Verified Code Generation Evaluation Benchmark Dataset

Date

10 months ago

Size

1.65 MB

Organization

OpenAI
Stanford University

Publish URL

huggingface.co

* This dataset supports online use.

Dataset Introduction

SWE-bench Verified is an improved subset of the existing SWE-bench benchmark, designed to more reliably evaluate the ability of AI models to solve real-world software issues.

To improve the robustness and reliability of SWE-bench, OpenAI ran a human annotation campaign in which professional software developers screened every sample in the SWE-bench test set, checking that each sample's unit tests are appropriately scoped and that its issue description is clear and unambiguous.

Together with the authors of SWE-bench, OpenAI released SWE-bench Verified: a 500-sample subset of the original SWE-bench test set in which every sample has been verified by human annotators. It replaces the original SWE-bench and SWE-bench Lite test sets.

On SWE-bench Verified, GPT-4o resolves 33.2% of samples with the best-performing open-source scaffold, Agentless, doubling its score of 16% on the original SWE-bench.
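
For quick experimentation, the dataset can also be pulled directly from Hugging Face. The snippet below is a minimal sketch using the `datasets` library; it assumes the dataset ID `princeton-nlp/SWE-bench_Verified` and the standard SWE-bench field names (`instance_id`, `problem_statement`), which are not stated on this page and should be confirmed against the dataset card.

```python
# Minimal sketch: load SWE-bench Verified from Hugging Face.
# Assumptions (verify against the dataset card): dataset ID
# "princeton-nlp/SWE-bench_Verified", a single "test" split,
# and standard SWE-bench columns such as "instance_id" and
# "problem_statement".
from datasets import load_dataset

ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")

print(len(ds))  # expected: 500 human-validated samples

sample = ds[0]
print(sample["instance_id"])              # e.g. a repo__issue identifier
print(sample["problem_statement"][:200])  # the GitHub issue text
```

Because Verified is a filtered subset of the original test set, it ships only a single `test` split; there is no separate train or dev portion.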

SWE-bench_Verified.torrent
Seeding 2 · Downloading 0 · Completed 125 · Total Downloads 126
  • SWE-bench_Verified/
    • README.md
      1.68 KB
    • README.txt
      3.37 KB
    • data/
      • SWE-bench_Verified.zip
        1.65 MB