SWE-bench Verified Code Generation Evaluation Benchmark Dataset
Dataset Introduction
SWE-bench Verified is a human-validated subset of the existing SWE-bench benchmark, designed to more reliably evaluate the ability of AI models to solve real-world software issues.
To improve the robustness and reliability of SWE-bench, OpenAI ran a human annotation campaign in which professional software developers screened every sample in the SWE-bench test set, checking that the scope of the unit tests is appropriate and that the problem description is clear and unambiguous.
Together with the authors of SWE-bench, OpenAI released SWE-bench Verified: a subset of the original SWE-bench test set containing 500 samples verified by human annotators. It is intended to replace the original SWE-bench and SWE-bench Lite test sets.
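As a concrete illustration (a minimal sketch, not part of the official release), the verified split can be loaded with the Hugging Face `datasets` library; the dataset id `princeton-nlp/SWE-bench_Verified` and the standard SWE-bench field names used below are assumptions about the hosted copy:

```python
# Minimal sketch: load SWE-bench Verified and inspect one task.
# Assumes the Hugging Face copy "princeton-nlp/SWE-bench_Verified"
# and the standard SWE-bench schema for field names.
from datasets import load_dataset

ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
print(len(ds))  # expected: 500 human-verified samples

sample = ds[0]
print(sample["instance_id"])               # repository/issue identifier
print(sample["repo"])                      # source GitHub repository
print(sample["problem_statement"][:300])   # issue text the model must resolve
print(sample["FAIL_TO_PASS"])              # tests that must newly pass after the fix
print(sample["PASS_TO_PASS"])              # tests that must keep passing
```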
On SWE-bench Verified, GPT-4o with the best-performing open-source scaffold, Agentless, resolves 33.2% of samples, more than double its score of 16% on the original SWE-bench.
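For reference, a sketch of how a "resolved" percentage of this kind is computed in SWE-bench-style evaluation: a task counts as resolved only if, after the model's patch is applied, every FAIL_TO_PASS test passes and every PASS_TO_PASS test still passes. The helper below and its per-instance result format are illustrative assumptions, not the official evaluation harness:

```python
# Illustrative sketch of the SWE-bench "resolved" criterion; the
# per-instance result dicts are hypothetical, not the harness output format.
from typing import Dict, List


def is_resolved(fail_to_pass: Dict[str, bool], pass_to_pass: Dict[str, bool]) -> bool:
    """Resolved only if all FAIL_TO_PASS tests now pass and all
    PASS_TO_PASS tests still pass after applying the patch."""
    return all(fail_to_pass.values()) and all(pass_to_pass.values())


def resolved_rate(results: List[dict]) -> float:
    """Fraction of resolved tasks, reported as a percentage."""
    resolved = sum(
        is_resolved(r["fail_to_pass"], r["pass_to_pass"]) for r in results
    )
    return 100.0 * resolved / len(results)


# Example: 166 resolved out of 500 verified tasks ≈ 33.2%.
print(resolved_rate(
    [{"fail_to_pass": {"t": True}, "pass_to_pass": {"u": True}}] * 166
    + [{"fail_to_pass": {"t": False}, "pass_to_pass": {"u": True}}] * 334
))
```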