SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents

LLM-based agents have shown promising capabilities in a growing range of software engineering (SWE) tasks. However, advancing this field faces two critical challenges. First, high-quality training data is scarce, especially data that reflects real-world SWE scenarios, where agents must interact with development environments, execute code, and adapt behavior based on the outcomes of their actions. Existing datasets are either limited to one-shot code generation or comprise small, manually curated collections of interactive tasks, lacking both scale and diversity. Second, the lack of fresh interactive SWE tasks affects the evaluation of rapidly improving models, as static benchmarks quickly become outdated due to contamination. To address these limitations, we introduce a novel, automated, and scalable pipeline that continuously extracts real-world interactive SWE tasks from diverse GitHub repositories. Using this pipeline, we construct SWE-rebench, a public dataset comprising over 21,000 interactive Python-based SWE tasks, suitable for reinforcement learning of SWE agents at scale. Additionally, we use a continuous supply of fresh tasks, collected with the SWE-rebench methodology, to build a contamination-free benchmark for agentic software engineering. We compare the results of various LLMs on this benchmark with their results on SWE-bench Verified and show that the performance of some language models may be inflated by contamination.