Skywork-SWE: Unveiling Data Scaling Laws for Software Engineering in LLMs

Software engineering (SWE) has recently emerged as a crucial testbed for next-generation LLM agents, demanding inherent capabilities in two critical dimensions: sustained iterative problem-solving (e.g., >50 interaction rounds) and long-context dependency resolution (e.g., >32k tokens). However, the data curation process in SWE remains notoriously time-consuming, as it heavily relies on manual annotation for code file filtering and on the setup of dedicated runtime environments to execute and validate unit tests. Consequently, most existing datasets are limited to only a few thousand GitHub-sourced instances. To this end, we propose an incremental, automated data-curation pipeline that systematically scales both the volume and diversity of SWE datasets. Our dataset comprises 10,169 real-world Python task instances from 2,531 distinct GitHub repositories, each accompanied by a task specified in natural language and a dedicated runtime-environment image for automated unit-test validation. We have carefully curated over 8,000 successfully runtime-validated training trajectories from our proposed SWE dataset. When fine-tuning the Skywork-SWE model on these trajectories, we uncover a striking data scaling phenomenon: the trained model's software engineering capability continues to improve as the data size increases, showing no signs of saturation. Notably, our Skywork-SWE model achieves 38.0% pass@1 accuracy on the SWE-bench Verified benchmark without using verifiers or multiple rollouts, establishing a new state-of-the-art (SOTA) among Qwen2.5-Coder-32B-based LLMs built on the OpenHands agent framework. Furthermore, with the incorporation of test-time scaling techniques, performance further improves to 47.0% accuracy, surpassing the previous SOTA results for sub-32B parameter models. We release the Skywork-SWE-32B model checkpoint to accelerate future research.
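
The abstract describes task instances paired with dedicated runtime-environment images used for automated unit-test validation. The sketch below is only a minimal illustration of that validation step, not the paper's actual pipeline: it assumes a per-task Docker image and treats a zero exit status from the test command as a runtime-validated instance. The image tag, repository path, and test command shown are hypothetical placeholders.

```python
import subprocess

def validate_task_instance(image_tag: str, repo_dir: str, test_cmd: str = "pytest -q") -> bool:
    """Run a task's unit tests inside its dedicated runtime image.

    Hypothetical sketch: the real pipeline's image naming, mounting scheme,
    and test selection are defined by the paper's curation tooling, which is
    not shown here.
    """
    result = subprocess.run(
        [
            "docker", "run", "--rm",
            "-v", f"{repo_dir}:/workspace",   # mount the checked-out repository
            "-w", "/workspace",               # run tests from the repository root
            image_tag,
            "bash", "-lc", test_cmd,          # execute the unit-test command
        ],
        capture_output=True,
        text=True,
        timeout=1800,                         # guard against hanging test suites
    )
    # A zero exit status means the selected unit tests passed, so the task
    # instance counts as runtime-validated in this simplified sketch.
    return result.returncode == 0

if __name__ == "__main__":
    ok = validate_task_instance("example/task-image:latest", "./example-repo")
    print("validated" if ok else "failed")
```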