
Skywork-SWE: Unveiling Data Scaling Laws for Software Engineering in LLMs

Liang Zeng, Yongcong Li, Yuzhen Xiao, Changshi Li, Chris Yuhao Liu, Rui Yan, Tianwen Wei, Jujie He, Xuchen Song, Yang Liu, Yahui Zhou
Abstract

Software engineering (SWE) has recently emerged as a crucial testbed for next-generation LLM agents, demanding inherent capabilities in two critical dimensions: sustained iterative problem-solving (e.g., >50 interaction rounds) and long-context dependency resolution (e.g., >32k tokens). However, the data curation process in SWE remains notoriously time-consuming, as it heavily relies on manual annotation for code file filtering and on the setup of dedicated runtime environments to execute and validate unit tests. Consequently, most existing datasets are limited to only a few thousand GitHub-sourced instances. To this end, we propose an incremental, automated data-curation pipeline that systematically scales both the volume and diversity of SWE datasets. Our dataset comprises 10,169 real-world Python task instances from 2,531 distinct GitHub repositories, each accompanied by a task specified in natural language and a dedicated runtime-environment image for automated unit-test validation. We have carefully curated over 8,000 successfully runtime-validated training trajectories from our proposed SWE dataset. When fine-tuning the Skywork-SWE model on these trajectories, we uncover a striking data scaling phenomenon: the trained model's software engineering performance continues to improve as the data size increases, showing no signs of saturation. Notably, our Skywork-SWE model achieves 38.0% pass@1 accuracy on the SWE-bench Verified benchmark without using verifiers or multiple rollouts, establishing a new state-of-the-art (SOTA) among Qwen2.5-Coder-32B-based LLMs built on the OpenHands agent framework. Furthermore, with the incorporation of test-time scaling techniques, the performance further improves to 47.0% accuracy, surpassing the previous SOTA results for sub-32B parameter models. We release the Skywork-SWE-32B model checkpoint to accelerate future research.
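To make the headline metric concrete: with a single rollout per task, pass@1 is simply the fraction of benchmark instances the agent resolves. A minimal sketch using the standard unbiased pass@k estimator (the per-task outcomes below are hypothetical, not from the paper):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one
    of k samples drawn from n generations (c of them correct) passes."""
    if n - c < k:
        return 1.0  # fewer failures than samples, so a pass is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# With one rollout per task (n = 1, k = 1), pass@1 reduces to the
# fraction of resolved instances.
results = [True, False, True, True]  # hypothetical per-task outcomes
score = sum(pass_at_k(1, int(ok), 1) for ok in results) / len(results)
# score == 0.75, i.e. 75% pass@1
```

Test-time scaling (e.g., sampling multiple candidate patches and selecting among them) trades extra inference compute for a higher resolution rate, which is how the reported accuracy rises from 38.0% to 47.0%.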