PromptCoT-2.0-SFT-4.8M: Supervised Fine-Tuning Prompt Dataset
PromptCoT-2.0-SFT-4.8M is a large-scale synthetic prompt dataset released in 2025 by a research team from the University of Hong Kong and Ant Group. The accompanying paper is "PromptCoT 2.0: Scaling Prompt Synthesis for Large Language Model Reasoning". The dataset aims to provide a high-quality corpus of reasoning prompts for fine-tuning or self-training large language models. It contains approximately 4.8 million fully synthetic prompts with reasoning trajectories, intended for both supervised fine-tuning and self-play scenarios, and covers two major reasoning areas: mathematics and programming.
Data composition:
- In the supervised fine-tuning (SFT) scenario, a total of 4,766,890 prompts were synthesized, including:
  - 1,188,505 programming task prompts
  - 3,578,385 math task prompts
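As a quick sanity check on the figures above, the following minimal sketch confirms that the two subsets add up to the stated total and reports each subset's share. The constant and function names are illustrative only and are not part of the dataset or its tooling.

```python
# Counts copied from the data composition list above.
CODE_PROMPTS = 1_188_505
MATH_PROMPTS = 3_578_385
STATED_TOTAL = 4_766_890


def composition(counts):
    """Return the total count and each category's fraction of that total."""
    total = sum(counts.values())
    shares = {name: n / total for name, n in counts.items()}
    return total, shares


total, shares = composition({"programming": CODE_PROMPTS, "math": MATH_PROMPTS})

# The two subsets should account for every SFT prompt.
assert total == STATED_TOTAL

for name, share in shares.items():
    print(f"{name}: {share:.1%}")
# prints:
# programming: 24.9%
# math: 75.1%
```

Roughly one quarter of the SFT prompts are programming tasks and three quarters are math tasks.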
Citation
@article{zhao2025promptcot2,
  title   = {PromptCoT 2.0: Scaling Prompt Synthesis for Large Language Model Reasoning},
  author  = {Zhao, Xueliang and Wu, Wei and Guan, Jian and Gong, Zhuocheng and Kong, Lingpeng},
  journal = {arXiv preprint arXiv:2509.19894},
  year    = {2025},
  url     = {https://arxiv.org/abs/2509.19894}
}