OpenCodeReasoning Programming Reasoning Dataset
OpenCodeReasoning is a large-scale synthetic programming reasoning dataset released by NVIDIA in 2025. It provides high-quality reasoning-based training data for large language models (LLMs), with the aim of improving their code generation and logical reasoning capabilities. The accompanying paper is "OpenCodeReasoning: Advancing Data Distillation for Competitive Coding".
The dataset contains 735,255 samples covering 28,319 unique programming questions, making it one of the largest reasoning-based programming datasets currently available.
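As a rough sense of scale, the two figures above imply about 26 samples per question on average. The short Python sketch below only performs that division; the constant and function names are illustrative and are not part of the dataset or its tooling.

```python
# Figures quoted in the dataset description above.
TOTAL_SAMPLES = 735_255
UNIQUE_QUESTIONS = 28_319


def average_samples_per_question(samples: int, questions: int) -> float:
    """Return the mean number of samples per unique question."""
    if questions <= 0:
        raise ValueError("questions must be positive")
    return samples / questions


if __name__ == "__main__":
    avg = average_samples_per_question(TOTAL_SAMPLES, UNIQUE_QUESTIONS)
    print(f"{avg:.2f} samples per question on average")
```

This is only a mean: the description does not say how samples are distributed across individual questions, so some questions may have far more or far fewer responses.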
Data source:
- Questions are drawn from 11 mainstream programming platforms, including CodeForces, CodeChef, and LeetCode, as well as public datasets such as TACO, APPS, and CodeContests.
- The code responses were generated by the DeepSeek-R1 model, so that the data and its reasoning style stay consistent across samples.