HyperAI

KodCode-V1 Coding Synthetic Dataset

Date

2 months ago

Size

1.99 GB

Organization

Microsoft
University of Washington

License

CC BY 4.0

KodCode was released in 2025 by researchers from Microsoft GenAI, the University of Washington, and the University of Texas at Austin. The related paper is "KodCode: A Diverse, Challenging, and Verifiable Synthetic Dataset for Coding".

The dataset is the largest fully synthetic open-source dataset that provides verifiable solutions and tests for coding tasks. It contains 12 subsets covering a range of domains (from algorithms to package-specific knowledge) and difficulty levels (from basic coding exercises to interview and competitive programming challenges), and is designed for supervised fine-tuning (SFT) and reinforcement learning (RL) tuning.

This figure illustrates the 3-step process of generating KodCode-V1: coding problem synthesis, solution and test generation, and post-training data synthesis. The final KodCode-V1 dataset contains 447K verified problem-solution-test triplets. The distribution of each subset is shown on the right.
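To make the idea of a "verified" problem-solution-test triplet concrete, the following is a minimal sketch of how a single triplet could be checked. It is not the authors' pipeline. The field names `question`, `solution`, and `test`, the helper `passes_tests`, and the sample triplet are all hypothetical, and the sketch assumes the tests are plain Python functions named `test_*` that use `assert`.

```python
from typing import Dict


def passes_tests(solution_code: str, test_code: str) -> bool:
    """Return True if every test_* function in test_code passes against solution_code.

    Hypothetical helper for illustration only. exec() runs arbitrary code,
    so real data should be checked inside a sandbox.
    """
    namespace: Dict[str, object] = {}
    try:
        exec(solution_code, namespace)  # define the candidate solution
        exec(test_code, namespace)      # define the test functions
        tests = [
            fn for name, fn in namespace.items()
            if name.startswith("test_") and callable(fn)
        ]
        if not tests:
            return False  # no tests means nothing was verified
        for test in tests:
            test()  # a failing assert raises AssertionError
        return True
    except Exception:
        return False


# Hypothetical triplet; the field names are assumptions, not the dataset schema.
triplet = {
    "question": "Write a function add(a, b) that returns the sum of two integers.",
    "solution": "def add(a, b):\n    return a + b\n",
    "test": "def test_add():\n    assert add(2, 3) == 5\n    assert add(-1, 1) == 0\n",
}

is_verified = passes_tests(triplet["solution"], triplet["test"])
```

A triplet would be kept only when `is_verified` is true. A solution that fails any assertion, or raises while being defined, is rejected.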
KodCode-V1.torrent
  • KodCode-V1/
    • README.md
      1.61 KB
    • README.txt
      3.21 KB
      • data/
        • KodCode-V1.zip
          1.99 GB