HyperAI

Chinese DeepSeek R1 Distill Data 110k: A Chinese Dataset Distilled from DeepSeek-R1

Date

4 months ago

Size

231.15 MB

Publish URL

huggingface.co

License

Apache 2.0

* This dataset supports online use.

This is an open-source Chinese dataset distilled from the full-scale DeepSeek-R1 model. It contains not only math data but also a large amount of general-purpose data, for a total of 110K samples.

This dataset was released because R1 itself is very strong, and small models fine-tuned (SFT) on R1-distilled data also show strong performance, yet most open-source R1 distillation datasets are in English. The R1 report also notes that some general-scenario data was used when distilling the smaller models. To help the community better reproduce the results of R1-distilled models, this Chinese dataset has been open-sourced.

The data distribution in this Chinese dataset is as follows:

  • Math: 36,987 samples in total
  • Exam: 2,440 samples in total
  • STEM: 12,000 samples in total
  • General: 58,573 samples in total, including Ruozhiba (弱智吧), logical reasoning, Xiaohongshu, Zhihu, chat, etc.

Field description (see the loading example below):

  • input: the original question or prompt
  • reasoning_content: the reasoning process (chain of thought) distilled from R1
  • content: the final answer / output
  • repo_name: the data source
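
As a minimal sketch of how the fields can be read with the Hugging Face `datasets` library: the repo id, file name, and JSON Lines format below are assumptions and may need to be adjusted to the actual release.

```python
# Minimal sketch: load the distilled data and read the fields described above.
# Repo id, local path, and file format are assumptions, not confirmed by the release.
from datasets import load_dataset

# Option A: load directly from the Hugging Face Hub (repo id is an assumption).
# ds = load_dataset("Congliu/Chinese-DeepSeek-R1-Distill-data-110k", split="train")

# Option B: load the extracted local file (path and JSONL format are assumptions).
ds = load_dataset(
    "json",
    data_files="data/Chinese-DeepSeek-R1-Distill-110k.jsonl",
    split="train",
)

sample = ds[0]
print(sample["input"])              # the original question / prompt
print(sample["reasoning_content"])  # R1's chain of thought
print(sample["content"])            # the final distilled answer
print(sample["repo_name"])          # which source the sample came from
```

The `reasoning_content` / `content` split mirrors how R1-style APIs separate the thinking trace from the final answer, so SFT pipelines can decide whether to train on the reasoning, the answer, or both.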
Chinese-DeepSeek-R1-Distill-data-110k.torrent
  • Chinese-DeepSeek-R1-Distill-data-110k/
    • README.md
      1.74 KB
    • README.txt
      3.48 KB
    • data/
      • Chinese-DeepSeek-R1-Distill-110k.zip
        231.15 MB