HyperAI

COIG-CQIA High-quality Chinese Instruction Fine-tuning Dataset

Date

a year ago

Size

88.8 MB

Organization

01.AI

Publish URL

huggingface.co

COIG-CQIA stands for Chinese Open Instruction Generalist – Quality is All You Need. It is an open-source, high-quality instruction fine-tuning dataset that aims to provide the Chinese NLP community with instruction fine-tuning data consistent with human interaction behavior. COIG-CQIA uses question-and-answer posts and articles from the Chinese Internet as raw data, and is built through deep cleaning, restructuring, and manual review.

This project is inspired by studies such as LIMA: Less Is More for Alignment, which suggest that a large language model can learn human interaction behaviors from a small amount of high-quality data. The data construction therefore pays close attention to the source, quality, and diversity of the data. For details of the dataset, see the data introduction and the research team's paper.

Data Collection

  • The research team collected a large amount of human-written text from multiple sources on the Chinese Internet to ensure the diversity and richness of the data.
  • The sources include question-and-answer communities (such as Zhihu, SegmentFault, Douban, Xiaohongshu, and Ruozhiba), wiki-like knowledge platforms (such as Baidu Encyclopedia), various types of examination materials (such as middle and high school entrance examination questions and professional qualification examination questions), and existing NLP datasets.
  • When collecting data, the team focused on selecting data that reflects the real interaction patterns of Chinese users, to improve the model's understanding of real-world language use.

COIG-CQIA.torrent
Seeding: 2 | Downloading: 0 | Completed: 233 | Total downloads: 425
  • COIG-CQIA/
    • README.md (1.4 KB)
    • README.txt (2.81 KB)
    • data/
      • coig.zip (88.8 MB)
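
Once the archive is downloaded, its contents can be inspected with a few lines of Python. The following is a minimal sketch, not an official loader: this page does not state how the files inside coig.zip are laid out, so the code assumes (hypothetically) that the archive holds JSON Lines files ending in `.jsonl`, one record per line. The function name `iter_records` is made up for illustration.

```python
import json
import zipfile


def iter_records(zip_path):
    """Yield (member_name, record) pairs from every .jsonl member of a zip archive.

    Assumption: the archive stores its data as UTF-8 JSON Lines files.
    Members with any other extension are skipped.
    """
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if not name.endswith(".jsonl"):
                continue
            with archive.open(name) as handle:
                for raw_line in handle:
                    line = raw_line.decode("utf-8").strip()
                    if line:  # ignore blank lines between records
                        yield name, json.loads(line)
```

If the assumption about the format holds, `next(iter_records("COIG-CQIA/data/coig.zip"))` would return the first record together with the name of the file it came from. If the archive uses a different layout, the README files in the torrent are the place to check.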