Date

2 years ago

Size

88.8 MB

Organization

Tags

LLM

Natural Language Processing

Text Generation

Model Training

COIG-CQIA stands for Chinese Open Instruction Generalist – Quality is All You Need. It is an open-source, high-quality instruction fine-tuning dataset that aims to provide the Chinese NLP community with instruction fine-tuning data consistent with human interaction behavior. COIG-CQIA uses questions, answers, and articles collected from the Chinese Internet as raw data, and is constructed through deep cleaning, restructuring, and manual review.

The project is inspired by studies such as LIMA: Less Is More for Alignment, which suggest that a large language model can learn human interaction behaviors from a small amount of high-quality data. The data construction therefore pays close attention to the source, quality, and diversity of the data. For details of the dataset, see the data introduction and the research team's paper.

Data Collection

The research team collected a large amount of manually written text from multiple sources on the Chinese Internet to ensure the diversity and richness of the data.
The sources include question-and-answer communities (such as Zhihu, Sifou (SegmentFault), Douban, Xiaohongshu, and Chiba), wiki-style knowledge platforms (such as Baidu Encyclopedia), examination materials (such as middle and high school entrance examination questions and professional qualification examination questions), and existing NLP datasets.
During collection, the team focused on selecting data that reflects the real interaction patterns of Chinese users, to improve the model's understanding of real-world language use.
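
The collect-then-clean flow described above can be illustrated with a minimal sketch. This is not the research team's pipeline: the record fields (`question`, `answer`, `source`), the `InstructionRecord` type, and the `min_answer_chars` threshold are all hypothetical names chosen for illustration, and the manual review step is not modeled.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class InstructionRecord:
    """One instruction/response pair (hypothetical schema, not the dataset's)."""
    instruction: str
    output: str
    source: str  # hypothetical origin label, e.g. "zhihu" or "exam"


def clean_text(text: str) -> str:
    """Strip leftover HTML tags and collapse runs of whitespace."""
    text = re.sub(r"<[^>]+>", "", text)
    return re.sub(r"\s+", " ", text).strip()


def build_records(raw_items, min_answer_chars=20):
    """Clean raw Q&A pairs, then drop short answers and duplicate questions.

    The length threshold is an assumed heuristic, not a value from the paper.
    """
    seen_questions = set()
    records = []
    for item in raw_items:
        question = clean_text(item["question"])
        answer = clean_text(item["answer"])
        # Skip empty questions and answers too short to be informative.
        if not question or len(answer) < min_answer_chars:
            continue
        # Keep only the first occurrence of each cleaned question.
        if question in seen_questions:
            continue
        seen_questions.add(question)
        records.append(InstructionRecord(question, answer, item["source"]))
    return records
```

Exact-match deduplication is the simplest choice here; a real pipeline would likely need fuzzier matching and human review on top of filters like these.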

Citation

@misc{bai2024coig,
title={COIG-CQIA: Quality is All You Need for Chinese Instruction Fine-tuning},
author={Bai, Yuelin and Du, Xinrun and Liang, Yiming and Jin, Yonggang and Liu, Ziqiang and Zhou, Junting and Zheng, Tianyu and Zhang, Xincheng and Ma, Nuo and Wang, Zekun and others},
year={2024},
eprint={2403.18058},
archivePrefix={arXiv},
primaryClass={cs.CL}
}

COIG-CQIA.torrent

Seeding: 1 | Downloading: 0 | Completed: 338 | Total Downloads: 603

COIG-CQIA/
- README.md
  1.4 KB
- README.txt
  2.81 KB

This dataset is contributed by community users and is intended for educational and informational purposes only. If any content involves copyright infringement, please contact us at [email protected] for prompt review and removal.
