InfinityInstruct-3M: First Release Toward a Ten-Million Instruction Fine-Tuning Dataset
InfinityInstruct is a large-scale, high-quality, open-source instruction fine-tuning dataset project launched by the Beijing Academy of Artificial Intelligence (BAAI). The project aims to build a dataset of tens of millions of instructions to support the instruction-following capabilities of large language models and thereby improve model performance.
The current release is the InfinityInstruct-3M instruction dataset; the final version is expected at the end of June.
Features of InfinityInstruct include:
- Large-scale data: The project plans to release tens of millions of instruction examples; 3 million Chinese and English instruction examples have been released in the first phase.
- High-quality screening: BAAI performs domain analysis and quality screening on existing open-source data to ensure its value, and augments the data in underrepresented domains.
- Open-source community contributions: The dataset draws on a large volume of instruction data contributed by the open-source community, including datasets from multiple sources such as OpenHermes-2.5, UltraInteract_sft, and CodeBagel.
- Risk assessment and data generation: The project team is currently conducting risk assessment and data generation, and expects to release the final version, containing 10 million instructions, by the end of June.
- Performance improvements: The currently open-sourced 3-million-instruction dataset has already demonstrated SFT (supervised fine-tuning) performance surpassing that of existing datasets such as OpenHermes when fine-tuning models such as Mistral.
- Future outlook: Once the data volume grows to tens of millions, dialogue models trained on the instruction fine-tuning dataset are expected to approach GPT-4-level performance.
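The project's actual screening pipeline is not described in detail, but the kind of quality screening mentioned above can be sketched in a few lines. This is a minimal, hypothetical example: the record keys (`instruction`, `output`) and the length thresholds are assumptions for illustration, not the project's real criteria.

```python
# Hypothetical sketch of instruction-data quality screening.
# Record schema and thresholds are illustrative assumptions only.

def screen_records(records, min_output_chars=20, max_instruction_chars=2000):
    """Keep records with a non-trivial response, a bounded instruction,
    and no exact duplicates."""
    seen = set()
    kept = []
    for rec in records:
        instruction = rec.get("instruction", "").strip()
        output = rec.get("output", "").strip()
        # Drop empty or excessively long instructions.
        if not instruction or len(instruction) > max_instruction_chars:
            continue
        # Drop trivially short responses.
        if len(output) < min_output_chars:
            continue
        # Drop exact duplicate instruction/response pairs.
        key = (instruction, output)
        if key in seen:
            continue
        seen.add(key)
        kept.append(rec)
    return kept

sample = [
    {"instruction": "Explain what supervised fine-tuning (SFT) is.",
     "output": "Supervised fine-tuning trains a pretrained model on "
               "instruction-response pairs so it learns to follow instructions."},
    {"instruction": "Explain what supervised fine-tuning (SFT) is.",
     "output": "Supervised fine-tuning trains a pretrained model on "
               "instruction-response pairs so it learns to follow instructions."},
    {"instruction": "Say hi.", "output": "Hi."},
]
print(len(screen_records(sample)))  # -> 1 (one duplicate and one short response removed)
```

Real pipelines typically add model-based scoring and semantic deduplication on top of rule-based filters like these, but the structure (filter, then deduplicate) stays the same.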
The development and release of the InfinityInstruct dataset is of great significance for promoting the research and application of large language models. It provides rich instruction data for large models, which helps improve the model's ability to understand and execute instructions. At the same time, its open source nature also promotes collaboration and knowledge sharing in the AI community.