MiniMind Large Language Model Training and Fine-tuning Dataset
Date: 2 months ago
Size: 8.08 GB
MiniMind is an open-source, lightweight large language model project that aims to lower the barrier to using large language models (LLMs) and to let individual users quickly train and run inference on ordinary devices.
MiniMind bundles several datasets: a tokenizer training corpus for training the tokenizer, Pretrain data for model pre-training, SFT data for supervised fine-tuning, and DPO data 1 and DPO data 2 for reward-model (preference) training. The data is integrated from different sources, such as SFT data from Jiangshu Technology and Qwen2.5 distillation data, totaling roughly 3B tokens, and is suitable for pre-training Chinese large language models.
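For orientation, here is a minimal sketch of how the unpacked data could be inspected. It assumes the archive extracts to JSONL files (one JSON object per line); the file name "pretrain_hq.jsonl" and the field name "text" are assumptions for illustration, so check the actual extracted file names and fields first.

```python
# Minimal sketch for peeking at a JSONL split of the MiniMind dataset.
# Assumptions: file "pretrain_hq.jsonl" exists and each line holds a JSON
# object with a "text" field; adjust names to match the real archive contents.
import json

def iter_jsonl(path):
    """Yield one JSON object per non-empty line of a JSONL file."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

# Print the first pre-training sample to see its structure.
for i, sample in enumerate(iter_jsonl("pretrain_hq.jsonl")):
    print(sample.get("text", sample))  # fall back to the raw object if no "text" field
    if i == 0:
        break
```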
minimind_dataset.torrent
Seeding: 1 · Downloading: 1 · Completed: 33 · Total Downloads: 49