Magpie-Pro-300K-Filtered: High-Quality Alignment Dataset

The Magpie-Pro-300K-Filtered dataset is a high-quality instruction dataset synthesized from Llama-3 70B using the Magpie method. It contains roughly 300K high-quality dialogues, produced by an automated self-synthesis process that exploits the autoregressive properties of aligned LLMs to generate user queries and their corresponding responses.
This dataset was generated by Llama 3 70B Instruct using Magpie. See the paper and codebase for details.
This is the filtered dataset. Please do not use both Magpie-Pro-300K-Filtered and Magpie-Pro-MT-300K to fine-tune a model, as their first-turn data largely overlaps.
Dataset background
The Magpie-align project introduces Magpie, a self-synthesis method for generating high-quality instruction data directly from aligned large language models (LLMs). The key idea is to leverage the autoregressive properties of aligned LLMs (such as Llama-3-Instruct): given only a pre-query template as input, the model generates a plausible user query as its continuation. With this approach, Magpie can generate millions of instructions and their corresponding responses, then select high-quality instances from them to form a dataset.
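The two-step loop described above can be sketched in a few lines. This is a minimal illustration, not the project's actual pipeline: the template string follows the Llama-3-Instruct chat format, and `generate` is a hypothetical stand-in for a real model call (e.g. via vLLM or transformers).

```python
# Sketch of Magpie-style self-synthesis (assumptions: Llama-3 chat format,
# a `generate` callable standing in for an actual LLM inference call).

PRE_QUERY_TEMPLATE = (
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
)
EOT = "<|eot_id|>"


def synthesize_pair(generate):
    """Produce one (query, response) pair via the two-step Magpie loop."""
    # Step 1: fed only the bare user header, an aligned model's
    # continuation is itself a user query.
    query = generate(PRE_QUERY_TEMPLATE).split(EOT)[0].strip()

    # Step 2: wrap that query in the full chat template and generate
    # the assistant turn to obtain the corresponding response.
    prompt = (
        PRE_QUERY_TEMPLATE + query + EOT
        + "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )
    response = generate(prompt).split(EOT)[0].strip()
    return query, response
```

Repeating this loop with sampling enabled yields many distinct query/response pairs, which are then filtered for quality to build the final dataset.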