Magpie-Pro-300K-Filtered High-quality Alignment Dataset

Date

10 months ago

Size

469.91 MB

Organization

Allen Institute for Artificial Intelligence
University of Washington

Publish URL

huggingface.co

The Magpie-Pro-300K-Filtered dataset is a high-quality instruction dataset synthesized with the Magpie method from Llama-3 70B Instruct. It contains about 300K high-quality dialogues produced by an automated self-synthesis process that exploits the autoregressive properties of aligned LLMs to generate user queries and their corresponding responses.

This dataset was generated by Llama 3 70B Instruct using Magpie. See the paper and codebase for details.

This is the filtered data. Please do not use both Magpie-Pro-300K-Filtered and Magpie-Pro-MT-300K to fine-tune a model, as their first-turn data is roughly the same.
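
The dataset is published on the Hugging Face Hub, so it can be loaded with the datasets library. The sketch below assumes the repository id "Magpie-Align/Magpie-Pro-300K-Filtered"; adjust it if the hosting path differs.

```python
# Minimal sketch: load the filtered dataset from the Hugging Face Hub.
# The repository id below is an assumption based on the dataset name.
from datasets import load_dataset

dataset = load_dataset("Magpie-Align/Magpie-Pro-300K-Filtered", split="train")

print(dataset)      # row count and column names
print(dataset[0])   # inspect one instruction/response pair
```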

Dataset background

The Magpie-Align project introduces Magpie, a self-synthesis method for generating high-quality instruction data directly from aligned large language models (LLMs). The key idea is to exploit the autoregressive property of aligned LLMs (such as Llama-3-Instruct): given only a pre-query template as input, the model completes it with a user query of its own. With this approach, Magpie can generate millions of instructions and their corresponding responses, from which high-quality instances are selected to form the dataset.
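
As a rough illustration of this pre-query trick, the sketch below feeds a Llama-3 Instruct model only the tokens that precede a user turn and lets it autoregressively invent a query, then sends that query back through the normal chat template to obtain a response. The exact template string, model checkpoint, and sampling settings are assumptions for illustration, not the project's reference implementation.

```python
# Sketch of Magpie-style self-synthesis, assuming the Llama-3 chat format and
# the meta-llama/Meta-Llama-3-70B-Instruct checkpoint (any aligned Llama-3
# Instruct model could be substituted).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-70B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Step 1: give the model only the pre-query template (everything up to where a
# user message would begin) and let it generate a synthetic user query.
pre_query = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
inputs = tokenizer(pre_query, return_tensors="pt", add_special_tokens=False).to(model.device)
query_ids = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=1.0)
query = tokenizer.decode(query_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

# Step 2: feed the synthesized query back through the standard chat template to
# obtain the corresponding response.
chat = [{"role": "user", "content": query}]
prompt_ids = tokenizer.apply_chat_template(
    chat, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
response_ids = model.generate(prompt_ids, max_new_tokens=512, do_sample=False)
response = tokenizer.decode(response_ids[0][prompt_ids.shape[1]:], skip_special_tokens=True)

print({"instruction": query, "response": response})
```

In the actual pipeline, pairs generated this way are filtered for quality before being released as datasets such as Magpie-Pro-300K-Filtered.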

Magpie-Pro-300K-Filtered.torrent
Seeding 2 · Downloading 1 · Completed 58 · Total Downloads 70
  • Magpie-Pro-300K-Filtered/
    • README.md
      1.91 KB
    • README.txt
      3.83 KB
    • data/
      • Magpie-Pro-300K-Filtered.zip
        469.91 MB