Magpie-Pro-300K-Filtered High-quality Alignment Dataset

Date

10 months ago

Size

469.91 MB

Organization

Allen Institute for Artificial Intelligence
University of Washington

Publish URL

huggingface.co

The Magpie-Pro-300K-Filtered dataset is a high-quality instruction dataset synthesized with the Magpie method from Llama-3 70B Instruct. It contains about 300K high-quality dialogues produced by an automated self-synthesis process that exploits the autoregressive properties of aligned LLMs to generate user queries and their corresponding responses.

This dataset was generated by Llama 3 70B Instruct using Magpie. See the paper and codebase for details.

This is the filtered data. Please do not use both Magpie-Pro-300K-Filtered and Magpie-Pro-MT-300K to fine-tune a model, as their first-turn data is roughly the same.
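
The dataset is published on the Hugging Face Hub, so it can be loaded with the datasets library. The sketch below assumes the repository id "Magpie-Align/Magpie-Pro-300K-Filtered"; adjust it if the hosting path differs.

```python
# Minimal sketch: load the filtered dataset from the Hugging Face Hub.
# The repository id below is an assumption based on the dataset name.
from datasets import load_dataset

dataset = load_dataset("Magpie-Align/Magpie-Pro-300K-Filtered", split="train")

print(dataset)      # row count and column names
print(dataset[0])   # inspect one instruction/response pair
```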

Dataset background

The Magpie-Align project introduces Magpie, a self-synthesis method for generating high-quality instruction data directly from aligned large language models (LLMs). The key idea is to exploit the autoregressive property of aligned LLMs (such as Llama-3-Instruct): given only a pre-query template as input, the model completes it with a user query of its own. With this approach, Magpie can generate millions of instructions and their corresponding responses, from which high-quality instances are selected to form the dataset.
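
As a rough illustration of this pre-query trick, the sketch below feeds a Llama-3 Instruct model only the tokens that precede a user turn and lets it autoregressively invent a query, then sends that query back through the normal chat template to obtain a response. The exact template string, model checkpoint, and sampling settings are assumptions for illustration, not the project's reference implementation.

```python
# Sketch of Magpie-style self-synthesis, assuming the Llama-3 chat format and
# the meta-llama/Meta-Llama-3-70B-Instruct checkpoint (any aligned Llama-3
# Instruct model could be substituted).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-70B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Step 1: give the model only the pre-query template (everything up to where a
# user message would begin) and let it generate a synthetic user query.
pre_query = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
inputs = tokenizer(pre_query, return_tensors="pt", add_special_tokens=False).to(model.device)
query_ids = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=1.0)
query = tokenizer.decode(query_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

# Step 2: feed the synthesized query back through the standard chat template to
# obtain the corresponding response.
chat = [{"role": "user", "content": query}]
prompt_ids = tokenizer.apply_chat_template(
    chat, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
response_ids = model.generate(prompt_ids, max_new_tokens=512, do_sample=False)
response = tokenizer.decode(response_ids[0][prompt_ids.shape[1]:], skip_special_tokens=True)

print({"instruction": query, "response": response})
```

In the actual pipeline, pairs generated this way are filtered for quality before being released as datasets such as Magpie-Pro-300K-Filtered.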

Magpie-Pro-300K-Filtered.torrent
Seeding 2 · Downloading 1 · Completed 58 · Total Downloads 70
  • Magpie-Pro-300K-Filtered/
    • README.md
      1.91 KB
    • README.txt
      3.83 KB
    • data/
      • Magpie-Pro-300K-Filtered.zip
        469.91 MB