HyperAIHyperAI

Command Palette

Search for a command to run...

Magpie-Pro-300K-Filtered high-quality Alignment Dataset

Date

a year ago

Size

469.91 MB

Organization

Allen Institute for Artificial Intelligence
University of Washington

Paper URL

arxiv.org

Featured Image

The Magpie-Pro-300K-Filtered dataset is a high-quality instruction dataset synthesized using the Magpie method, which is extracted from Llama-3 70B. This dataset contains about 300k high-quality dialogues and is generated through an automated self-synthesis process that exploits the autoregressive properties of aligned LLMs to generate user queries and corresponding replies.

This dataset is provided by Llama 3 70B Instruct use Magpie Generate. SeepaperandCodebasefor details.

This is the filtered data. Please do not use both Magpie-Pro-300K-Filtered and Magpie-Pro-MT-300K to fine-tune the model, as they are roughly the same in the first round.

Dataset background

The Magpie-align project is a self-synthesis method for synthesizing high-quality instruction data directly from large language models (LLMs) themselves, named Magpie. The key idea of the project is to leverage the autoregressive properties of aligned LLMs (such as Llama-3-Instruct) to generate user queries by simply inputting a pre-query template. With this approach, Magpie is able to generate millions of instructions and their corresponding responses, and select high-quality instances from them to form a dataset.

Magpie-Pro-300K-Filtered.torrent
Seeding 1Downloading 0Completed 152Total Downloads 213
  • Magpie-Pro-300K-Filtered/
    • README.md
      1.91 KB
    • README.txt
      3.83 KB
      • data/
        • Magpie-Pro-300K-Filtered.zip
          469.91 MB

Build AI with AI

From idea to launch — accelerate your AI development with free AI co-coding, out-of-the-box environment and best price of GPUs.

AI Co-coding
Ready-to-use GPUs
Best Pricing
Get Started

Hyper Newsletters

Subscribe to our latest updates
We will deliver the latest updates of the week to your inbox at nine o'clock every Monday morning
Powered by MailChimp