HyperAI

RedGPT: A Dialogue Generation Model Enhanced by Reference Information

RedGPT (Reference-Enlightened-Dialogue GPT) is a dialogue generation model enhanced by reference information.

As is well known, factual correctness is a major weakness of ChatGPT and a major challenge for anyone attempting to reproduce it. To improve factual correctness, a large amount of factual dialogue data (covering domains such as people, technology, medicine, law, and art) can be annotated for fine-tuning a GPT model. To avoid the high cost of manual annotation, we propose a method for automatically generating factual dialogues, and we are releasing part of our data. The first public batch (RedGPT-Dataset-V1-CN) contains 50,000 multi-round dialogues in Chinese.

Method Introduction

The goal of this dataset is to automatically generate massive, high-quality, factual multi-round dialogues for training GPT and improving its factual correctness.

We automatically generate data using the following method:

  1. Collect high-quality factual documents, which we call References. Their sources can be e-books, Wikipedia, and high-quality vertical websites. The documents should cover as many topics as possible, including but not limited to people, institutions, technology, medicine, law, humanities, economics, home, automobiles, travel, food, fashion, sports, education, and pets.
  2. Use an existing LLM (e.g., a paid API) to generate multi-round dialogues. The input is a Reference, and the prompt is similar to "Please generate multi-round questions and answers based on this article." The API outputs a multi-round dialogue. This step converts documents that were originally suitable only for pre-training into multi-round dialogues usable for fine-tuning.
  3. Through step 2, a large number of Reference-Dialogue pairs are collected. Using the Reference and Prompt as input and the Dialogue as the target, fine-tune a GPT model (based on a pre-trained base such as LLaMA or BLOOM). We call the fine-tuned model Reference-Enlightened-Dialogue GPT, abbreviated RedGPT. With RedGPT, you can generate multi-round dialogues from any Reference and obtain massive amounts of data.
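The data-generation loop in steps 2 and 3 can be sketched as follows. This is a minimal illustration, not the authors' actual pipeline: `call_llm` is a hypothetical stand-in for whatever paid API is used, and the exact prompt wording is an assumption based on the description above.

```python
# Sketch of step 2: turn a factual document (Reference) into a
# multi-round dialogue, producing one Reference-Dialogue training pair.

# Assumed prompt template, paraphrasing the example given in the text.
PROMPT_TEMPLATE = (
    "Please generate multi-round questions and answers based on this article.\n\n"
    "Article:\n{reference}"
)

def call_llm(prompt: str) -> str:
    # Hypothetical placeholder: replace with a real API call
    # (e.g., a chat-completion endpoint of your chosen provider).
    return "Q1: ...\nA1: ...\nQ2: ...\nA2: ..."

def build_pair(reference: str) -> dict:
    """Produce one Reference-Dialogue pair for fine-tuning RedGPT."""
    prompt = PROMPT_TEMPLATE.format(reference=reference)
    dialogue = call_llm(prompt)
    # Fine-tuning input = Reference + Prompt; target = Dialogue.
    return {"reference": reference, "prompt": prompt, "dialogue": dialogue}

pair = build_pair("Marie Curie was a physicist and chemist ...")
```

Collecting many such pairs yields the supervised data on which the RedGPT model itself is fine-tuned.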

To reproduce this method, note two key points:

  1. The quality and breadth of the References. The reference content must be high quality, e.g., pages from reputable vertical websites (such as medical sites) and non-obscure Wikipedia entries, and the web pages need to be cleaned. The references must also be broad, not limited to a single vertical domain or a single website.
  2. When calling an existing LLM, write the prompt carefully and try various phrasings until the multi-round dialogues the LLM generates meet your expectations.
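Prompt iteration in point 2 above can be paired with a simple automatic sanity check, so that only outputs that actually look like multi-round dialogues are kept. The sketch below is illustrative: the prompt variants and the turn-count heuristic are assumptions, not part of the released method.

```python
# Minimal sketch: try several prompt phrasings and keep only LLM outputs
# that contain at least a minimum number of question/answer turns.
import re

# Hypothetical prompt variants to compare against each other.
PROMPT_VARIANTS = [
    "Please generate multi-round questions and answers based on this article.",
    "Read the article and write a multi-round Q&A grounded strictly in its facts.",
]

def looks_like_dialogue(text: str, min_turns: int = 2) -> bool:
    """Accept only outputs with at least `min_turns` Q/A pairs (assumed format: Q1:/A1: lines)."""
    questions = re.findall(r"^Q\d*[:：]", text, flags=re.MULTILINE)
    answers = re.findall(r"^A\d*[:：]", text, flags=re.MULTILINE)
    return min(len(questions), len(answers)) >= min_turns

ok = looks_like_dialogue("Q1: Who?\nA1: Curie.\nQ2: Field?\nA2: Physics.")
```

A filter like this lets you compare prompt variants by their pass rate before committing to large-scale generation.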
RedGPT.torrent
Seeding 2 · Downloading 0 · Completed 119 · Total Downloads 275
  • RedGPT/
    • README.md
      2.94 KB
    • README.txt
      5.88 KB
      • data/
        • LICENSE
          16.97 KB
        • README.md
          28.68 KB
        • README_EN.md
          41.13 KB
        • RedGPT-Dataset-V1-CN.json.zip
          63.3 MB