
HH-RLHF Human Preference Dataset

Date: a month ago
Size: 90.35 MB
Organization: Anthropic
Publish URL: huggingface.co
Paper URL: arXiv:2209.07858
License: MIT

This dataset supports online use.

HH-RLHF is a human preference dataset released by Anthropic in 2022. It consists of two main parts.

Dataset composition:

  • Helpful/harmless human preference data (PM data):
    • The associated paper is "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback", which uses human preferences to align a dialogue model to be both helpful and harmless.
    • This part consists of paired response comparisons (each sample contains a chosen and a rejected response), covering helpfulness (from base, rejection-sampled, and online sources) and harmlessness (base). Because the format is bare comparison pairs, it is not meant for direct SFT; it is suited to RLHF/DPO training, reward-model building, and response-quality comparison and evaluation (see the loading sketch after this list).
  • Red team conversation data (non-PM data):
    • The associated paper is "Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned", which studies attack types and manifestations of harm to help reduce the harmfulness of models.
    • This part consists of complete red-team conversation transcripts with metadata, including transcript, min_harmlessness_score_transcript, model_type, rating, task_description, tags, and more. The data is close to real red-teaming workflows and is richly annotated. It is not meant for preference modeling or SFT; it is suited to safety-alignment analysis, red-team evaluation, harm-type induction, and policy improvement (see the loading sketch after this list).

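Reading the comparison pairs takes only a few lines. Below is a minimal sketch using the Hugging Face `datasets` library; the repository id `Anthropic/hh-rlhf`, the `data_dir` names, and the split names follow the public dataset card and should be treated as assumptions if that card changes.

```python
from datasets import load_dataset

# Default configuration: combined helpful + harmless comparison pairs,
# each row holding a "chosen" and a "rejected" multi-turn transcript.
pm_data = load_dataset("Anthropic/hh-rlhf", split="train")

example = pm_data[0]
print(example["chosen"][:200])    # preferred transcript (truncated)
print(example["rejected"][:200])  # dispreferred transcript for the same prompt

# Individual subsets can be selected with data_dir,
# e.g. the harmlessness base split:
harmless = load_dataset("Anthropic/hh-rlhf", data_dir="harmless-base", split="train")
print(len(harmless))
```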
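The red-team transcripts live in a separate directory of the same repository. A minimal sketch, assuming the `red-team-attempts` data_dir and the field names listed above:

```python
from datasets import load_dataset

# Red-team conversations with per-transcript metadata; the field names
# below follow the description above and the dataset card.
red_team = load_dataset("Anthropic/hh-rlhf", data_dir="red-team-attempts", split="train")

row = red_team[0]
print(row["transcript"][:300])                   # full attack conversation (truncated)
print(row["min_harmlessness_score_transcript"])  # lower score = more harmful transcript
print(row["model_type"], row["rating"])
print(row["task_description"])
print(row["tags"])
```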
HH-RLHF.torrent
  • HH-RLHF/
    • README.md (1.98 KB)
    • README.txt (3.96 KB)
    • data/
      • HH-RLHF.zip (90.35 MB)