
HH-RLHF Human Preference Dataset

Date: a month ago
Size: 90.35 MB
Organization: Anthropic
Publish URL: huggingface.co
Paper URL: arXiv:2209.07858
License: MIT

This dataset supports online use.

HH-RLHF is a human preference dataset released by Anthropic in 2022. It consists of two main parts.

Dataset composition:

  • Helpful/harmless human preference data (PM data):
    • The associated paper is "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback", which uses human preferences to align a dialogue model to be both helpful and harmless.
    • This part consists of paired response comparisons (each sample contains a chosen and a rejected response), covering helpfulness (from base, rejection-sampled, and online sources) and harmlessness (base). Because the format is bare comparison pairs, it is not meant for direct SFT; it is suited to RLHF/DPO training, reward-model building, and response-quality comparison and evaluation (see the loading sketch after this list).
  • Red team conversation data (non-PM data):
    • The associated paper is "Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned", which studies attack types and manifestations of harm to help reduce the harmfulness of models.
    • This part consists of complete red-team conversation transcripts with metadata, including transcript, min_harmlessness_score_transcript, model_type, rating, task_description, tags, and more. The data is close to real red-teaming workflows and is richly annotated. It is not meant for preference modeling or SFT; it is suited to safety-alignment analysis, red-team evaluation, harm-type induction, and policy improvement (see the loading sketch after this list).

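Reading the comparison pairs takes only a few lines. Below is a minimal sketch using the Hugging Face `datasets` library; the repository id `Anthropic/hh-rlhf`, the `data_dir` names, and the split names follow the public dataset card and should be treated as assumptions if that card changes.

```python
from datasets import load_dataset

# Default configuration: combined helpful + harmless comparison pairs,
# each row holding a "chosen" and a "rejected" multi-turn transcript.
pm_data = load_dataset("Anthropic/hh-rlhf", split="train")

example = pm_data[0]
print(example["chosen"][:200])    # preferred transcript (truncated)
print(example["rejected"][:200])  # dispreferred transcript for the same prompt

# Individual subsets can be selected with data_dir,
# e.g. the harmlessness base split:
harmless = load_dataset("Anthropic/hh-rlhf", data_dir="harmless-base", split="train")
print(len(harmless))
```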
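The red-team transcripts live in a separate directory of the same repository. A minimal sketch, assuming the `red-team-attempts` data_dir and the field names listed above:

```python
from datasets import load_dataset

# Red-team conversations with per-transcript metadata; the field names
# below follow the description above and the dataset card.
red_team = load_dataset("Anthropic/hh-rlhf", data_dir="red-team-attempts", split="train")

row = red_team[0]
print(row["transcript"][:300])                   # full attack conversation (truncated)
print(row["min_harmlessness_score_transcript"])  # lower score = more harmful transcript
print(row["model_type"], row["rating"])
print(row["task_description"])
print(row["tags"])
```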
HH-RLHF.torrent
  • HH-RLHF/
    • README.md (1.98 KB)
    • README.txt (3.96 KB)
    • data/
      • HH-RLHF.zip (90.35 MB)