HyperAI

ChemData Chemical Task Dataset

Date: a year ago

Size: 242.89 MB

Organization: Shanghai Artificial Intelligence Laboratory

Publish URL: huggingface.co


Dataset Introduction

This dataset was open-sourced by the Shanghai Artificial Intelligence Laboratory in 2024 together with its first large scientific model, the Pu Ke chemistry large language model (ChemLLM). The accompanying paper is "ChemLLM: A Chemical Large Language Model".

The release consists mainly of ChemData700K. The research team also open-sourced the Chinese and English versions of ChemBench-4K, ChemPref-10K, and the C-MHChem dataset.

ChemData700K dataset

ChemData700K is an instruction fine-tuning dataset for the chemistry capabilities of large language models. It covers 9 core chemistry tasks and contains 730K high-quality question-answer pairs, sampled as 1/10 of a 7-million-entry corpus. The dataset spans a wide range of chemical domain knowledge and is divided into 3 main task categories (molecules, reactions, and domains).

ChemBench4K benchmark dataset

ChemBench is an innovative benchmark consisting of 9 tasks on chemical molecules and reactions, the same 9 tasks as in ChemData. It provides a basis for objectively measuring the chemistry proficiency of LLMs. ChemBench contains 4,100 multiple-choice questions, each with exactly one correct answer.
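As a rough illustration of how a benchmark made of single-answer multiple-choice questions can be scored, the sketch below computes plain accuracy. The function name `accuracy`, the letter-label format, and the toy data are assumptions made for this example; they are not taken from the actual ChemBench schema or evaluation code.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Fraction of questions where the predicted option matches the single correct one."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("no questions to score")
    # Compare case-insensitively and ignore surrounding whitespace, e.g. " b" == "B".
    correct = sum(
        p.strip().upper() == a.strip().upper()
        for p, a in zip(predictions, answers)
    )
    return correct / len(answers)


# Hypothetical toy data: option letters chosen by a model vs. the answer key.
model_choices = ["A", "c", "B", "D"]
answer_key = ["A", "C", "D", "D"]
print(accuracy(model_choices, answer_key))  # 3 of 4 match
```

Because each question has exactly one correct option, a simple exact-match comparison of option labels is enough here; no partial credit is involved.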

ChemPref-10K dataset

This dataset can be used to optimize language models to match human preferences and contains both English and Chinese versions.

C-MHChem dataset

C-MHChem is a high-quality, fully manually written, multiple-choice test benchmark consisting of 600 questions collected from junior high school, high school, and college entrance examinations in various parts of China over the past 25 years.

ChemLLM-Dataset.torrent
  • ChemLLM-Dataset/
    • README.md (2.09 KB)
    • README.txt (4.18 KB)
    • data/
      • chem.zip (242.89 MB)
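
After the download completes, the archive contents can be inspected before extracting anything. The following is a minimal sketch using Python's standard `zipfile` module; the helper name `list_archive` and the local path are assumptions for illustration, and nothing here presumes what files `chem.zip` actually contains.

```python
import zipfile
from pathlib import Path


def list_archive(path: Path) -> list[tuple[str, int]]:
    """Return (member name, uncompressed size in bytes) for each entry in a zip file."""
    with zipfile.ZipFile(path) as archive:
        return [(info.filename, info.file_size) for info in archive.infolist()]


if __name__ == "__main__":
    # Assumed location after the torrent download; adjust to the real path.
    archive_path = Path("ChemLLM-Dataset/data/chem.zip")
    if archive_path.exists():
        for name, size in list_archive(archive_path):
            print(f"{size:>12}  {name}")
    else:
        print(f"{archive_path} not found")
```

Listing members first makes it easy to check file names and sizes before unpacking a large archive.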