
Date: 2 years ago

Size: 242.89 MB

Paper URL: arxiv.org

Dataset Introduction

This dataset was open-sourced by the Shanghai Artificial Intelligence Laboratory in 2024 together with ChemLLM (the Pu Ke Chemical Large Model), its first large model for science. The accompanying paper is "ChemLLM: A Chemical Large Language Model". The release mainly consists of ChemData700K. The research team also open-sourced the Chinese and English versions of ChemBench-4K, ChemPref-10K, and the C-MHChem dataset.

ChemData700K dataset

ChemData700K is an instruction fine-tuning dataset for the chemistry capabilities of large language models. It includes 9 core chemistry tasks and 730K high-quality question-answer pairs, drawn as a 1/10 sample of a pool of 7 million entries. The dataset covers a wide range of chemical domain knowledge and is divided into 3 main task categories (molecules, reactions, and domains).

ChemBench4K benchmark dataset

ChemBench is a benchmark consisting of 9 tasks on chemical molecules and reactions, the same 9 tasks as in ChemData. It provides a basis for objectively measuring the chemistry proficiency of LLMs. ChemBench contains 4,100 multiple-choice questions, each with exactly one correct answer.
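Because every question has exactly one correct option, a benchmark of this kind is typically scored as plain accuracy. The following is a minimal sketch of that scoring, not the official evaluation code: the `ChoiceItem` schema, its field names, and the `accuracy` helper are assumptions made for illustration and do not reflect the released file format.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ChoiceItem:
    """One multiple-choice question (hypothetical schema, not the released format)."""
    question: str
    options: tuple[str, ...]  # option texts, labeled A, B, C, ... by position
    answer: str               # label of the single correct option, e.g. "B"


def accuracy(items: Sequence[ChoiceItem], predictions: Sequence[str]) -> float:
    """Return the fraction of items whose predicted label matches the correct label."""
    if len(items) != len(predictions):
        raise ValueError("need exactly one prediction per item")
    if not items:
        return 0.0
    # Compare labels case-insensitively and ignore surrounding whitespace.
    correct = sum(
        pred.strip().upper() == item.answer.strip().upper()
        for item, pred in zip(items, predictions)
    )
    return correct / len(items)
```

In practice the predicted label would be extracted from the model's free-text reply before calling `accuracy`; that extraction step is left out here because it depends on the prompt format used.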

ChemPref-10K dataset

This dataset can be used to align language models with human preferences and is available in both English and Chinese versions.

C-MHChem dataset

C-MHChem is a high-quality, fully manually written, multiple-choice test benchmark consisting of 600 questions collected from junior high school, high school, and college entrance examinations in various parts of China over the past 25 years.

Citation

@misc{zhang2024chemllm,
title={ChemLLM: A Chemical Large Language Model},
author={Di Zhang and Wei Liu and Qian Tan and Jingdan Chen and Hang Yan and Yuliang Yan and Jiatong Li and Weiran Huang and Xiangyu Yue and Dongzhan Zhou and Shufei Zhang and Mao Su and Han-Sen Zhong and Yuqiang Li and Wanli Ouyang},
year={2024},
eprint={2402.06852},
archivePrefix={arXiv},
primaryClass={cs.AI}
}

ChemLLM-Dataset.torrent

Seeding: 1 | Downloading: 0 | Completed: 272 | Total Downloads: 924

ChemLLM-Dataset/
- README.md
  2.09 KB
- README.txt
  4.18 KB

