
Covering 7 Million Question-Answer Pairs, Shanghai AI Lab Releases ChemLLM, With Professional Capabilities Comparable to GPT-4


With the rapid development of artificial intelligence, large language models (LLMs) have been widely applied in scientific research fields such as the life sciences, oceanography, and materials chemistry thanks to their powerful natural language processing capabilities. Yet although LLMs show promise in chemistry-related tasks such as molecular property prediction, molecular generation, and experimental design, they still perform poorly on many downstream chemical tasks.

The reason is that directly integrating chemical knowledge into language models faces three major challenges. First, most chemical information and knowledge is stored in structured databases; using these data directly to train LLMs can impair the model's ability to process natural language, degrading its dialogue and logical reasoning capabilities. Second, in cheminformatics, molecules are represented by special notations such as SMILES. This kind of data often does not follow the conventions of natural language, so conventional language models struggle to understand and generate such symbols correctly. Third, chemical data and tasks are highly diverse, making it very difficult to design a flexible training process that generalizes across a variety of chemical tasks.
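To make the second challenge concrete, the short sketch below runs a SMILES string through a general-purpose tokenizer; the choice of the GPT-2 tokenizer here is arbitrary and purely illustrative.

```python
# Illustrative only: show how a general-purpose subword tokenizer fragments
# a SMILES string, one reason plain LLMs struggle with molecular notation.
# Requires the `transformers` library; the gpt2 tokenizer is an arbitrary
# example of a tokenizer trained on natural-language text.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

aspirin = "CC(=O)OC1=CC=CC=C1C(=O)O"  # SMILES for aspirin
print(tokenizer.tokenize(aspirin))
# The string is split into subword fragments chosen for English text,
# scattering chemically meaningful units (ring-closure digits, branch
# parentheses, atom symbols) across unrelated tokens.
```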

In response, the Shanghai Artificial Intelligence Laboratory released ChemLLM, a chemical large language model. ChemLLM handles a wide variety of tasks across the chemistry discipline through fluent conversational interaction, performing on par with GPT-4 on core tasks and matching similarly sized LLMs in general scenarios. ChemLLM opens up new avenues for chemical research, and the team's approach of integrating structured chemical knowledge into a conversational system sets a new standard for developing LLMs in scientific fields.

The related research, titled "ChemLLM: A Chemical Large Language Model", has been published on arXiv. The results are open source and free for commercial use. HyperAI (hyper.ai) has launched "One-click deployment of the chemical large model ChemLLM-7B-chat"; a step-by-step tutorial appears at the end of this article~

Research highlights:

* Created and open-sourced the large-scale chemical dataset ChemData, as well as the Chinese and English versions of the ChemPref-10K dataset, the C-MHChem dataset, and the ChemBench4K chemical capability evaluation benchmark dataset

* Created and open-sourced ChemBench, a large-scale chemistry benchmark consisting of 4,100 multiple-choice questions across 9 specific tasks

* Demonstrated ChemLLM's strong chemical expertise and versatility through quantitative and qualitative evaluations

Paper address:
https://arxiv.org/abs/2402.06852

The tutorial for the chemical large model ChemLLM-7B-chat is now online at hyper.ai. Click the link to deploy it with one click:
https://go.hyper.ai/r31KV

Download address of ChemData chemical task dataset:
https://go.hyper.ai/zMJEl

The open-source project "awesome-ai4s" brings together more than 100 AI4S paper interpretations and provides a wealth of datasets and tools:
https://github.com/hyperai/awesome-ai4s

ChemData dataset: A large-scale chemical dataset covering 7 million question-answer pairs

The researchers collected chemical data from numerous online resource repositories, including PubChem, ChEMBL, ChEBI, and ZINC, and on this basis created ChemData, a large-scale dataset for fine-tuning ChemLLM.

The ChemData dataset uses a template-based instruction construction approach to convert structured chemical data into a natural conversational form suitable for training LLMs. The dataset contains 7 million question-answer pairs for instruction fine-tuning, covering a wide range of chemical domain knowledge and organized into molecule-related, reaction-related, and other chemistry-related task categories.

Among them, molecule-related tasks include Name Conversion, Caption2Mol, Mol2Caption, and Molecular Property Prediction, with the main purpose of tuning the language model's perception of chemical molecules.

Reaction-related tasks cover all aspects of chemical reactions, including Retrosynthesis, Product Prediction, Yield Prediction, Temperature Prediction, and Solvent Prediction. Data that cannot be clearly classified are grouped into other specific task types, thereby broadening ChemLLM's understanding of the entire chemical space. The figure below shows the proportion of data in these three task categories; a sketch of the template-based construction follows after it.

Composition of the ChemData dataset
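The paper's actual prompt templates are not reproduced here. As a rough illustration of the template-based construction described above, the following Python sketch turns a structured database record into a conversational question-answer pair; the template strings, task keys, and record fields are all hypothetical.

```python
import random

# Minimal sketch of template-based instruction construction in the spirit
# of ChemData. The templates and record fields below are hypothetical
# illustrations, not the paper's actual ones.
TEMPLATES = {
    "Mol2Caption": [
        "Describe the molecule with SMILES {smiles}.",
        "What can you tell me about the compound {smiles}?",
    ],
    "Molecular Property Prediction": [
        "Predict whether the molecule {smiles} is water-soluble.",
    ],
}

def build_qa_pair(task: str, record: dict) -> dict:
    """Turn one structured database record into a conversational QA pair."""
    template = random.choice(TEMPLATES[task])
    return {
        "task": task,
        "question": template.format(**record),
        "answer": record["answer"],
    }

record = {
    "smiles": "CC(=O)OC1=CC=CC=C1C(=O)O",
    "answer": "This is aspirin, a common analgesic and antipyretic.",
}
print(build_qa_pair("Mol2Caption", record))
```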

ChemLLM model architecture: Based on InternLM2-Base-7B, two-stage instruction fine-tuning

The chemical large language model ChemLLM was trained from the InternLM2-Base-7B base model through two-stage instruction fine-tuning, acquiring multiple chemical capabilities while retaining full natural language ability.

As shown in the figure below, in the first stage the research team used Multi-Corpus, a comprehensive corpus of 1.7 million question-answer pairs collected from Hugging Face, to improve the model's general language ability; the model obtained in this stage is named InternLM2-Chat-7B.

Schematic diagram of the two-stage instruction fine-tuning process of ChemLLM

In the second stage, the team fine-tuned the model on a mixture of ChemData and Multi-Corpus: ChemData to inject chemical knowledge, and Multi-Corpus to preserve the model's general capabilities. After the two stages of instruction fine-tuning, ChemLLM's versatility in the field of chemistry was markedly improved.
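As a rough illustration of the stage-two data mixture, the sketch below interleaves the two corpora with the Hugging Face `datasets` library; the file paths and the simple full-concatenation shuffle are assumptions, as the paper does not publish its loading code or exact mixing ratio.

```python
# Sketch of the stage-two data mixture (ChemData + Multi-Corpus).
# File paths and the plain concatenate-and-shuffle strategy are
# illustrative assumptions, not the paper's actual recipe.
from datasets import load_dataset, concatenate_datasets

chem = load_dataset("json", data_files="chemdata.jsonl", split="train")
general = load_dataset("json", data_files="multi_corpus.jsonl", split="train")

# Interleave chemical and general instructions so the model gains
# chemistry knowledge without losing its conversational ability.
stage2 = concatenate_datasets([chem, general]).shuffle(seed=42)
print(len(stage2), stage2[0])
```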

ChemBench Benchmark: Reducing the Impact of Language Model Output Style on Evaluation Results

Existing chemical LLM benchmarks are mostly presented as open-ended question answering and use BLEU and ROUGE as evaluation metrics. However, this type of evaluation is easily affected by the output style of the language model and is ill-suited to scenarios that emphasize the correctness of scientific facts.

On this basis, the research team built ChemBench, a chemical benchmark similar to mainstream evaluation sets such as MMLU and C-Eval. ChemBench includes 9 tasks on chemical molecules and reactions, aligned with the tasks in the ChemData dataset. It contains 4,100 multiple-choice questions, each with a single correct answer, which minimizes the impact of the language model's output style on evaluation results.
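To see why multiple-choice scoring is insensitive to output style, consider the minimal scorer below: only the extracted option letter is compared against the gold answer, so verbosity and phrasing cannot affect the score. The letter-extraction heuristic is an illustrative assumption, not ChemBench's official scoring code.

```python
import re
from typing import List, Optional

def extract_choice(model_output: str) -> Optional[str]:
    """Pull the first standalone option letter (A-D) out of a model reply."""
    match = re.search(r"\b([ABCD])\b", model_output)
    return match.group(1) if match else None

def accuracy(outputs: List[str], gold: List[str]) -> float:
    """Fraction of replies whose extracted letter matches the gold answer."""
    correct = sum(extract_choice(o) == g for o, g in zip(outputs, gold))
    return correct / len(gold)

print(accuracy(["The answer is B.", "I would choose (C)."], ["B", "D"]))  # 0.5
```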

It is worth mentioning that the benchmark is now available in the OpenCompass open-source project. The figure below shows the distribution of ChemBench's 9 tasks.

Distribution of the 9 tasks in the ChemBench benchmark

Research results: ChemLLM's chemistry expertise is comparable to GPT-4's and significantly better than general LLMs of similar size

The research team evaluated ChemLLM along both quantitative and qualitative dimensions. The quantitative assessment covers chemical ability and general ability, while the qualitative assessment focuses on performance in chemistry-related NLP (natural language processing) tasks.

In the chemical ability assessment, ChemBench serves as the benchmark for core chemistry capabilities, testing the model's expertise across 9 different tasks. As shown in the figure below, ChemLLM significantly outperforms general large language models (LLMs) of similar size and surpasses GPT-3.5 across the board. Compared with InternLM2-Chat-7B, ChemLLM's chemistry ability is markedly improved, indicating that the second-stage chemistry training is highly effective. ChemLLM also scores higher than GPT-4 on 6 of the 9 tasks.

ChemLLM Chemical Performance Evaluation Score

In the general competency assessment, the research team evaluated ChemLLM on four datasets: MMLU, C-Eval, GSM8K, and C-MHChem. MMLU is a benchmark covering STEM (science, technology, engineering, and mathematics), the humanities, and the social sciences, providing a broad assessment of interdisciplinary knowledge; C-Eval is a comprehensive Chinese benchmark spanning multiple subjects across four difficulty levels; GSM8K tests the mathematical ability of language models, with problems requiring 2-8 steps of basic arithmetic; and C-MHChem evaluates the model's grasp of basic chemical concepts, drawing mainly on Chinese middle and high school chemistry tests.

As shown in the figure below, ChemLLM achieves accuracies of 65.6 and 64.1 on the English MMLU and Chinese C-Eval benchmarks, respectively, demonstrating strong performance across a wide range of disciplines and in multilingual scenarios.

On GSM8K, ChemLLM reached an accuracy of 67.2, suggesting that fine-tuning on chemical data enhances the model's reasoning ability to a certain extent.

On C-MHChem, ChemLLM achieved an accuracy of 76.4, surpassing GPT-4 and demonstrating its command of Chinese middle and high school chemistry exam material.

ChemLLM General Performance Assessment Score

In the qualitative assessment, the research team evaluated ChemLLM on chemistry-related NLP tasks such as chemical poetry writing, text extraction, chemical literature translation, and answering ethics-related questions. The results show that ChemLLM brings deeper understanding and creative application of chemical knowledge to a variety of NLP tasks. The figures below show ChemLLM's performance on some of these tasks:

ChemLLM Chemical Poetry Writing
ChemLLM Chemical Information Extraction

The above results show that ChemLLM can handle various chemical tasks through natural, real-time conversation. Its chemical capabilities are comparable to GPT-4's, and it also performs well in other fields.


ChemLLM has since been upgraded: ChemLLM-1.5 integrates retrieval-augmented generation (RAG), supporting not only in-depth mining and understanding of chemical literature and online search, but also direct dialogue with ChemLLM about article content. The development of ChemLLM sets a precedent for LLMs in scientific fields, further accelerating chemical research in the AI era.

HyperAI (hyper.ai) has launched "One-click deployment of the chemical large model ChemLLM-7B-chat". Below is a step-by-step tutorial and a preview of the results. Let's explore it together~

One-click deployment of the chemical large model ChemLLM-7B-chat

Demo Run

1. Log in to hyper.ai, open the "Tutorials" page, select "One-click deployment of the Puke chemical large model ChemLLM-7B-chat Demo", and click "Run this tutorial online".

2. After the page jumps, click "Clone" in the upper right corner to clone the tutorial into your own container.

3. Click "Next: Select Hashrate" in the lower right corner.

4. After the page jumps, select "NVIDIA GeForce RTX 4090" and click "Next: Review". New users can register using the invitation link below to get 4 hours of RTX 4090 + 5 hours of CPU free time!

HyperAI exclusive invitation link (copy and open in browser):
https://openbayes.com/console/signup?r=6bJ0ljLFsFh_Vvej

5. Click "Continue" and wait for resources to be allocated; the first clone takes about 2 minutes. When the status changes to "Running", click the jump arrow next to "API Address" to open the "One-click deployment of the Puke chemical large model ChemLLM-7B-chat Demo" page. Note that users must complete real-name verification before using the API address access feature.

If the container remains in the "Allocating resources" state for more than 10 minutes, try stopping and restarting it. If restarting does not resolve the issue, contact platform customer service on the official website.
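If you prefer to run the model locally rather than through the one-click deployment, a minimal inference sketch with the `transformers` library is shown below; the Hugging Face repository id `AI4Chem/ChemLLM-7B-Chat` and the generation settings are assumptions to verify against the project's model card.

```python
# Local-inference sketch (not part of the HyperAI tutorial). The repo id
# and generation settings are assumptions; check the project's model card.
# trust_remote_code is typically required for InternLM2-based checkpoints.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "AI4Chem/ChemLLM-7B-Chat"  # assumed Hugging Face repo id
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto",
    trust_remote_code=True)

prompt = "What molecule does the SMILES CC(=O)OC1=CC=CC=C1C(=O)O represent?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```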

Effect Preview

Testing ethical dilemmas in drug development
