HyperAI

Luo Xiaozhou's Team at the Chinese Academy of Sciences Proposes UniKP, a Framework Combining a Large Model and Machine Learning to Predict Enzyme Kinetic Parameters With High Precision

Author: Li Baozhu

Editor: Sanyang

Luo Xiaozhou's team at the Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, has proposed UniKP, a framework that predicts several different enzyme kinetic parameters.

As is well known, metabolism in organisms proceeds through a wide range of chemical reactions. Carried out in vitro, these reactions usually require harsh conditions such as high temperature, high pressure, strong acid, or strong alkali.

In living organisms, however, metabolic reactions proceed efficiently under extremely mild conditions, thanks mainly to an important class of organic catalysts: enzymes.

As a heavily tested topic in high school biology, the properties of enzymes are probably etched in everyone's memory: high catalytic efficiency, strong specificity, mild reaction conditions, and so on. More importantly, enzymes are closely linked to many human diseases and can also be used for diagnosis and treatment. Researchers have long studied the molecular structure and function of enzymes while continuing to explore the factors that affect enzymatic reactions.

The science that studies the rate of enzyme reactions, and the mechanisms by which various factors affect that rate, is called "enzyme reaction kinetics". In research, the catalytic efficiency of an enzyme in a specific reaction is usually measured by enzyme kinetic parameters.

The kinetic parameters of enzyme-catalyzed reactions include the enzyme turnover number kcat, the Michaelis constant Km, and the catalytic efficiency kcat/Km. Currently, these parameters are measured mainly through wet-lab experiments, but the process is time-consuming and costly, so databases of experimentally measured enzyme kinetic parameters remain relatively small. This scarcity of data limits the development of downstream fields such as systems biology and metabolic engineering.
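To make the three parameters concrete, here is a minimal Python sketch of the standard Michaelis-Menten rate law, v = kcat·[E]·[S] / (Km + [S]). The function names and all numeric values are illustrative assumptions, not data from the study.

```python
def michaelis_menten_rate(kcat, km, enzyme_conc, substrate_conc):
    """Initial reaction rate under the Michaelis-Menten model.

    kcat: turnover number (1/s); km: Michaelis constant,
    in the same concentration unit as substrate_conc.
    """
    return kcat * enzyme_conc * substrate_conc / (km + substrate_conc)


def catalytic_efficiency(kcat, km):
    """Catalytic efficiency kcat/Km."""
    return kcat / km


# Hypothetical values chosen only for illustration.
kcat, km = 100.0, 0.5   # 1/s, mM
enzyme = 0.001          # mM

# When [S] equals Km, the rate is half of the maximum rate kcat * [E].
half_max = michaelis_menten_rate(kcat, km, enzyme, substrate_conc=km)
v_max = kcat * enzyme
print(half_max, v_max / 2, catalytic_efficiency(kcat, km))
```

A high kcat with a low Km gives a high kcat/Km, which is why the ratio is used as a single measure of catalytic efficiency.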

In view of this, Luo Xiaozhou's team from the Institute of Synthetic Biology, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, proposed an enzyme kinetic parameter prediction framework (UniKP) based on a pre-trained large language model and a machine learning model.

Given only the amino acid sequence of the enzyme and the structural information of the substrate, the framework can predict a variety of enzyme kinetic parameters. The research team also took environmental factors into account and, building on UniKP, proposed a two-layer framework, EF-UniKP, which predicts enzyme kinetic parameters more accurately.

The research results have been published in Nature Communications.

Paper link:
https://www.nature.com/articles/s41467-023-44113-1
GitHub link:
https://github.com/Luo-SynBioLab/UniKP

Representative data sets validate model value

The research team selected four representative datasets to verify the performance and value of UniKP.

The first is the DLKcat dataset. The researchers screened 16,838 samples, covering 7,822 unique protein sequences and 2,672 unique substrates from 851 organisms. The dataset was divided into training and test sets in a 9:1 ratio.

Next are the pH and temperature datasets. The pH dataset contains 636 samples, consisting of 261 unique enzyme sequences and 331 unique substrates; the temperature dataset contains 572 samples, consisting of 243 unique enzyme sequences and 302 unique substrates. Each dataset is divided into a training set and a test set in a ratio of 8:2.

The third is the Michaelis constant (Km) dataset. It consists of 11,722 samples, each including an enzyme sequence, a substrate molecular fingerprint, and the corresponding Km value. The dataset is divided into a training set and a test set in a ratio of 8:2.

The fourth is the kcat/Km dataset, which contains 910 samples, each consisting of an enzyme sequence, a substrate structure, and the corresponding kcat/Km value.
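The train/test ratios quoted above can be reproduced with a simple random split. The following is a minimal sketch; the helper name, the fixed seed, and the integer IDs standing in for the 636 pH samples are assumptions made for illustration, and the authors' actual splitting code may differ.

```python
import random


def split_dataset(samples, train_fraction, seed=0):
    """Shuffle samples and split them into (train, test) lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


# Placeholder sample IDs standing in for the 636 pH samples.
ph_samples = list(range(636))
train, test = split_dataset(ph_samples, train_fraction=0.8)
print(len(train), len(test))
```

Passing `train_fraction=0.9` would give the 9:1 split described for the DLKcat dataset.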

Two key components: representation module + machine learning module

UniKP, proposed by the research team, improves the accuracy of kcat, Km, and kcat/Km predictions from a given enzyme sequence and substrate structure. The UniKP framework consists of two key components: a representation module and a machine learning module.

The role of the representation module is to convert complex enzyme and substrate information into vector representations that the machine learning model can understand and process. This allows the subsequent machine learning module to perform prediction and analysis.


The enzyme sequence representation module uses the pre-trained language model ProtT5-XL-UniRef50 to encode the enzyme information. The model converts each amino acid into a 1,024-dimensional vector, and these vectors are averaged by mean pooling to produce a single 1,024-dimensional vector that represents the sequence information of the entire enzyme (as shown in the figure above).
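The mean pooling step can be sketched in a few lines of NumPy. Random numbers stand in for the per-residue ProtT5 outputs here, and the sequence length of 350 is an arbitrary assumption; the sketch only illustrates how a variable-length sequence collapses into one fixed-size vector.

```python
import numpy as np


def mean_pool(residue_embeddings):
    """Average per-residue vectors (L x D) into one per-sequence vector (D,)."""
    return np.asarray(residue_embeddings).mean(axis=0)


# Random numbers stand in for ProtT5 outputs of a 350-residue enzyme.
rng = np.random.default_rng(0)
per_residue = rng.normal(size=(350, 1024))
enzyme_vector = mean_pool(per_residue)
print(enzyme_vector.shape)
```

Because the average is taken over the residue axis, enzymes of any length map to a vector of the same dimension, which is what lets a fixed-input model consume them.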

The substrate structure representation module uses a pre-trained language model, the SMILES Transformer, to encode the substrate information. The substrate structure is converted into SMILES format and passed through the pre-trained SMILES Transformer. The mean-pooled and max-pooled outputs of the last layer are combined with the first outputs of the last and second-to-last layers, producing a 1,024-dimensional vector that represents the structural information of the substrate (as shown in the figure above).
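One way to read the substrate pooling step is as a concatenation of four pieces: the mean-pooled and max-pooled outputs of the last layer, plus the first outputs of the last and second-to-last layers. The sketch below assumes a hidden size of 256 so that the four pieces add up to 1,024 dimensions. That size, the function name, the token count, and the random inputs are all assumptions for illustration.

```python
import numpy as np


def substrate_fingerprint(last_layer, penultimate_layer):
    """Concatenate four pooled views of token-level outputs (T x H each)."""
    last_layer = np.asarray(last_layer)
    penultimate_layer = np.asarray(penultimate_layer)
    return np.concatenate([
        last_layer.mean(axis=0),   # mean pooling over tokens, last layer
        last_layer.max(axis=0),    # max pooling over tokens, last layer
        last_layer[0],             # first output of the last layer
        penultimate_layer[0],      # first output of the second-to-last layer
    ])


rng = np.random.default_rng(0)
tokens, hidden = 40, 256           # assumed SMILES length and hidden size
last = rng.normal(size=(tokens, hidden))
penultimate = rng.normal(size=(tokens, hidden))
substrate_vector = substrate_fingerprint(last, penultimate)
print(substrate_vector.shape)
```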

For the machine learning module, the research team compared 16 different machine learning models and two representative deep learning models: convolutional neural networks and recurrent neural networks.

The results show that ensemble models perform better. Random forests and extra trees in particular significantly outperform the other models, with extra trees performing best (R²=0.65). As shown in the figure above, the machine learning model takes the concatenated representation vector as input and outputs the predicted kcat, Km, or kcat/Km value.
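The flow from representation vectors to a prediction might look like the following sketch, which uses scikit-learn's `ExtraTreesRegressor`. The random feature vectors, the synthetic targets, the sample count, and the hyperparameters are all assumptions made so the example runs on its own; they are not the authors' data or settings.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

rng = np.random.default_rng(0)
n_samples = 200

# Random placeholders for the 1,024-dim enzyme and substrate vectors.
enzyme_vecs = rng.normal(size=(n_samples, 1024))
substrate_vecs = rng.normal(size=(n_samples, 1024))
targets = rng.normal(size=n_samples)   # synthetic targets, not real data

# Concatenate the two representations into one 2,048-dim input per sample.
features = np.concatenate([enzyme_vecs, substrate_vecs], axis=1)

# Fit on the first 160 samples and predict the remaining 40.
model = ExtraTreesRegressor(n_estimators=50, random_state=0)
model.fit(features[:160], targets[:160])
predictions = model.predict(features[160:])
print(features.shape, predictions.shape)
```

With random inputs the predictions carry no meaning; the point is only the shape of the pipeline: two fixed-size vectors are joined and handed to a tree ensemble regressor.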

In addition, the researchers took environmental factors into account, generated an optimized prediction framework, and validated it on two datasets covering pH and temperature information (as shown in the figure above).

Finally, UniKP adjusts the sample weight distribution through different reweighting methods to improve its predictions on high-value prediction tasks (as shown in the figure above).

Two-layer framework: EF-UniKP

As a two-layer framework, EF-UniKP consists of a base layer and a meta layer, as shown in the following figure:

EF-UniKP Architecture

The base layer contains two independent models: UniKP and Revised UniKP. UniKP takes the concatenated representation vector of the protein and substrate as input, while Revised UniKP takes the same concatenated vector combined with the pH or temperature value.

The meta layer consists of a linear regression model that uses the kcat values predicted by UniKP and Revised UniKP to predict the final kcat value.
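The two-layer idea can be sketched with ordinary least squares in NumPy. In this sketch the two base-layer predictions are simulated as noisy copies of a synthetic target; the noise levels, variable names, and the choice to fit and evaluate on the same samples are assumptions made to keep the example short. In practice the meta layer would be fitted on predictions for data the base models had not seen.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100

# Synthetic stand-ins: a "true" value and two noisy base-layer predictions.
true_kcat = rng.normal(size=n)
pred_unikp = true_kcat + rng.normal(scale=0.5, size=n)
pred_revised = true_kcat + rng.normal(scale=0.3, size=n)

# Meta layer: ordinary least squares on [pred_unikp, pred_revised, 1].
design = np.column_stack([pred_unikp, pred_revised, np.ones(n)])
coef, *_ = np.linalg.lstsq(design, true_kcat, rcond=None)
final_pred = design @ coef


def mse(pred):
    """Mean squared error against the synthetic target."""
    return float(np.mean((pred - true_kcat) ** 2))


print(coef, mse(final_pred), mse(pred_unikp), mse(pred_revised))
```

On the samples used for fitting, the combined prediction can never have a higher squared error than either base prediction, because using one base model alone is a special case of the linear combination.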

R² 20% higher, EF-UniKP comes out ahead

The research team validated the UniKP framework on the kcat prediction task using the DLKcat dataset, which contains 16,838 samples. Across five rounds of validation on randomly split test sets, UniKP achieved an R² of 0.68, 20% higher than DLKcat. In addition, DLKcat's highest value across these tests was 16% lower than UniKP's lowest value, further demonstrating the robustness of UniKP.
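For readers who want to reproduce this kind of comparison, here is a minimal sketch of the coefficient of determination (R²) and of a relative improvement between two scores. The toy numbers are illustrative only, and treating the "20% higher" figure as a relative gain is an assumption, since the article does not say how the percentage was computed.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def relative_gain(new, baseline):
    """Relative improvement of `new` over `baseline`, as a fraction."""
    return (new - baseline) / baseline


# Toy numbers for illustration only.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 5.0]
print(r_squared(y_true, y_pred))
```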

UniKP's performance in kcat prediction


The research team then created two datasets covering pH and temperature information to evaluate EF-UniKP, and divided each into training and test sets in a ratio of 8:2.

On the test set, EF-UniKP performs better than both UniKP and Revised UniKP. On the pH dataset, the R² of EF-UniKP is 20% and 8% higher than that of UniKP and Revised UniKP, respectively; on the temperature dataset, it is 26% and 2% higher. In the test where at least one of the enzyme and substrate does not appear in the training set, the R² of EF-UniKP is 13% and 10% higher than UniKP and Revised UniKP on the pH dataset, and 16% and 4% higher on the temperature dataset.
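The stricter test, in which at least one of the enzyme and substrate is absent from the training set, amounts to filtering the test set. A minimal sketch follows; the function name and the (enzyme_id, substrate_id) pairs are hypothetical and only show the selection logic.

```python
def unseen_subset(train_pairs, test_pairs):
    """Keep test pairs whose enzyme or substrate never appears in training."""
    seen_enzymes = {enzyme for enzyme, _ in train_pairs}
    seen_substrates = {substrate for _, substrate in train_pairs}
    return [
        (enzyme, substrate)
        for enzyme, substrate in test_pairs
        if enzyme not in seen_enzymes or substrate not in seen_substrates
    ]


# Hypothetical (enzyme_id, substrate_id) pairs.
train_pairs = [("E1", "S1"), ("E2", "S2")]
test_pairs = [("E1", "S2"), ("E3", "S1"), ("E2", "S9")]
print(unseen_subset(train_pairs, test_pairs))
```

The pair ("E1", "S2") is dropped because both of its members appear in training, even though that exact combination does not.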

EF-UniKP performs better than UniKP and Revised UniKP

Butterfly model: integrating scientific research and industry

The Shenzhen Institutes of Advanced Technology of the Chinese Academy of Sciences (hereinafter "Shenzhen Advanced Institute"), home to Luo Xiaozhou's research group, was jointly established by the Chinese Academy of Sciences, the Shenzhen Municipal People's Government, and the Chinese University of Hong Kong in February 2006. It consists of 8 research institutes:

* Shenzhen Institute of Advanced Integrated Technology, Chinese Academy of Sciences and The Chinese University of Hong Kong

* Institute of Biomedical and Health Engineering

* Institute of Advanced Computing and Digital Engineering

* Institute of Biomedicine and Technology

* Institute of Brain Cognition and Brain Diseases

* Institute of Synthetic Biology

* Institute of Advanced Materials Science and Engineering

* Carbon Neutrality Technology Research Institute (Preparatory)

Dr. Luo Xiaozhou completed his postdoctoral research at the University of California, Berkeley in 2019, then returned to China and officially joined the Institute of Synthetic Biology of the Shenzhen Institutes of Advanced Technology as a researcher. In the same year, "Senruis Bio", which he helped found as one of its partners, was officially established in Shenzhen, focusing on the research and development of synthetic biology technology and its innovative applications in various fields. In March 2022, the company completed a Series A financing round of nearly 100 million yuan.

Dr. Luo Xiaozhou's path of balancing "research" and "industry" fits closely with the mission of the Shenzhen Institutes of Advanced Technology, which pioneered the "0-1-10-∞ butterfly model". The model has also been put into practice at Senruis Bio.

After discovering that liquid rubber HVR and the cannabinoid CBD could share the same proprietary chassis cells, Senruis applied several methods it had developed earlier for engineering brewer's yeast, together with its in-house library of synthetic biology parts, and raised the production of liquid rubber HVR to commercially viable levels within 6 months.

Dr. Luo Xiaozhou, working with his mentor, Academician Jay D. Keasling, who is also one of the founders of Senruis, successfully established the biosynthetic pathway for cannabinoids in 2019, which became the basis for its commercialization.

Luo Xiaozhou said that two factors are key to the rapid industrialization of pipelines. The first is the deep integration of academia and industry: the academic community efficiently builds 0-1 synthetic pathways for the compounds that industry needs. The second is standardized production processes and tools: covering the three stages from 0-1 academic research through 1-10 engineering research and development to 10-∞ industrial scale-up, the aim is to build a synthetic biology production line and raise research and development efficiency in the 1-10 stage.
