HyperAI

Selected for NeurIPS 24! Zhejiang University Team Proposed a New Denoising Protein Language Model DePLM, Which Predicts Mutation Effects Better Than SOTA Models


Proteins are the main carriers of biological function, and the diversity of structure and function they have acquired over billions of years of evolution has opened up important opportunities in fields such as drug discovery and materials science. However, the inherent properties of existing proteins (such as thermal stability) often fall short of practical needs. Researchers therefore work on optimizing proteins to enhance these properties.

Traditional deep mutational scanning (DMS) and directed evolution (DE) rely on expensive wet-lab experiments. In contrast, machine learning-based methods can quickly evaluate mutation effects, which is crucial for efficient protein optimization. Among them, a widely used approach is to use evolutionary information to estimate the effects of mutations. Evolutionary information makes it possible to infer the effect of a mutation from the probability of an amino acid appearing at a given position in a protein sequence. To calculate the relative probability of mutating one amino acid to another, mainstream methods use protein language models (PLMs) trained on millions of protein sequences to capture evolutionary information in a self-supervised manner.
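To make the idea of scoring a mutation by relative probability concrete, here is a minimal sketch. It assumes one common choice of score, the log-ratio of the mutant and wild-type probabilities at a position; the function name `mutation_score` and the toy probabilities are illustrative only and do not come from an actual PLM.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def mutation_score(position_probs, wild_type, mutant):
    """Log-ratio of mutant vs. wild-type probability at one position.

    Higher (less negative) scores mean the mutant residue is judged
    more plausible relative to the wild type.
    """
    return math.log(position_probs[mutant]) - math.log(position_probs[wild_type])

# Toy distribution over the 20 amino acids at a single position.
# These numbers are made up for illustration, not model output.
probs = {aa: 0.01 for aa in AMINO_ACIDS}
probs["L"] = 0.60  # wild-type residue
probs["I"] = 0.20  # chemically similar substitution
probs["D"] = 0.03  # dissimilar substitution

score_l_to_i = mutation_score(probs, "L", "I")
score_l_to_d = mutation_score(probs, "L", "D")
print(f"L->I: {score_l_to_i:.3f}, L->D: {score_l_to_d:.3f}")
```

In this toy case the conservative substitution L to I scores higher than L to D, which is the kind of ordering such scores are meant to provide.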

However, existing approaches often overlook two key aspects. First, they fail to remove irrelevant evolutionary information. Evolution optimizes multiple properties simultaneously to meet survival needs, which often obscures the optimization of the target property. Second, the current mainstream learning objectives contain dataset-specific information, so models tend to overfit to the training data at hand, which limits their ability to generalize to new proteins.

To address these challenges, Professor Chen Huajun, Dr. Zhang Qiang and others from the School of Computer Science and Technology, Zhejiang University, Zhejiang University International College, and Zhejiang University Hangzhou International Science and Technology Innovation Center jointly proposed DePLM, a new denoising protein language model for protein optimization. The key idea is to treat the evolutionary information (EI) captured by a protein language model as a mixture of property-relevant and irrelevant information, where the irrelevant information acts like "noise" on the target property and therefore needs to be removed. Extensive experiments show that the rank-based denoising process proposed in this study significantly improves protein optimization performance while maintaining strong generalization capabilities.

The related results were selected for the top conference NeurIPS 24 under the title "DePLM: Denoising Protein Language Models for Property Optimization".

Research highlights:

* DePLM effectively filters out irrelevant information and improves protein optimization by refining the evolutionary information contained in PLMs.

* The study designs a rank-based forward process within a denoising diffusion framework, extending the diffusion process to the ranking space of mutation likelihoods and changing the learning objective from minimizing numerical error to maximizing ranking relevance, which promotes dataset-independent learning and strong generalization.

* Extensive experimental results show that DePLM not only outperforms the current state-of-the-art models in predicting mutation effects, but also exhibits strong generalization capabilities for new proteins.


Paper address:
https://neurips.cc/virtual/2024/poster/95517 


ProteinGym protein mutation dataset download:
https://hyper.ai/datasets/32818

The open source project "awesome-ai4s" brings together more than 100 AI4S paper interpretations and provides massive data sets and tools:

https://github.com/hyperai/awesome-ai4s

Dataset: An extensive collection of deep mutational scanning experiments

ProteinGym is an extensive collection of deep mutational scanning (DMS) experiments containing 217 datasets. Because of the input length limit of PLMs, the researchers excluded datasets whose wild-type protein is longer than 1,024 residues, leaving 201 DMS datasets. ProteinGym groups DMS assays into five coarse categories: 66 for stability, 69 for fitness, 16 for expression, 12 for binding, and 38 for activity.

* Performance comparison experiment: The researchers used randomized cross-validation, in which each mutation in a dataset is randomly assigned to one of five folds, and model performance is evaluated by averaging the results over these five folds.

* Generalization ability experiment: Given a test dataset, the researchers randomly select up to 40 datasets that share its optimization target (such as thermal stability) as training data, ensuring that the sequence similarity between the training proteins and the test protein is below 50% to avoid data leakage.
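The two evaluation setups above can be sketched as follows. This is a simplified illustration, not the authors' code: the function names, the dictionary keys `property` and `identity_to_test`, and the near-equal fold sizes are all assumptions made for the example.

```python
import random

def assign_folds(n_mutations, n_folds=5, seed=0):
    """Randomly assign each mutation index to one of n_folds folds.

    Shuffling and dealing round-robin keeps fold sizes within one of
    each other (an assumption; the article only says "randomly assigned").
    """
    rng = random.Random(seed)
    indices = list(range(n_mutations))
    rng.shuffle(indices)
    folds = [[] for _ in range(n_folds)]
    for position, idx in enumerate(indices):
        folds[position % n_folds].append(idx)
    return folds

def select_training_sets(candidates, target_property,
                         max_sets=40, max_identity=0.5, seed=0):
    """Randomly pick up to max_sets datasets with the same target property
    whose protein is less than max_identity similar to the test protein."""
    eligible = [c for c in candidates
                if c["property"] == target_property
                and c["identity_to_test"] < max_identity]
    rng = random.Random(seed)
    rng.shuffle(eligible)
    return eligible[:max_sets]

# Hypothetical candidate datasets for a stability test set.
candidates = [
    {"name": "assay_A", "property": "stability", "identity_to_test": 0.12},
    {"name": "assay_B", "property": "stability", "identity_to_test": 0.81},
    {"name": "assay_C", "property": "binding", "identity_to_test": 0.05},
    {"name": "assay_D", "property": "stability", "identity_to_test": 0.33},
]
folds = assign_folds(n_mutations=23)
training_sets = select_training_sets(candidates, "stability")
print([len(f) for f in folds], sorted(c["name"] for c in training_sets))
```

In the cross-validation setting, a model would be trained on four folds and scored on the fifth, with the five scores averaged; in the generalization setting, only the selected datasets are used for training.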

Model Architecture: Extending Diffusion Models with a Forward Process in Ranking Space

As mentioned above, the core of DePLM is to treat the evolutionary information (EI) captured by the protein language model as a mixture of property-relevant and irrelevant information, where the irrelevant information acts like "noise" on the target property, and to remove this "noise". To achieve this, the researchers drew inspiration from denoising diffusion models, which generate the desired output by progressively refining a noisy input.

Specifically, the researchers designed a forward process based on ranking information to extend the diffusion model to denoising evolutionary information, as shown in the figure below. On the left of the figure, DePLM takes the evolutionary likelihood derived from the PLM as input and generates a denoised, property-specific likelihood to predict the effect of mutations. In the middle and on the right, a feature encoder generates representations of the protein that take both primary and tertiary structure into account, and the denoising module uses these representations to filter the noise in the likelihood.


DePLM Architecture Overview

Denoising diffusion models consist of two main processes: a forward diffusion process and a reverse denoising process that has to be learned. In the forward diffusion process, a small amount of noise is gradually added to the ground truth; the reverse denoising process then learns to recover the ground truth by gradually removing the accumulated noise.
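For readers unfamiliar with diffusion models, the forward process can be illustrated with the standard Gaussian formulation, in which each step computes x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * noise. The sketch below shows only this conventional forward process, not DePLM's rank-based variant; the noise schedule and input values are arbitrary choices for illustration.

```python
import math
import random

def forward_diffusion(x0, betas, seed=0):
    """Apply the standard Gaussian forward process step by step.

    Returns the full trajectory [x_0, x_1, ..., x_T].
    """
    rng = random.Random(seed)
    x = list(x0)
    trajectory = [list(x)]
    for beta in betas:
        x = [math.sqrt(1.0 - beta) * v + math.sqrt(beta) * rng.gauss(0.0, 1.0)
             for v in x]
        trajectory.append(list(x))
    return trajectory

def signal_retention(betas):
    """Weight of the original x_0 that remains in x_t after each step,
    i.e. the square root of the running product of (1 - beta)."""
    kept, weights = 1.0, []
    for beta in betas:
        kept *= 1.0 - beta
        weights.append(math.sqrt(kept))
    return weights

betas = [0.1] * 20                 # arbitrary constant noise schedule
x0 = [1.0, -0.5, 2.0]              # arbitrary "ground truth" values
trajectory = forward_diffusion(x0, betas)
retention = signal_retention(betas)
print(f"signal weight after 1 step: {retention[0]:.3f}, "
      f"after 20 steps: {retention[-1]:.3f}")
```

The shrinking signal weight shows how the original values are progressively buried in noise; a reverse model is trained to undo these steps one at a time.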

However, there are two major challenges when applying these models to denoising mutation probabilities in protein optimization. First, the relationship between actual feature values and experimental measurements often exhibits nonlinearity, which stems from the diversity of experimental methods. Therefore, relying solely on minimizing the difference between predicted and observed values for denoising may cause the model to overfit to a specific dataset, thereby reducing the model's generalization ability. Second, unlike traditional denoising diffusion models, researchers require the accumulated noise to converge.
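The first challenge can be illustrated with a small toy example. Suppose two assays measure the same underlying effect, one linearly and one through a nonlinear but monotonic readout. A prediction then has very different numerical errors against the two assays, while its rank correlation is the same. The values and the exponential readout below are invented for illustration, and the `spearman` helper assumes there are no tied values.

```python
import math

def ranks(values):
    """Rank of each value (0 = smallest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order):
        result[idx] = rank
    return result

def spearman(a, b):
    """Spearman rank correlation for two sequences without ties."""
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(ranks(a), ranks(b)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

true_effect = [-2.0, -0.5, 0.1, 0.8, 1.5]
assay_linear = list(true_effect)                       # linear readout
assay_nonlinear = [math.exp(v) for v in true_effect]   # monotonic, nonlinear readout
prediction = [-1.8, -0.7, 0.0, 1.0, 1.2]

print("MSE:     ", mse(prediction, assay_linear), mse(prediction, assay_nonlinear))
print("Spearman:", spearman(prediction, assay_linear), spearman(prediction, assay_nonlinear))
```

Because a monotonic transformation leaves the ordering of mutations unchanged, a ranking-based measure is unaffected by how a particular experiment scales its readout, whereas a numerical error is not.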

To address these challenges, the researchers proposed a rank-based denoising diffusion process that focuses on maximizing ranking relevance, as shown in the figure below. On the left of the figure, the training of DePLM involves two main steps: a forward corruption process and a learned reverse denoising process.

In the noise addition step, the researchers used a sorting algorithm to generate trajectories that move from the ranking given by the property-specific likelihood to the ranking given by the evolutionary likelihood, and DePLM is trained to model the reverse of this process. On the right of the figure, the researchers show how the Spearman coefficient changes during the transition from the evolutionary likelihood to the property-specific likelihood.
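The idea of a sorting-based trajectory can be sketched in a few lines. The article does not say which sorting algorithm the authors use, so the example below assumes a simple bubble sort with adjacent swaps; the mutation names and ranks are invented. Starting from the property-specific ranking, each swap moves the order one step closer to the evolutionary ranking, and the Spearman correlation with the property-specific ranking drops with every step.

```python
def sorting_trajectory(start_order, target_rank):
    """Bubble-sort start_order by target_rank, recording the order
    after every adjacent swap (assumed algorithm, for illustration)."""
    order = list(start_order)
    trajectory = [list(order)]
    for end in range(len(order) - 1, 0, -1):
        for i in range(end):
            if target_rank[order[i]] > target_rank[order[i + 1]]:
                order[i], order[i + 1] = order[i + 1], order[i]
                trajectory.append(list(order))
    return trajectory

def spearman_between_orders(order_a, order_b):
    """Spearman correlation between two orderings of the same items."""
    n = len(order_a)
    pos_a = {item: i for i, item in enumerate(order_a)}
    pos_b = {item: i for i, item in enumerate(order_b)}
    d2 = sum((pos_a[item] - pos_b[item]) ** 2 for item in order_a)
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

# Hypothetical mutations, ordered best to worst by the measured property.
property_order = ["M1", "M2", "M3", "M4", "M5"]
# Hypothetical rank of each mutation under the PLM's evolutionary likelihood.
evolution_rank = {"M1": 2, "M2": 0, "M3": 4, "M4": 1, "M5": 3}

forward = sorting_trajectory(property_order, evolution_rank)
correlations = [spearman_between_orders(property_order, step) for step in forward]
reverse = forward[::-1]  # the direction a denoiser would learn to follow

for step, rho in zip(forward, correlations):
    print(step, round(rho, 2))
```

Read in reverse, the same trajectory starts at the evolutionary ranking and ends at the property-specific ranking, which is the direction the denoising model is trained to reproduce.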


DePLM training process

Finally, to achieve dataset-independent learning and strong generalization, the researchers run the diffusion process in the ranking space of property values and replace the traditional objective of minimizing numerical error with maximizing ranking relevance.

Research results: DePLM has superior performance and strong generalization ability

Performance evaluation: Verifying the advantages of combining evolutionary information with experimental data

First, to evaluate the performance of DePLM in protein engineering tasks, the researchers compared it with nine baselines, including four protein sequence encoders trained from scratch (CNN, ResNet, LSTM, and Transformer), and five self-supervised models (OHE, a fine-tuned version of ESM-1v, ESM-MSA, Tranception, and ProteinNPT).

The results are shown in the following table, where the best and second-best results are marked in bold and underlined, respectively. DePLM outperforms the baseline models, confirming the advantage of combining evolutionary information with experimental data in protein engineering tasks.


Performance of DePLM and baseline models in protein engineering tasks


It is worth noting that ESM-MSA and Tranception capture stronger evolutionary information than ESM-1v because they incorporate multiple sequence alignments (MSAs). By comparing their results, the researchers showed that higher-quality evolutionary information significantly improves results after fine-tuning. Even with these improvements, however, their performance still falls short of DePLM. The researchers also noted that DePLM outperforms ProteinNPT, which underlines the effectiveness of the proposed denoising training procedure.

Generalization ability evaluation: Eliminate the influence of irrelevant factors and improve performance

Next, to further evaluate the generalization ability of DePLM, the researchers compared it with four self-supervised baselines (ESM-1v, ESM-2, and TranceptEVE), two structure-based baselines (ESM-IF and ProteinMPNN), and three supervised baselines (CNN and fine-tuned versions of ESM-1v and ESM-2).

The results are shown in the following table, with the best and second-best results marked in bold and underlined, respectively. DePLM consistently outperforms all baseline models. This further demonstrates the inadequacy of models that rely solely on unfiltered evolutionary information, in which the signal for the target property is diluted because evolution optimizes multiple objectives simultaneously. By eliminating the influence of irrelevant factors, DePLM significantly improves performance.


Generalization ability assessment

In addition, the baseline models ESM1v (FT) and ESM2 (FT), which are trained to minimize the difference between predicted and experimental scores, perform much worse than DePLM. This result indicates that optimizing the model in the ranking space reduces the bias introduced by a specific dataset, leading to better generalization. The researchers also observed that protein structural information contributes to predicting stability and binding, while evolutionary information benefits fitness and activity.

In summary, extensive experimental results show that DePLM not only outperforms the current state-of-the-art models in predicting mutation effects, but also exhibits strong generalization capabilities to novel proteins.

Zhejiang University team continues to advance PLM research and promote the development of the bio-industry

Large protein language models can predict protein structure, function and interactions, and represent a cutting-edge application of AI in biology. By learning the patterns and structures of protein sequences, they can predict the function and conformation of proteins, which is of great significance for new drug development, disease treatment and basic biological research.

Faced with this promising emerging field, the Zhejiang University team has continued to delve into it in recent years and has achieved a number of innovative scientific research results.

In March 2023, Professor Huajun Chen, Dr. Qiang Zhang and their research team at the AI Interdisciplinary Center developed a pre-trained protein language model. The work was published at ICLR 2023 (International Conference on Learning Representations) under the title "Multi-level Protein Structure Pre-training with Prompt Learning". ICLR is one of the top conferences in deep learning and was co-founded by Turing Award winners Yoshua Bengio and Yann LeCun.

In this work, the research team was the first to propose a protein-oriented prompt learning mechanism and built the PromptProtein model. Three pre-training tasks were designed to inject primary, tertiary, and quaternary structural information of proteins into the model. To use this structural information flexibly, and inspired by prompting techniques in natural language processing, the researchers proposed a prompt-guided pre-training and fine-tuning framework. Experimental results on protein function prediction and protein engineering tasks show that the proposed method outperforms traditional models.

In 2024, the team made further progress. To address the problem that PLMs are good at understanding amino acid sequences but cannot understand human language, the team of Chen Huajun and Zhang Qiang from Zhejiang University proposed the InstructProtein model, which uses knowledge instructions to align protein language with human language, explores bidirectional generation between the two, effectively bridges the gap between them, and demonstrates the ability to integrate biological sequences into large language models.

The research, titled "InstructProtein: Aligning Human and Protein Language via Knowledge Instruction", was accepted by the main conference of ACL 2024. Experiments on a large number of bidirectional protein-text generation tasks show that InstructProtein outperforms existing state-of-the-art LLMs.


Paper address: 

https://arxiv.org/abs/2310.03269

These papers are only one part of the team's work. According to reports, researchers at the Zhejiang University AI Interdisciplinary Center also hope to use protein or molecular language models to drive scientific experimental robots such as iBioFoundry and iChemFoundry, combining real-world sensor signals, proteins, and human language to establish a link between language and perception.

In the future, the team looks forward to further industrializing its research results and contributing more valuable exploration and support to new drug development and the life and health fields.

References:

1. https://neurips.cc/virtual/2024/poster/95517

2. https://hic.zju.edu.cn/2023/0328/c56130a2733579/page.htm