Westlake University Uses Transformer to Analyze the Self-Assembly Characteristics of Billions of Peptides and Crack the Self-Assembly Rules

2 years ago

Peptides are biologically active substances composed of two or more amino acids through peptide bonds, which can form higher-level protein structures through folding and helical formation. Peptides are not only associated with multiple physiological activities, but can also self-assemble into nanoparticles and participate in biological detection, drug delivery, and tissue engineering.
However, the sequence space of peptides is enormous: chains of just 10 amino acids can form more than 10 billion distinct peptides. This makes it difficult to study their self-assembly properties comprehensively and systematically, and to optimize the design of self-assembling peptides.
To this end, Li Wenbin's research group at Westlake University used a Transformer-based regression network to predict the self-assembly properties of tens of billions of peptides, and analyzed the effects of amino acids at different positions on the self-assembly properties, providing a powerful new tool for the study of self-assembling peptides.

Author | Xuecai

Editor | Sanyang

Peptides are biologically active substances composed of two or more amino acids linked by peptide bonds. They are easy to synthesize, biodegradable, biocompatible, and chemically diverse, and they can form nanomaterials with fluorescent, semiconducting, or magnetic properties. For these reasons, peptides have received widespread attention in the scientific research community.

However, precisely because peptides are so diverse, and because there is currently no method for predicting their self-assembly tendency (AP, Aggregation Propensity), it is difficult to turn them into ordered structures. At present, only a few peptides can self-assemble into supramolecular structures that meet requirements and have been put into industrial application.

Figure 1: Specific fluorescence of different self-assembled probes to hCA, avidin, and trypsin

Over the past few decades, self-assembling peptides have mainly been discovered through biological experiments. However, experiments are often time-consuming and subject to bias, which makes them poorly suited to comprehensive and systematic research on large numbers of peptides.

In recent years, computational screening has been widely used in the design of self-assembling peptides. In 2015, Frederix et al. used coarse-grained molecular dynamics (CGMD) to analyze the AP of tripeptides. However, the number of possible peptide sequences grows exponentially with the number of amino acids, which greatly increases the cost of CGMD.

Therefore, some researchers have combined AI with CGMD to reduce the analysis cost of traditional methods. However, AI-CGMD requires a large amount of training data. It is estimated that covering the more than 10 billion possible decapeptide sequences would require data for 3.2 million peptide sequences. For these reasons, there has so far been no AP prediction for peptides composed of more than 5 amino acids (pentapeptides).
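The size of the sequence space can be checked with a short calculation. The sketch below assumes the 20 canonical amino acids and counts linear sequences only; the helper name `sequence_space` is illustrative and not from the paper.

```python
# Assumption: the standard 20-letter amino acid alphabet.
N_AMINO_ACIDS = 20


def sequence_space(length: int, alphabet_size: int = N_AMINO_ACIDS) -> int:
    """Count the distinct linear peptide sequences of a given length."""
    return alphabet_size ** length


for name, length in [("tripeptide", 3), ("pentapeptide", 5), ("decapeptide", 10)]:
    print(f"{name:>12}: {sequence_space(length):,} sequences")
```

Under this assumption there are 20^5 = 3.2 million pentapeptides, which matches the figure quoted for the pentapeptide analysis, and 20^10 is roughly 10 trillion, comfortably "more than 10 billion".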

To solve these problems, Li Wenbin's research group at Westlake University combined a Transformer-based regression network (TRN) with CGMD to predict the self-assembly properties of tens of billions of peptides. They obtained the AP of pentapeptides and decapeptides and determined how amino acids at different positions affect the AP of a peptide. The results have been published in "Advanced Science".

Related results have been published in "Advanced Science"

Paper link:

https://onlinelibrary.wiley.com/doi/full/10.1002/advs.202301544

Experimental procedures

Training set: Latin hypercube sampling

First, 8,000 peptide sequences were screened using Latin hypercube sampling, and their APs were obtained by analyzing the screened peptide sequences using the CGMD model.
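The article does not describe how Latin hypercube sampling was applied to sequences, so the following is only a minimal sketch of one plausible reading: each sequence position is treated as one dimension, the unit interval is split into as many strata as there are samples, and each coordinate is mapped to one of the 20 residues. The function name, the alphabet ordering, and the mapping are all assumptions.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 canonical residues, one-letter codes


def latin_hypercube_peptides(n_samples: int, length: int, seed: int = 0) -> list[str]:
    """Draw peptide sequences with a Latin hypercube design.

    Every position is one dimension. The unit interval of each dimension is
    cut into n_samples equal strata, one point is drawn per stratum, and the
    points are shuffled independently for each dimension.
    """
    rng = random.Random(seed)
    columns = []
    for _ in range(length):
        # One uniform point inside each stratum [k/n, (k+1)/n).
        points = [(k + rng.random()) / n_samples for k in range(n_samples)]
        rng.shuffle(points)
        columns.append(points)

    peptides = []
    for i in range(n_samples):
        residues = []
        for col in columns:
            # Map a coordinate in [0, 1) to a residue index, clamped for safety.
            idx = min(int(col[i] * len(AMINO_ACIDS)), len(AMINO_ACIDS) - 1)
            residues.append(AMINO_ACIDS[idx])
        peptides.append("".join(residues))
    return peptides


training_set = latin_hypercube_peptides(n_samples=8000, length=5)
print(len(training_set), training_set[:3])
```

When the sample count is a multiple of 20, this design places every residue equally often at every position, which is the kind of even coverage that motivates stratified sampling over plain random draws.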

Model building: encoding and decoding

The researchers built an AP prediction model based on TRN. The model consists of a Transformer encoder and a multi-layer perceptron (MLP) decoder. The Transformer encoder consists of an input embedding layer (Input Embedding), a positional encoder (Positional Encoding), and an encoding block (Encoding Block).

The input embedding layer maps the constituent units of the peptide (i.e., amino acids) into a 512-dimensional continuous space, and the positional encoder supplies the position information of the amino acids. The encoding block includes a self-attention network and a feedforward neural network.

The Transformer encoder finally outputs a hidden-layer representation of the peptide sequence. After the MLP reduces the dimensionality of this representation five times, it is compressed into a one-dimensional vector. The last layer of the MLP decoder outputs the AP of the peptide.
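To make the encoder's input stage concrete, here is a toy sketch of an embedding lookup, a positional encoding, and one self-attention step, in plain Python. It is not the authors' implementation. The width of 8 (the article states 512), the sinusoidal encoding, the random untrained embedding table, and the identity query/key/value projections are all simplifying assumptions, and the feedforward network and MLP decoder are omitted.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
D_MODEL = 8  # toy width for readability; the article states 512


def embed(peptide, table):
    """Look up one embedding vector per residue."""
    return [table[AMINO_ACIDS.index(res)][:] for res in peptide]


def positional_encoding(length, d_model):
    """Sinusoidal encoding (an assumption, borrowed from the original Transformer)."""
    pe = []
    for pos in range(length):
        row = []
        for i in range(d_model):
            angle = pos / (10000 ** (2 * (i // 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        pe.append(row)
    return pe


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Single-head scaled dot-product attention with identity projections."""
    d = len(x[0])
    weights = []
    for q in x:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
        weights.append(softmax(scores))
    out = [[sum(w * v[j] for w, v in zip(row, x)) for j in range(d)] for row in weights]
    return out, weights


rng = random.Random(0)
# Untrained, randomly initialised embedding table (20 residues x D_MODEL).
table = [[rng.gauss(0, 1) for _ in range(D_MODEL)] for _ in AMINO_ACIDS]

peptide = "NRMMR"
tokens = embed(peptide, table)
pe = positional_encoding(len(peptide), D_MODEL)
x = [[t + p for t, p in zip(tok, pos)] for tok, pos in zip(tokens, pe)]
hidden, attn = self_attention(x)
print(len(hidden), len(hidden[0]))  # one hidden vector per residue
```

Each row of `attn` is a probability distribution over the five residues, so every output vector is a weighted mix of all positions in the peptide. This is what lets the encoder relate amino acids at different positions to one another.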

Figure 2: Workflow of the TRN model

a: Atomic models of α-helix and β-sheet and CG model of α-helix;

b: The process of outputting training data through CGMD;

c: Schematic diagram of the TRN model.

Experimental Results

Model prediction: Improved by 54.5%

The researchers compared the AP prediction performance of the TRN model with that of non-deep-learning models: support vector machine (SVM), random forest (RF), nearest neighbor (NN), Bayesian regression (BR), and linear regression (LR).

With only 8,000 training samples, the model's coefficient of determination R² exceeded 0.85, which was 11.8% higher than SVM and 54.5% higher than RF.

Figure 3: Performance comparison of TRN model and other non-deep learning models

As the amount of training data increases, the performance of the TRN model improves. When the number of training samples reaches 54,000, the mean absolute error (MAE) of the TRN model is 0.05 and the R² is 0.92.
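For reference, the two metrics quoted above can be computed as follows. This is a minimal sketch using the standard definitions of MAE and R². The AP values are hypothetical numbers chosen for illustration and are not data from the paper.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between reference and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical AP values, for illustration only (not data from the paper).
ap_cgmd = [0.10, 0.35, 0.50, 0.80, 0.95]  # reference values, e.g. from simulation
ap_pred = [0.12, 0.31, 0.56, 0.77, 0.99]  # model predictions

print(f"MAE = {mean_absolute_error(ap_cgmd, ap_pred):.3f}")
print(f"R2  = {r2_score(ap_cgmd, ap_pred):.3f}")
```

MAE is expressed in the same units as AP, while R² measures the share of the variance in the reference values that the predictions explain, so a perfect model scores 1.0.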

Figure 4: Effect of training data on TRN model performance

The above results show that, compared with non-deep-learning models, the TRN model can achieve higher prediction accuracy with less training data. At the same time, its performance continues to improve as the amount of training data increases.

Hydrophilicity: the AP_HC correction

It has been reported that, in addition to AP, the hydrophilicity (log P) of a peptide also affects its self-assembly.

As AP increases from low to high, the median log P decreases, indicating that strongly hydrophilic peptides aggregate poorly. However, the AP of peptides with log P between 0.25 and 0.75 spans a wide range, from 0 to 1, indicating that the two are not closely related and that other factors also affect the AP of peptides.

Figure 5: Relationship between AP and log P

a: Correlation between AP and log P of 3.2 million pentapeptides;

b: Distribution of APs in different intervals;

c: Distribution of log P in different AP intervals.

To account for the effects of both AP and log P on peptide self-assembly, the researchers used log P to correct AP and obtained AP_HC. The corrected AP_HC makes it possible to distinguish peptide self-assembly from precipitation and to screen out peptides that can form hydrogels.

Figure 6: Relationship between AP_HC and log P

a: Correlation between AP_HC and log P of 3.2 million pentapeptides;

b: Distribution of AP_HC in different intervals;

c: Distribution of log P in different AP_HC intervals.

Self-assembly rules: the influence of amino acids at different positions

After studying the effects of the 20 amino acids at different positions in the pentapeptide on AP_HC, the researchers summarized how different amino acids and their distribution influence the self-assembly properties of polypeptides and divided the amino acids into 5 groups.

The first group of amino acids includes phenylalanine (F), tyrosine (Y) and tryptophan (W). This group of amino acids has π-π stacking and strong hydrophobicity, and contributes most to the self-assembly of peptides. Among them, W has the strongest hydrophobicity and the greatest impact on AP_HC, which is consistent with the observations of WWWWW.

Figure 7: Distribution ratio of 20 amino acids at different positions in different AP intervals

F, Y, and W contribute most to peptide self-assembly when they are at positions 3-5, especially at position 3. This may be because the amino acid at position 3 has a higher degree of freedom and is more likely to drive peptide self-assembly through π-π interaction.
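A position-wise analysis of this kind can be sketched as a simple frequency count over the peptides that fall in a given AP_HC interval. The sequences, scores, and the 0.75 threshold below are hypothetical values chosen for illustration. They are not data or parameters from the paper.

```python
from collections import Counter


def positional_frequencies(peptides, length=5):
    """Return, for each position, the fraction of peptides carrying each residue."""
    freqs = []
    for pos in range(length):
        counts = Counter(p[pos] for p in peptides)
        total = sum(counts.values())
        freqs.append({res: n / total for res, n in counts.items()})
    return freqs


# Hypothetical (sequence, AP_HC) pairs, for illustration only.
scored = [
    ("KWFWE", 0.82),
    ("IVFYS", 0.77),
    ("TLWFD", 0.91),
    ("GANQA", 0.05),
    ("SCWWK", 0.88),
]

# Keep only peptides in the high-AP_HC interval (threshold is an assumption).
high = [seq for seq, ap_hc in scored if ap_hc >= 0.75]

for pos, dist in enumerate(positional_frequencies(high), start=1):
    print(f"position {pos}: {dist}")
```

Repeating the count for several AP_HC intervals and comparing the resulting tables shows which residues are enriched at which positions in strongly self-assembling peptides.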

Figure 8: π-π stacking diagram

At position 5, however, these aromatic amino acids act as strong proton acceptors and interact with other polypeptides, which increases the distance between the benzene rings and weakens the intramolecular π-π interaction.

The second group of amino acids includes isoleucine (I), leucine (L), valine (V) and cysteine (C). Because the side chains of these amino acids exclude water, they are highly hydrophobic and contribute strongly to the self-assembly of peptides. This group of amino acids is often found at both ends of self-assembling polypeptides, especially the N-terminus.

Figure 9: Hydrophobic interactions of amino acids

The third group of amino acids includes histidine (H), serine (S) and threonine (T). This group of amino acids has polar side chains that can enhance the self-assembly ability of polypeptides through hydrogen bonds. However, hydrogen bonding is weaker than π-π stacking, so polypeptides with high AP_HC contain relatively few amino acids from the third group.

T and S tend to occupy the two ends of the peptide, especially the N-terminus, which is conducive to the formation of hydrogen bonds, while H stays away from the two ends of the peptide.

Figure 10: Effect of polar side chains on peptide structure

The fourth group of amino acids includes methionine (M) and proline (P). The distribution of M and P is basically the same across peptides with different AP_HC, and they have only a slight impact on specific indicators of peptides.

The fifth group of amino acids is not conducive to the self-assembly of peptides. It includes negatively charged aspartic acid (D) and glutamic acid (E), positively charged lysine (K) and arginine (R), highly polar asparagine (N) and glutamine (Q), and alanine (A) and glycine (G), whose side chains are minimal.

However, D and E at the C-terminus and R and K at the N-terminus can form a double-charged head group, which promotes the self-assembly of the peptide by attracting each other with opposite charges and forming a salt bridge. N and Q promote the dissolution of the peptide due to their strong polarity. A and G lack obvious interaction, which is not conducive to the self-assembly of the peptide.
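The five groups described above can be written down as a small lookup table. The sketch below only encodes the grouping as reported in this article. The names `AMINO_ACID_GROUPS`, `GROUP_OF`, and `group_profile` are illustrative, and the table says nothing about the position effects, which would need a separate model.

```python
# Groups as summarized in the article (one-letter amino acid codes).
AMINO_ACID_GROUPS = {
    1: "FYW",       # aromatic: pi-pi stacking, strongly hydrophobic
    2: "ILVC",      # hydrophobic side chains
    3: "HST",       # polar side chains, hydrogen bonding
    4: "MP",        # little influence on AP_HC
    5: "DEKRNQAG",  # generally unfavorable for self-assembly
}

# Reverse lookup: residue -> group number.
GROUP_OF = {
    res: group
    for group, residues in AMINO_ACID_GROUPS.items()
    for res in residues
}


def group_profile(peptide):
    """Return the group number of each residue, in sequence order."""
    return [GROUP_OF[res] for res in peptide]


for seq in ["WWWWW", "NRMMR", "DMGID"]:
    print(seq, group_profile(seq))
```

A profile like this gives a quick first read of a candidate sequence, for example whether it carries aromatic residues at all, before running a more expensive prediction.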

Figure 11: Effect of Coulomb interaction on peptide structure

Experimental verification: basically consistent with CGMD and TEM results

To confirm the predictions of the TRN model, the researchers used CGMD to verify the self-assembly properties of five peptides. The CGMD results are basically consistent with the predictions of the TRN model.

The self-assembly properties of NRMMR, DMGID, NRMMRDMGID and NRMMR + DMGID were also verified experimentally. The results of transmission electron microscopy (TEM) are basically consistent with those of CGMD.

Figure 12: Peptide self-assembly results observed by CGMD (a) and TEM (b)

The above results show that the TRN model can accurately predict the self-assembly properties of pentapeptides, decapeptides, and mixed pentapeptides, providing a powerful new tool for the study of self-assembling peptides.

Self-assembling peptides: a new direction in biomedicine

Although research on the self-assembly characteristics of peptides is not yet in-depth enough, self-assembling peptides have already been widely used in tissue engineering, drug delivery and biosensing. In addition, the contraction and relaxation of cells, the movement of endocytic vesicles, and the transmembrane transport of bacteria and viruses all depend on the self-assembly of polypeptides. Diseases such as Alzheimer's disease, Parkinson's disease, and type II diabetes are also related to protein misfolding.

Figure 13: Self-assembling peptides for anti-tumor drug delivery

With the development of AI, researchers are increasingly able to process large amounts of data. As biological research has evolved from traditional experimental research to computational research and then to AI research, the scale of research has also gradually increased from dozens or hundreds of possibilities to tens of billions. With the help of AI, humans are pushing the boundaries of biological research. I believe that in the future people will be able to conduct more detailed and comprehensive research on biology, allowing AI + biology to benefit the general public.

Reference Links:

https://pubs.rsc.org/en/content/articlelanding/2014/CS/C4CS00161C

