New Breakthrough in Vaccine Research and Development: Beihang Team Proposes a New Method for Predicting Viral Antigen Immunogenicity, VirusImmu

Infectious diseases seriously endanger human health and life. Of the more than 4,000 viruses discovered so far, more than 100 directly threaten human health and life. More worrying still, new pathogens are constantly being discovered. According to media reports, about half of the 32 new infectious diseases discovered worldwide in the past 20 years have appeared in China.
Therefore, vaccine development is particularly important. In the long process of vaccine development, the first task is to identify protective immunogens. Machine learning (ML) methods are very efficient in analyzing big data such as microbial proteomes and can significantly reduce the cost of experimental work for developing new vaccine candidates.
Li Jing and colleagues from Beihang University (Beijing University of Aeronautics and Astronautics) developed VirusImmu, a machine learning ensemble method for predicting the immunogenicity of viral antigens. The method showed great potential in predicting the immunogenicity of viral protein fragments and gives vaccine developers a more comprehensive tool. The work was published on bioRxiv.

Paper address:
https://www.biorxiv.org/content/10.1101/2023.11.23.568426v1
Dataset: Hundreds of antigens involved in training and testing
The training and testing datasets consisted of 100 antigens (positive set) and 100 non-antigens (negative set).
Dataset download address:
https://github.com/zhangjbig/VirusImmu/tree/main/data

The protective antigens are verified protein antigens screened from the literature. The corresponding protein sequences are from UniProt (Universal Protein) and NCBI (National Center for Biotechnology Information). Proteins with complete fragments are preferred.
Note: UniProt is the most information-rich and resource-rich protein database.
Non-protective (non-antigen) protein sequences were randomly selected from the Virus Bioinformatics Resource Center.
The researchers used BLAST (Basic Local Alignment Search Tool) to confirm that the non-antigens had no sequence identity with the antigens. They then used a random sampling cross-validation strategy, holding out 20% of the positive and negative sets as the test set, and repeated this random split 50 times.
Note: BLAST is a biological macromolecule sequence comparison search tool.
The external dataset was constructed independently by the researchers and consisted of 59 antigens and 54 non-antigens. The antigen sequences were manually collated from the UniProt and Protegen databases, and the non-antigen sequences were randomly selected from UniProt following the same procedure used for the training data.
Building the best ensemble model VirusImmu
In the past decade, methods for predicting the immunogenicity of protein antigens have been divided into two main categories: filtering and classification. The most representative method of classification prediction is VaxiJen, which proposes a method for predicting protective bacterial antigens.
However, VaxiJen focuses on predicting bacterial immunogenicity. To overcome this limitation, the researchers from Beihang University proposed VirusImmu, an ensemble machine learning method for predicting viral immunogenicity.
Unlike VaxiJen, which uses only a single traditional regression algorithm or simple majority voting, VirusImmu adopts soft voting. To choose its component models, the researchers evaluated the performance of eight machine learning models in predicting antigen immunogenicity using a random sampling cross-validation strategy.
The researchers conducted a total of 50 rounds of randomized experiments, each of which divided the dataset into a training set and a test set at a ratio of 8:2. The training set was used to train each model, and then the trained model was evaluated for immunogenicity prediction on the test set.
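The repeated 8:2 protocol described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: the function names (`stratified_holdout`, `repeated_holdout`), the per-class (stratified) split, the seed, and the placeholder `dummy_evaluate` callback are all hypothetical. A real callback would train a model on the training indices and return its ROC AUC on the test indices.

```python
import random
from statistics import mean


def stratified_holdout(labels, test_fraction, rng):
    """Return (train_idx, test_idx), holding out test_fraction of each class."""
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        members = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test_idx.extend(members[:n_test])
        train_idx.extend(members[n_test:])
    return sorted(train_idx), sorted(test_idx)


def repeated_holdout(labels, evaluate, rounds=50, test_fraction=0.2, seed=0):
    """Run `rounds` random splits and return (mean score, per-round scores)."""
    rng = random.Random(seed)
    scores = []
    for _ in range(rounds):
        train_idx, test_idx = stratified_holdout(labels, test_fraction, rng)
        scores.append(evaluate(train_idx, test_idx))
    return mean(scores), scores


# 100 antigens (label 1) and 100 non-antigens (label 0), as in the dataset section.
labels = [1] * 100 + [0] * 100


def dummy_evaluate(train_idx, test_idx):
    # Placeholder (assumption): reports the held-out fraction instead of an AUC.
    # Replace with "train on train_idx, score on test_idx" for a real model.
    return len(test_idx) / (len(train_idx) + len(test_idx))


mean_score, scores = repeated_holdout(labels, dummy_evaluate)
```

Averaging each model's score over the 50 rounds, as `repeated_holdout` does, reduces the chance that a ranking of models reflects one lucky or unlucky split.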

Averaged over the 50 rounds of randomized experiments, the ROC statistics showed that random forest (RF) had the strongest predictive ability.
To improve the model's ability to predict immunogenicity, the researchers built a soft voting ensemble classifier (VirusImmu) from the top three models: RF, XGBoost, and k-nearest neighbors (kNN). The predicted probabilities of RF, XGBoost, and kNN are weighted and summed to give the ensemble's prediction.
To determine the weights for RF, XGBoost, and kNN, the researchers enumerated all possible weight combinations (232 in total), varying each weight from 0 to 1 in increments of 0.05, and used ROC analysis to evaluate the ensemble's performance under each combination.
The results show that VirusImmu outperforms each individual model on the test set.
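The soft voting and weight search described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the helper names, the constraint that the three weights sum to 1, and the toy probabilities, which are made up and are not data from the paper. Under that sum-to-1 constraint a 0.05 grid yields 231 combinations, close to but not identical to the 232 reported, so the authors' exact enumeration may differ.

```python
from itertools import product


def roc_auc(y_true, scores):
    """ROC AUC as the chance a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def weight_grid(n_models=3, steps=20):
    """Yield all weight tuples on a 1/steps grid (0.05 for steps=20) summing to 1."""
    for units in product(range(steps + 1), repeat=n_models):
        if sum(units) == steps:
            yield tuple(u / steps for u in units)


def soft_vote(prob_lists, weights):
    """Weighted sum of each model's predicted probability, per sample."""
    return [sum(w * p for w, p in zip(weights, probs)) for probs in zip(*prob_lists)]


def best_weights(y_true, prob_lists):
    """Return (weights, auc) for the grid point with the highest ROC AUC."""
    best = None
    for weights in weight_grid(len(prob_lists)):
        auc = roc_auc(y_true, soft_vote(prob_lists, weights))
        if best is None or auc > best[1]:
            best = (weights, auc)
    return best


# Toy, made-up probabilities standing in for RF, XGBoost, and kNN outputs.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
rf = [0.9, 0.8, 0.4, 0.6, 0.5, 0.3, 0.2, 0.1]
xgb = [0.7, 0.9, 0.6, 0.3, 0.4, 0.5, 0.2, 0.1]
knn = [0.6, 0.6, 0.8, 0.4, 0.2, 0.4, 0.6, 0.2]

weights, auc = best_weights(y_true, [rf, xgb, knn])
```

Because the grid includes the corner points (1, 0, 0), (0, 1, 0), and (0, 0, 1), the best ensemble found this way can never score below the best single model on the data used for the search. A fair comparison still requires scoring the chosen weights on held-out data.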
VirusImmu has superb performance regardless of protein sequence length
* Comparative experiment 1: Performance comparison between VirusImmu and VaxiJen
VaxiJen is one of the few methods that uses the physicochemical properties of protein sequences to predict immunogenicity. Unlike VirusImmu, VaxiJen uses a single traditional regression algorithm or majority voting. The researchers therefore compared the performance of VirusImmu with that of VaxiJen.
On the test set, the AUC (Area Under the Curve) of VirusImmu is 0.782 and that of VaxiJen is 0.75. The average ROC curves show that VirusImmu is better than VaxiJen (95% confidence interval).
* Comparative experiment 2: Performance comparison of VirusImmu with RF, kNN and XGBoost
To further validate the performance of VirusImmu, researchers independently collected an external test set containing 59 antigens and 54 non-antigens.
The ROC curve shows that VirusImmu (AUC=0.712) outperforms RF (AUC=0.676) and kNN (AUC=0.699), and its performance is similar to XGBoost (AUC=0.717). VaxiJen performs the worst on the external test set (AUC=0.609).
In short, VirusImmu produced more stable protein immunogenicity predictions than the eight commonly used ML prediction methods and VaxiJen on both the test set and the external test set.
* Comparative experiment 3: Performance comparison of VirusImmu, NetBCE and EpiDope
The researchers also compared the performance of VirusImmu with that of two recently published prediction methods, NetBCE and EpiDope. NetBCE can only predict the immunogenicity of protein sequences shorter than 24 amino acids, whereas VirusImmu handles both long and short protein sequence fragments. Although EpiDope combines an Embeddings from Language Models (ELMo) deep neural network (DNN) with a Long Short-Term Memory (LSTM) DNN, it achieves an AUC of 0.667 and still performs worse than VirusImmu (AUC=0.712).

* Comparative experiment 4: Robustness comparison between VirusImmu and other models
To test the robustness of all models, researchers conducted 50 rounds of random sampling, each using about 30% of antigen and non-antigen samples in the external test set. VirusImmu achieved better performance than VaxiJen in terms of AUC and F1 Score.
Note: F1 Score is the harmonic mean of the model's precision and recall.
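As a rough sketch of the robustness test described above, the snippet below computes the F1 Score from its definition and repeats it over 50 random subsamples of about 30% of the data. The function names, the per-class sampling, the seed, and the placeholder predictions are assumptions for illustration. They do not reproduce the authors' pipeline or results.

```python
import random


def f1_score(y_true, y_pred):
    """F1 = harmonic mean of precision and recall for the positive class (1)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def subsample_scores(y_true, y_pred, metric, rounds=50, fraction=0.3, seed=0):
    """Score `rounds` random subsamples drawing `fraction` of each class."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(y_true) if y == 1]
    neg = [i for i, y in enumerate(y_true) if y == 0]
    scores = []
    for _ in range(rounds):
        idx = rng.sample(pos, round(len(pos) * fraction))
        idx += rng.sample(neg, round(len(neg) * fraction))
        scores.append(metric([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    return scores


# External test set sizes from the article: 59 antigens, 54 non-antigens.
y_true = [1] * 59 + [0] * 54
# Placeholder (assumption): a perfect predictor, used only to exercise the code.
y_pred = list(y_true)

scores = subsample_scores(y_true, y_pred, f1_score)
```

The spread of the per-round scores, not just their mean, is what indicates how stable a model is across different subsets of the external data.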
Since a model's predictive ability may be affected by protein sequence length, the researchers divided the external test set into five groups by sequence length in steps of 200 amino acids, and then performed 50 rounds of random sampling.
XGBoost and VirusImmu were the top two performers on the external validation data. XGBoost's AUC was slightly better than VirusImmu's, but its F1 Score was worse. XGBoost also performed worse than VirusImmu for proteins shorter than 200 amino acids and for those of 600-800 amino acids.
Since most epitopes are protein fragments shorter than 200 amino acids, VirusImmu is more broadly applicable than XGBoost.
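The length-based grouping described above might look like the following sketch. The bin labels, the half-open boundaries (a sequence of exactly 200 falls in the second bin), and the helper names are assumptions, since the article does not specify them. The first example sequence is the 12-residue peptide quoted later in the article, and the second is a made-up 650-residue placeholder.

```python
from collections import defaultdict


def length_bin(seq_len, step=200, n_bins=5):
    """Map a sequence length to one of n_bins labels, each `step` wide."""
    index = min(seq_len // step, n_bins - 1)
    low = index * step
    if index == 0:
        return f"<{step}"
    if index == n_bins - 1:
        return f">={low}"  # last bin is open-ended
    return f"{low}-{low + step}"


def group_by_length(sequences, step=200, n_bins=5):
    """Group sequences into length bins; returns {label: [sequences]}."""
    groups = defaultdict(list)
    for seq in sequences:
        groups[length_bin(len(seq), step, n_bins)].append(seq)
    return dict(groups)


# One short peptide from the article and one hypothetical long sequence.
groups = group_by_length(["VLEEQSKIDPNF", "M" * 650])
```

Each group can then be passed through the same 50-round random sampling so that AUC and F1 Score are compared within a length range, not across the whole set.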
Overall, VirusImmu is not based on sequence alignment and removes the influence of protein sequence length. Compared with similar prediction tools, it predicts the immunogenicity of both proteins and peptides with higher accuracy and greater versatility.

To further demonstrate the reliability of VirusImmu, researchers selected SARS-CoV-2 epitopes from published literature to verify the immunogenicity prediction ability of VirusImmu.
The results show that among the 15 epitopes reported in four papers, 14 were predicted as antigens by VirusImmu, supporting its good performance in predicting the immunogenicity of viral proteins.
VirusImmu helps identify peptide vaccine candidates for African swine fever virus (ASFV)
Since there is no effective vaccine or treatment for African swine fever virus, it is necessary to identify protective antigens. The study found that the ASFV pp220 polyprotein, which is critical for the structural integrity of the virus, contains epitopes that can induce a strong immune response in pigs, indicating that it has the potential to be used in vaccine development.
To identify antigenic epitopes, the researchers used 17 of the most popular prediction methods, including BCPred and the Immune Epitope Database (IEDB) server, and predicted 1,376 candidate linear B-cell epitopes from the pp220 protein.
The researchers then filtered the candidates using strict criteria. After applying the VaxiJen criterion (VaxiJen≤1.3), 29 epitopes remained, of which 12 were classified as non-allergens and non-toxins. VirusImmu predicted that 8 of these 12 epitopes were antigenic.

In order to confirm the binding of the 8 epitopes to ASFV serum IgG antibodies, the researchers collected mixed sera from 5 ASFV-infected pigs and 5 healthy pigs.
Indirect ELISA confirmed seven of the eight as antigenic linear B-cell epitopes: all but one reacted specifically and dose-dependently with serum antibodies from ASFV-infected pigs but not from healthy pigs, while an arbitrary control peptide ('RRRRRRRRRRRRRR') showed no such reaction. An epitope predicted as non-antigenic by VirusImmu ('VLEEQSKIDPNF') also showed no specific binding to serum antibodies.
These results provide a strong example for the application of VirusImmu in real-world scenarios.
AI technology accelerates vaccine development
With the rapid development of science and technology, AI has achieved new breakthroughs in biomedicine, including AlphaFold 2, developed by DeepMind, which successfully predicted protein structures, and later technologies such as generative protein design. In drug development, AI serves more as a tool.

First, AI can be used for analysis and prediction of viral genomes. Through deep learning and pattern recognition of large amounts of viral genome data, AI can accurately predict the mutation and evolution trends of the virus, helping scientists to quickly identify the key protein targets of the virus and quickly develop related vaccines.
Secondly, AI plays an important role in the drug screening stage of vaccine development. The traditional drug screening process is usually time-consuming, labor-intensive and uncertain. However, through large-scale simulation experiments and data mining, AI can quickly evaluate the interaction between drugs and viruses, screen out candidate drugs with potential activity, and improve the efficiency of vaccine development.
Additionally, AI can be used to optimize the design of vaccine clinical trials. By simulating large-scale experimental data, AI can help scientists predict and evaluate the response and effect of vaccines in the human body, discover possible safety issues and side effects in advance, and optimize the design of experiments.
In terms of the market, multinational pharmaceutical companies tend to pay more attention to AI technology. According to statistics from AI consulting agency Deep Pharma Intelligence, as of December 2022, the total investment of 800 AI pharmaceutical companies worldwide reached US$5.93 billion, a 27-fold increase in 9 years.
So, what challenges does AI technology face in the development of vaccines and other drugs? According to Li Wenwen, assistant professor in the Department of Information Management and Business Intelligence at the School of Management of Fudan University, training AI algorithms requires a huge amount of data, and in drug development this data includes protein structures, amino acid sequences, and so on.
At present, the difficulty of applying AI technology in drug research and development lies in data acquisition and accumulation. Laboratory data is expensive, while pharmaceutical companies do not share enough data, and basic, labeled data is scarce. These are all limitations.