Yu Xiang's Research Group at Shanghai Jiao Tong University Released a Transferable Deep Learning Model to Identify Multiple Types of RNA Modifications and Significantly Reduce Computational Costs

2 years ago

In 2021, following strong advocacy from Gao Fu, an academician of the Chinese Academy of Sciences, mRNA vaccines rose to prominence almost overnight and became a source of hope during the COVID-19 outbreak. That period is now behind us, but research on the RNA modifications behind mRNA vaccines continues to advance rapidly.

RNA modification is an important form of post-transcriptional regulation and is broadly involved in RNA post-transcriptional processing and metabolic pathways.

RNA modification deserves attention because it plays vital biological roles in the growth and development of eukaryotic organisms. For example, recent studies have found that the destabilization effect of N⁶-methyladenosine (m⁶A) in mammalian embryonic stem cells is related to a variety of diseases, and that 5-methylcytosine (m⁵C) is related to the tolerance of rice to high temperature.

RNA modifications come in many types: more than 160 have been found in natural RNA so far. Nanopore direct RNA sequencing (DRS), developed by Oxford Nanopore Technologies (ONT), can be combined with deep learning methods to identify modifications at single-base resolution. However, this approach has difficulty detecting multiple modification types simultaneously in a single sample.

To address this problem, the research group of Yu Xiang, a tenured associate professor at the School of Life Sciences and Technology of Shanghai Jiao Tong University, and the team of Yang Jun/Wang Hongxia from Shanghai Chenshan Botanical Garden published a research paper titled "Transfer learning enables identification of multiple types of RNA modifications using nanopore direct RNA sequencing" in Nature Communications. They developed TandemMod, a transferable deep learning model that identifies multiple types of RNA modifications in DRS data.

Research highlights:

* With performance held constant, transfer learning significantly reduces computing costs such as the amount of training data and the model training time

* TandemMod provides important technical support for the identification of various types of RNA modification sites and epitranscriptome studies in animals, plants and microorganisms

* TandemMod can also be used to detect artificially modified RNA such as RNA vaccines

Paper address:

https://www.nature.com/articles/s41467-024-48437-4

The open source project "awesome-ai4s" brings together more than 100 AI4S paper interpretations and provides massive data sets and tools:

https://github.com/hyperai/awesome-ai4s

Dataset: Targeted training with multiple datasets

In order to train and evaluate the performance of the TandemMod model, the research team used multiple data sets for experiments.

First, the research team used the ELIGOS in vitro transcription dataset generated by the Nookaew laboratory. Five base-level features (mean, median, standard deviation, signal length, and base quality) were calculated for six modified bases (m¹A, m⁶A, m⁵C, hm⁵C, m⁷G, and Ψ) and compared with those of the unmodified bases.
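The five base-level features can be illustrated with a minimal sketch. The function name, the input values, and the choice of population standard deviation are assumptions made for illustration; the article does not say how the team computed them.

```python
from statistics import mean, median, pstdev

def base_level_features(signal, base_quality):
    """Summarize the current samples assigned to one base.

    signal: list of current-intensity samples (hypothetical values)
    base_quality: basecaller quality score for that base (hypothetical)
    """
    return {
        "mean": mean(signal),
        "median": median(signal),
        "std": pstdev(signal),   # population standard deviation (an assumption)
        "length": len(signal),   # number of samples assigned to the base
        "quality": base_quality,
    }

features = base_level_features([80.0, 82.0, 84.0, 82.0], base_quality=12)
```

Comparing such feature dictionaries between modified and unmodified bases is one simple way to see whether a modification shifts the signal.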

Second, the research team chose to study the performance of TandemMod on the two most common modifications in eukaryotic mRNA, m⁵C and m⁶A. The researchers trained the TandemMod m⁵C model on the Curlcake dataset. The dataset is derived from in vitro transcribed sequences containing all possible 5-mers and is divided into training and testing sets in a 4:1 ratio.
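As a rough illustration of "all possible 5-mers" and a 4:1 split, the sketch below enumerates the 4⁵ = 1024 RNA 5-mers and splits a list of placeholder read identifiers. The helper name, the seed, and the read identifiers are hypothetical; the article does not state at which level (read, site, or sequence) the real split was made.

```python
import random
from itertools import product

# All possible RNA 5-mers: 4**5 = 1024 sequences.
all_5mers = ["".join(p) for p in product("ACGU", repeat=5)]

def split_4_to_1(items, seed=0):
    """Shuffle a copy of items and split it 4:1 into (train, test)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 4 // 5
    return shuffled[:cut], shuffled[cut:]

reads = [f"read_{i}" for i in range(1000)]  # hypothetical read identifiers
train, test = split_4_to_1(reads)
```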

Schematic diagram of using a rice cDNA library to transcribe m¹A-, m⁶A-, and m⁵C-labeled transcripts in vitro and build an in vitro epitranscriptome through DRS

At the same time, RNA transcribed from synthetic sequences in vitro cannot cover the full range of natural sequences. To solve this problem, the research team performed in vitro transcription on a rice cDNA library containing a T7 promoter and obtained thousands of transcripts with different modification labels. After adding polyA tails, four training sets (m¹A, m⁶A, m⁵C, and unmodified bases) were constructed through DRS. Together they are called the in vitro epitranscriptome (IVET) dataset.

Model Architecture: A Deep Learning Framework

Based on this, the research team used the electrical signals assigned to every 5 bases and their statistical characteristics as input to train the transfer learning model TandemMod, which can simultaneously detect multiple types of RNA modifications.
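One way to picture "signals assigned to every 5 bases" is a sliding window over per-base values. The sketch below is only an assumption about the windowing, with made-up numbers and a hypothetical helper name; it is not taken from the TandemMod code.

```python
def five_base_windows(per_base_features):
    """Yield (center_index, window) for every run of 5 consecutive bases."""
    k = 5
    for start in range(len(per_base_features) - k + 1):
        yield start + k // 2, per_base_features[start:start + k]

# Hypothetical per-base mean currents for a 7-base read segment.
means = [81.2, 95.4, 77.0, 102.3, 88.8, 90.1, 79.5]
windows = list(five_base_windows(means))
```

Each window is indexed by its central base, which is the position a modification call would refer to in this sketch.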

Schematic diagram of building the TandemMod deep learning framework based on DRS data

As shown in the figure above, TandemMod consists of data preprocessing, model pretraining, and transfer learning.

Among them, model pre-training consists of 4 main components:

* One-dimensional convolutional neural network (1D-CNN) to extract local features of the original current intensity signal;

* Bidirectional long short-term memory network (Bi-LSTM), which is used to capture long-range dependencies between adjacent signals and improve the model's grasp of context over longer spans;

* Attention mechanism, which is used to weight the importance of each feature at different times and improve the model's ability to capture important signals;

* Fully connected classifier layers, which make predictions based on the combined information of all features.
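The attention component in the list above can be illustrated with a generic dot-product attention pooling step. This is a minimal sketch in plain Python, not the TandemMod implementation; the function names, the query vector, and the hidden states are all hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(hidden_states, query):
    """Weight each time step by its dot-product score with `query`
    and return (weights, weighted sum of hidden states)."""
    scores = [sum(h_i * q_i for h_i, q_i in zip(h, query)) for h in hidden_states]
    weights = softmax(scores)
    dim = len(hidden_states[0])
    context = [
        sum(w * h[d] for w, h in zip(weights, hidden_states))
        for d in range(dim)
    ]
    return weights, context

# Hypothetical recurrent-layer outputs for three time steps, two features each.
hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = attention_pool(hidden, query=[1.0, 1.0])
```

The time step that best matches the query receives the largest weight, so it contributes most to the pooled vector that a classifier would then consume.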

In addition, to verify whether transfer learning can be applied to DRS data to detect multiple types of RNA modifications, the researchers trained TandemMod on the IVET m⁵C dataset and obtained a pretrained model. In the TandemMod model, the top layer acts as a feature extractor and the bottom layer acts as a classifier. The researchers froze the top layer of the pre-trained model and retrained the bottom layer on the ELIGOS training set (hm⁵C, m⁷G, Ψ, and I) to minimize the classification error.
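The freeze-and-retrain idea can be sketched with a toy model: a fixed feature extractor whose weights are never updated, and a small logistic classifier that is the only part trained on new data. Everything here (weights, data, learning rate, function names) is made up for illustration and is far simpler than the actual network.

```python
import math

# "Pretrained" feature extractor weights: frozen, never updated below.
# All numbers are made up for illustration.
FROZEN_W = [[0.5, -0.2], [0.1, 0.8]]

def extract_features(x):
    """Frozen layer: linear map followed by tanh."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in FROZEN_W]

def predict(head, x):
    """Trainable classifier: logistic regression on the frozen features."""
    f = extract_features(x)
    z = sum(w * fi for w, fi in zip(head["w"], f)) + head["b"]
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(head, data):
    """Mean binary cross-entropy of the classifier on `data`."""
    eps = 1e-12
    return -sum(
        y * math.log(predict(head, x) + eps)
        + (1 - y) * math.log(1 - predict(head, x) + eps)
        for x, y in data
    ) / len(data)

def fine_tune(head, data, lr=0.5, epochs=2):
    """Update only the classifier by gradient descent; FROZEN_W is untouched."""
    for _ in range(epochs):
        for x, y in data:
            f = extract_features(x)
            err = predict(head, x) - y  # gradient of the log loss w.r.t. z
            head["w"] = [w - lr * err * fi for w, fi in zip(head["w"], f)]
            head["b"] -= lr * err
    return head

# Toy labelled examples standing in for a new modification type (hypothetical).
data = [([2.0, 1.0], 1), ([1.5, 2.0], 1), ([-2.0, -1.0], 0), ([-1.0, -2.0], 0)]
head = {"w": [0.0, 0.0], "b": 0.0}
loss_before = log_loss(head, data)
head = fine_tune(head, data)
loss_after = log_loss(head, data)
```

Because only the classifier's few parameters change, this kind of retraining needs far less data and time than training the whole model again, which is the saving the article attributes to transfer learning.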

Performance evaluation curve of transfer learning for detecting m⁷G, hm⁵C, Ψ and I types of RNA modification sites

After 2 epochs, all models achieved high accuracy. As shown in panels a, b, c, and d of the figure above, the ROC-AUCs of hm⁵C, m⁷G, Ψ, and I reached 0.98, 0.95, 0.96, and 0.97, respectively.
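For readers unfamiliar with the metric, ROC-AUC can be computed as the probability that a randomly chosen positive example scores higher than a randomly chosen negative one. The sketch below uses that pairwise definition on made-up labels and scores; it is not the evaluation code from the paper.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the probability that a random positive outscores a
    random negative, counting ties as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical read-level labels (1 = modified) and predicted probabilities.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
auc = roc_auc(labels, scores)  # 8 of 9 positive/negative pairs are ordered correctly
```

A value of 0.5 corresponds to random guessing and 1.0 to perfect ranking, which puts the reported 0.95 to 0.98 in context.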

Experimental results: TandemMod significantly reduces the amount of training set data and model training time

In the experimental stage, the research team evaluated TandemMod by comparing it with classic machine learning algorithms, including XGBoost, support vector machine (SVM), and k-nearest neighbor (KNN). For m⁶A identification on the Curlcake test dataset, TandemMod outperformed the other algorithms with an accuracy of 0.90. Similarly, for the identification of m⁵C, TandemMod achieved an accuracy of 0.95. This comparison highlights the effectiveness of TandemMod in identifying modifications using DRS data.

TandemMod also outperformed tombo and xPore in identifying in vivo samples with different modification rates. This indicates that TandemMod can accurately predict samples with different modification rates without the need for negative control samples.

Comparison of TandemMod with other mainstream DRS-based algorithms for predicting m⁶A modification sites

In addition, the research team also compared the TandemMod m⁶A model with tombo, nanom6A and m6Anet, as shown in the figure above.

On the ELIGOS RRACH motif (R = A or G; H = A, C, or U), the ROC-AUCs of TandemMod, nanom6A, and tombo were 0.96, 0.88, and 0.52, respectively. On the ELIGOS DRACH motif (D = A, G, or U), the ROC-AUCs of TandemMod, m6Anet, and tombo were 0.95, 0.71, and 0.64, respectively.
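The RRACH and DRACH motifs follow the IUPAC nucleotide codes (R = A/G, H = A/C/U, D = A/G/U), so they translate directly into regular expressions. The sketch below locates the central A of each motif in a made-up RNA sequence; the helper name and the sequence are hypothetical.

```python
import re

# IUPAC codes: R = A/G, H = A/C/U, D = A/G/U.
RRACH = re.compile(r"[AG][AG]AC[ACU]")
DRACH = re.compile(r"[AGU][AG]AC[ACU]")

def motif_sites(seq, pattern):
    """Return 0-based positions of the central A of each motif match,
    including overlapping matches."""
    lookahead = re.compile(f"(?=({pattern.pattern}))")
    return [m.start() + 2 for m in lookahead.finditer(seq)]

seq = "CGGACUUUGACAAGACC"  # hypothetical RNA sequence
rrach_sites = motif_sites(seq, RRACH)
drach_sites = motif_sites(seq, DRACH)
```

Every RRACH match is also a DRACH match, since D allows U in the first position in addition to A and G.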

These results indicate that TandemMod, trained on the in vitro DRS dataset, provides the most accurate read-level predictions among existing tools.

The research team evaluated the classification performance, required training data, and computing resource usage of the TandemMod m⁵C model when transferred to m⁶A detection, and compared it with a TandemMod m⁶A model trained in the standard way. The results show that transfer learning can significantly reduce the amount of training data and the model training time while maintaining the same performance.

TandemMod was used to map the m⁶A, m⁵C and Ψ epitranscriptome of rice under high salt stress

Finally, the research team tested whether TandemMod can be extended to DRS data from new species, and further verified its reliability using human cell lines (2 modification enzyme knockout samples and 5 wild-type samples). As shown in the figure above, the team also used TandemMod to map the m⁶A, m⁵C, and Ψ epitranscriptome of rice seedlings under high salt stress, revealing the co-modification of m⁶A and m⁵C in mRNA and the changes in their modification rates under high salt conditions.

RNA modification opens new doors to explore life

Throughout the ages, people have never stopped exploring life. Since the RNA world hypothesis was proposed, the argument that RNA is the origin of life has become one of the most convincing answers. Since the first RNA modification was discovered in 1960, RNA modification has long been a priority for the scientific research community, and it has continued to attract a high level of attention in recent years.

In addition to Yu Xiang's research group, Yang Jun/Wang Hongxia's team, and ONT, all mentioned above, many other teams and companies are also conducting RNA modification research.

For example, in 2021, Professor Meng Jia's team from Xi'an Jiaotong-Liverpool University published an article titled "Attention-based multi-label neural networks for integrated prediction and interpretation of twelve widely occurring RNA modifications" in the journal Nature Communications.

Paper address: https://www.nature.com/articles/s41467-021-24313-3

The article describes MultiRM, a model based on a multi-label deep learning framework with an attention mechanism. It not only predicts the sites of 12 widely occurring RNA modifications simultaneously, but also extracts and analyzes the key sequences that drive its predictions, revealing a strong correlation between different types of RNA modifications. This helps researchers analyze and understand sequence-based RNA modification mechanisms more comprehensively.

Similarly, in a 2021 paper titled "Identification of differential RNA modifications from nanopore direct RNA sequencing with xPore" published in Nature Biotechnology, the research team used xPore to identify RNA modifications with high precision from direct RNA-seq data and to analyze differential modification and expression from a single high-throughput experiment.

Paper address: https://www.nature.com/articles/s41587-021-00949-w

These studies are helping to open the door to the RNA world, allowing us to further explore the "true meaning of life." Although many bottlenecks remain to be overcome, the continued efforts of these "pioneers" have already opened the door to RNA research wider.

References:
1. https://news.sjtu.edu.cn/jdzh/2
