Integrating Multi-Source Plant Transcriptome Data, Shandong University of Technology and Collaborators Build PlantLncBoost, a Model With Cross-Species lncRNA Prediction Accuracy of Up to 96%

In the field of plant science, the study of long noncoding RNA (lncRNA) is gradually becoming a focus. A paper on plant lncRNA research published in 2020 pointed out that lncRNA plays a key role in plant growth, development and environmental adaptation. For example, studies have found that some lncRNAs can regulate the flowering time of plants by interacting with proteins, thereby affecting the reproductive strategy of plants. This fine regulatory mechanism is of great significance for understanding how plants cope with environmental pressures such as climate change.
With the advancement of technology, more and more plant lncRNAs have been identified and characterized. However, the poor sequence conservation of lncRNAs across species poses a major challenge to the generalization ability of machine learning models. Take CPC and CPAT, two tools widely used in the early days: their cross-validation accuracy between Poaceae and Leguminosae plants dropped by 35%-40% relative to validation on closely related species, exposing the core problem that sequence features alone generalize poorly. Although boosting models (such as XGBoost and LightGBM) resist overfitting better when processing high-dimensional data, existing research still lacks systematic optimization of feature engineering. Scientists have realized that accurately predicting and analyzing lncRNAs in plants requires new methods that can adapt to this diversity. In recent years, researchers have proposed a series of strategies, including model selection, hyperparameter optimization, and feature extraction, that aim to improve the accuracy of lncRNA identification.
Recently, Shandong University of Technology, together with Beijing Forestry University, Guangdong Academy of Agricultural Sciences, University of Sao Paulo, Rosalind Franklin University of Medicine, Umeå University, and other research institutions, formed an interdisciplinary team and made a key technological breakthrough in plant lncRNA identification. The research focused on three core aspects: model selection, hyperparameter optimization, and feature engineering. For the first time, 219 new sequence descriptors based on mathematical theories such as the Fourier transform and Shannon entropy were incorporated into the feature space, and three core features with cross-species discriminative power were screened out from 1,662 candidate features using a random forest importance (RFI) strategy. The PlantLncBoost model built on this basis achieved an average prediction accuracy of 91.7% in cross-validation of 12 plant data sets from different families and genera, an improvement of 18.2% over existing mainstream tools, providing a systematic solution to the generalization problem of plant lncRNA identification.
The relevant research results have been published in the academic journal New Phytologist under the title "PlantLncBoost: key features for plant lncRNA identification and significant improvement in accuracy and generalization".

Paper address:
Dataset: Integration of multi-source heterogeneous plant transcriptome data and construction of feature system
In terms of data infrastructure construction, the research team integrated multi-source heterogeneous plant transcriptome data to support model development and verification.
The core training dataset covers lncRNA and mRNA sequences from nine angiosperms, including Cinnamomum camphora, Arabidopsis thaliana, and rice. A total of 24,152 lncRNA sequences were obtained from the GreeNC database, which applies strict quality control standards to ensure data reliability; an equal number of mRNA sequences came from the Phytozome v.13 database. In the preprocessing stage, the CD-HIT-EST algorithm was used to remove redundant transcripts with sequence similarity above 80%, and noisy sequences containing the ambiguous nucleotide "N" were discarded, yielding a balanced, clean training set for supervised learning.
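The ambiguous-nucleotide filter described above can be illustrated with a minimal sketch. This is not the team's pipeline: the function names `read_fasta` and `drop_ambiguous` and the toy records are hypothetical, the filter here is slightly stricter than the article states (it drops any character outside A, C, G, T, U, not only "N"), and the CD-HIT-EST redundancy step is not reproduced.

```python
from typing import Dict, Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def drop_ambiguous(records: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Keep only non-empty sequences made up entirely of A, C, G, T or U."""
    allowed = set("ACGTU")
    return {name: seq for name, seq in records if seq and set(seq) <= allowed}


# Hypothetical toy input: tx2 contains "N" and should be discarded.
example = """>tx1
ACGTACGT
acgt
>tx2
ACGNNACG
>tx3
GGGCCCAA
""".splitlines()

clean = drop_ambiguous(read_fasta(example))
print(sorted(clean))  # ['tx1', 'tx3']
```

In a real workflow this kind of filter would typically run before or after redundancy removal, on files rather than in-memory strings.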
In the model performance evaluation phase, the research team constructed two key test sets. The first is a comprehensive test set, which contains lncRNA sequences from 20 species, ranging from angiosperms such as corn and grape to algae such as Chlamydomonas reinhardtii and mosses such as Physcomitrella patens. Thirteen of these species were not included in the training set. The species coverage is wide, spanning many major branches of the plant kingdom. The second is a high-confidence, experimentally validated set. This data set integrates the contents of the EVLncRNAs and PlncDB databases. After deduplication, 358 unique lncRNAs from 20 plant species were retained, and the lncRNA sequences of 12 of these plants were not used in training or testing, which ensures a strict test of the model's cross-species generalization ability. These data have undergone systematic redundancy filtering and quality screening and cover multiple lineages, which both ensures the accuracy of the training data and builds a multi-level validation system.
In addition, to identify key features for training robust lncRNA models, the research team extracted a set of 1,662 features from the training dataset. This set covers traditional sequence-based metrics such as ORF coverage, k-mer frequency, and Fickett score, as well as new mathematical features designed to capture complex sequence patterns. Specifically, 1,433 features are basic sequence descriptors, 133 features come from numerical sequence mapping and the Fourier transform, 78 are complex network features, and 19 come from Shannon and Tsallis entropy. The comprehensiveness and diversity of these features provide a rich information basis for model training and optimization, and help improve the model's ability to identify plant lncRNAs.
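To make two of the feature families above more concrete, here is a minimal sketch of k-mer frequencies and the entropy values derived from them. The function names are hypothetical, and the Tsallis parameter `q = 2.0` is an assumption; the article does not say which k values or q values the team used.

```python
import math
from collections import Counter
from typing import Dict


def kmer_frequencies(seq: str, k: int) -> Dict[str, float]:
    """Relative frequency of each overlapping k-mer in seq."""
    total = len(seq) - k + 1
    if total <= 0:
        return {}
    counts = Counter(seq[i:i + k] for i in range(total))
    return {kmer: n / total for kmer, n in counts.items()}


def shannon_entropy(freqs: Dict[str, float]) -> float:
    """H = -sum(p * log2(p)) over the k-mer distribution, in bits."""
    return -sum(p * math.log2(p) for p in freqs.values() if p > 0)


def tsallis_entropy(freqs: Dict[str, float], q: float = 2.0) -> float:
    """S_q = (1 - sum(p**q)) / (q - 1); q is an assumed free parameter, q != 1."""
    return (1.0 - sum(p ** q for p in freqs.values())) / (q - 1.0)


seq = "ACGTACGTACGT"  # toy sequence: each nucleotide has frequency 0.25
f1 = kmer_frequencies(seq, 1)
print(round(shannon_entropy(f1), 3))  # 2.0
print(round(tsallis_entropy(f1), 3))  # 0.75
```

A uniform nucleotide distribution gives the maximum Shannon entropy of 2 bits for k = 1; strongly biased sequences score lower.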

PlantLncBoost algorithm: feature collaborative optimization to build an efficient plant lncRNA prediction model
In the process of constructing the plant long non-coding RNA (lncRNA) prediction model PlantLncBoost, the research team achieved efficient and accurate model development through algorithm performance comparison and feature engineering optimization.

During the algorithm selection phase, the research team conducted a comprehensive performance evaluation of three gradient boosting algorithms, CatBoost, XGBoost, and LightGBM, using five-fold cross-validation. The results show that CatBoost significantly outperforms the other two algorithms on key indicators such as accuracy (93.92%), sensitivity (99.83%) and F1-score (94.30%).
In addition, the hyperparameter optimization of CatBoost took only 14.45 minutes, a clear efficiency advantage over XGBoost's 164.18 minutes and LightGBM's 55.67 minutes. CatBoost also performs well in model building time and prediction speed, at 19.41 minutes and less than 10 seconds respectively, making it an ideal choice for processing large-scale genomic data.
In the feature selection stage, the research team used the random forest importance (RFI) strategy to screen core variables from 1,662 candidate features. The model constructed with this method achieved an accuracy of 94.21% and an F1 score of 94.56% in five-fold cross-validation, far exceeding models based on traditional filtering methods such as ANOVA (accuracy 75%-79%).
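For readers unfamiliar with the evaluation protocol, the five-fold cross-validation used in this comparison can be sketched as a stratified index splitter. This is an illustrative sketch only: the function name `stratified_kfold`, the seed, and the toy labels are hypothetical, and the actual model fitting (CatBoost, XGBoost or LightGBM) is left out.

```python
import random
from typing import List, Sequence, Tuple


def stratified_kfold(labels: Sequence[int], k: int = 5, seed: int = 0
                     ) -> List[Tuple[List[int], List[int]]]:
    """Return k (train_idx, test_idx) pairs, keeping class balance per fold."""
    rng = random.Random(seed)
    folds: List[List[int]] = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Deal the shuffled indices of this class round-robin into the folds.
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    splits = []
    for f in range(k):
        test = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        splits.append((train, test))
    return splits


# Toy labels: 10 lncRNAs (1) and 10 mRNAs (0).
labels = [1] * 10 + [0] * 10
splits = stratified_kfold(labels, k=5)
for train, test in splits:
    # Prints "16 4 2" for every fold: 16 training items, 4 test items,
    # 2 of which are lncRNAs.
    print(len(train), len(test), sum(labels[i] for i in test))
```

Each candidate algorithm would be trained on `train` and scored on `test` in every fold, and the five scores averaged.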

The research team further evaluated models built from the top 1 to 20 ranked features. As shown in the figure below, the RFI-3 model, which uses only ORF coverage, complex Fourier mean and atomic Fourier amplitude, reached peak performance, with accuracy and F1 score of 94.35% and 94.68% respectively. It is worth noting that when the number of features exceeds 3, model performance decreases significantly, which verifies the effectiveness of the "lightweight feature set".

ORF coverage, a classic biological feature, exploits the essential difference between lncRNA and mRNA in the proportion of the transcript occupied by open reading frames. For example, in Arabidopsis, the ORF coverage of lncRNA peaks at about 0.2, while that of mRNA is as high as 0.7. As shown in the figure below, this feature gives the model its basic discriminative ability. The complex Fourier mean and atomic Fourier amplitude are innovative mathematical features based on the Fourier transform, which capture the frequency-domain signals and structural characteristics of the sequence through complex-number encoding and atomic-number encoding. In the principal component analysis of model plants such as Arabidopsis thaliana, rice (Oryza sativa), and poplar (Populus trichocarpa), the first principal component, dominated by these two features, explained 97% of the variance; it complements the second principal component, contributed by ORF coverage, and together they form a discriminative space that is robust across species.
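The three features can be sketched in a few lines of Python. Everything specific here is an assumption rather than the paper's definition: the ORF rule (ATG to an in-frame stop codon, forward strand only), the complex encoding (A = 1+j, C = -1+j, G = -1-j, T = 1-j), the atomic-number encoding (A = 70, C = 58, G = 78, T = 66), and the reading of "amplitude" as the max-min range of the power spectrum. The function names are hypothetical, and the naive transform is only suitable for short toy sequences.

```python
import cmath
from typing import Dict, List, Sequence

STOP_CODONS = {"TAA", "TAG", "TGA"}
# Assumed nucleotide encodings (not specified in the article).
COMPLEX_MAP = {"A": 1 + 1j, "C": -1 + 1j, "G": -1 - 1j, "T": 1 - 1j}
ATOMIC_MAP = {"A": 70, "C": 58, "G": 78, "T": 66}


def longest_orf_length(seq: str) -> int:
    """Length in nt of the longest ATG...stop ORF in the three forward frames."""
    seq = seq.upper().replace("U", "T")
    best = 0
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                best = max(best, i + 3 - start)
                start = None
    return best


def orf_coverage(seq: str) -> float:
    """Fraction of the transcript covered by its longest ORF."""
    return longest_orf_length(seq) / len(seq) if seq else 0.0


def power_spectrum(values: Sequence[complex]) -> List[float]:
    """Naive O(n^2) discrete Fourier transform; returns |X[k]|^2 for each k."""
    n = len(values)
    spectrum = []
    for k in range(n):
        x_k = sum(v * cmath.exp(-2j * cmath.pi * k * t / n)
                  for t, v in enumerate(values))
        spectrum.append(abs(x_k) ** 2)
    return spectrum


def fourier_features(seq: str) -> Dict[str, float]:
    """Mean of the complex-mapped spectrum, max-min range of the atomic one."""
    seq = seq.upper().replace("U", "T")
    complex_spec = power_spectrum([COMPLEX_MAP[b] for b in seq])
    atomic_spec = power_spectrum([ATOMIC_MAP[b] for b in seq])
    return {
        "complex_fourier_mean": sum(complex_spec) / len(complex_spec),
        "atomic_fourier_amplitude": max(atomic_spec) - min(atomic_spec),
    }


toy = "CCATGAAATTTGGGTAGCC"  # hypothetical 19-nt transcript with a 15-nt ORF
print(round(orf_coverage(toy), 3))  # 0.789
print(fourier_features(toy))
```

For real transcripts, a fast Fourier transform from a numerical library would replace the naive loop.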

Finally, the PlantLncBoost model integrates the efficient learning ability of the CatBoost algorithm and the discriminative advantages of the three core features. In 10-fold cross-validation, the model surpassed existing mainstream tools such as LncFinder-plant and CPAT-plant on key indicators, with 94.35% accuracy and 99.96% sensitivity. PlantLncBoost thus forms an architecture of "lightweight feature set + high-performance algorithm", providing a solution for accurate cross-species identification of plant lncRNAs that combines biological interpretability with engineering practicality and meets the needs of large-scale genomic data analysis.
Multi-level experimental verification shows that PlantLncBoost has leading cross-species prediction performance
In the stage of model performance verification, the research team carefully designed a multi-level experimental system to meet the needs of plant lncRNA prediction in terms of cross-species generalization and reliability.
First, based on a test dataset containing 20 diverse plants (covering seed plants, mosses, and algae), the research team benchmarked PlantLncBoost against nine mainstream models, including LncFinder-plant and CPAT-plant. As shown in the figure below, PlantLncBoost led across core indicators such as sensitivity (98.42%), specificity (94.93%), and accuracy (96.63%), and its ROC curve was closer to the ideal prediction area (AUC reached 98.35%).
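The indicators quoted above all derive from the four cells of a confusion matrix, with lncRNA treated as the positive class. The sketch below shows the standard definitions; the class name `Confusion`, the function `metrics`, and the toy counts are hypothetical and are not taken from the paper's results.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Confusion:
    tp: int  # lncRNAs predicted as lncRNA
    fn: int  # lncRNAs predicted as mRNA
    tn: int  # mRNAs predicted as mRNA
    fp: int  # mRNAs predicted as lncRNA


def metrics(c: Confusion) -> Dict[str, float]:
    """Standard binary classification metrics from confusion counts."""
    sensitivity = c.tp / (c.tp + c.fn)
    specificity = c.tn / (c.tn + c.fp)
    precision = c.tp / (c.tp + c.fp)
    accuracy = (c.tp + c.tn) / (c.tp + c.fn + c.tn + c.fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "accuracy": accuracy,
        "f1": f1,
    }


# Toy counts for illustration only.
toy = Confusion(tp=95, fn=5, tn=90, fp=10)
result = metrics(toy)
for name, value in result.items():
    print(f"{name}: {value:.4f}")
```

High sensitivity with low specificity, the bottleneck mentioned for older tools, corresponds to few false negatives but many false positives in these terms.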

In particular, as shown in the following table, in most species PlantLncBoost achieves a sensitivity of nearly 100% while maintaining a specificity of over 90%, breaking through the "high sensitivity, low specificity" bottleneck of traditional models. In contrast, the accuracy of tools such as CPC2 and PLEK-plant is only between 80% and 90%, showing insufficient adaptability to data from complex plant lineages.

In a rigorous test on experimentally validated lncRNAs, the research team used a dataset containing 358 high-confidence transcripts. PlantLncBoost successfully identified 357 lncRNAs (detection rate 99.72%), tying for first place with LncFinder-plant. CPAT-plant followed closely with a detection rate of 99.16%. Retrospective analysis of the only unidentified transcript, a wheat lncRNA (TalncRNA18), found that its original annotation relied on an outdated ORF detection tool, while modern multi-feature models predicted a long ORF (encoding a polypeptide of 387 amino acids). This suggests that the transcript may be a misclassified coding RNA, which indirectly supports the rigor of PlantLncBoost's prediction.
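The detection rates above follow directly from the counts, as the short check below shows. The count of 355 for CPAT-plant is an inference from the rounded 99.16% figure, not a number stated in the text.

```python
total = 358  # high-confidence validated lncRNAs

# 357 is stated in the text; 355 is inferred from the reported 99.16%.
detected = {"PlantLncBoost": 357, "CPAT-plant (inferred)": 355}

rates = {tool: 100 * n / total for tool, n in detected.items()}
for tool, rate in rates.items():
    print(f"{tool}: {detected[tool]}/{total} = {rate:.2f}%")
# PlantLncBoost: 357/358 = 99.72%
# CPAT-plant (inferred): 355/358 = 99.16%
```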
Integrating multi-level experimental data, PlantLncBoost demonstrated excellent stability and accuracy in both cross-evolutionary group prediction and high-confidence validation sets, establishing its advanced position in the field of plant lncRNA identification.
Universities and enterprises collaborate to drive breakthroughs in plant lncRNA research and application
In fact, in the field of plant long non-coding RNA (lncRNA) research, university scientific research and corporate innovation are forming a trend of synergistic breakthroughs.
For example, the team led by Deng Xingwang and Zhu Danmeng from the School of Life Sciences at Peking University studied the plant-specific non-coding RNA HID1. They found a functionally redundant homologous gene, HIL1, 1.8 kb downstream of the HID1 locus in Arabidopsis, and went on to elucidate the molecular mechanism by which HID1 selectively represses transcription of its homolog HIL1. The results were published in Proceedings of the National Academy of Sciences of the United States of America.
A review published in Plant Physiology in 2024 by Soledad Traubenik's team at Paris-Saclay University in France reported, based on gene expression analysis and RNA sequencing, that the lncRNA COOLAIR regulates the expression of FLC, a key gene in the vernalization response of Arabidopsis thaliana, by changing its secondary structure. Its dynamic regulation under low-temperature stress provides a new target for breeding stress-resistant crops.
Paper link:
doi.org/10.1093/plphys/kiae034
Using single-cell RNA sequencing technology developed by Wolf Reik's team at the University of Cambridge, 237 cell-specifically expressed lncRNAs were found in Arabidopsis root tip cells. A plant single-cell lncRNA database (scPlantDB) was also established, which integrates data on 2.5 million cells from 17 species and provides an open-source platform for analyzing the spatiotemporal expression patterns of lncRNAs.
Paper link:
www.plantcell.org/cgi/doi/10.1105/tpc.18.00785
In terms of corporate innovation, the US agricultural technology giant Monsanto relies on its BioDirect™ technology platform, which combines genomics with natural compounds to develop new biologics. One example is a precision insecticide targeting the Colorado potato beetle, which can effectively control the pest while protecting beneficial insects.
Syngenta Group China has shortened the creation cycle of corn inbred lines from four years to one by combining doubled haploid technology with gene editing, and has used a high-throughput molecular detection platform to quickly integrate insect-resistant and herbicide-resistant traits. Of its 121 varieties approved in 2023, many lead the industry on multiple indicators.
The full-length lncRNA sequencing technology developed by the Chinese biotechnology company Benagen has broken through the detection bottleneck of the Nanopore platform. It can accurately analyze RNA alternative splicing and new transcripts, and has been applied to research on anthocyanin accumulation in apple peel and on neurotoxicity mechanisms in zebrafish, promoting the translation of basic research into agricultural breeding. These practices deeply integrate cutting-edge algorithms with biotechnology, providing intelligent solutions for crop improvement and ecological protection.
In the future, with the deepening of lncRNA research and the continuous advancement of technology, the basic research results of university scientific research teams and the innovative practices of enterprises are expected to further reveal the key role of plant lncRNA in growth, development and environmental adaptation, and transform these results into practical applications, promote the sustainable development of agricultural production, and inject new vitality into global agricultural production and ecological balance.
Reference articles:
1. https://news.pku.edu.cn/jxky/274-284106.htm
2. https://cn.agropages.com/News/printnew-6048.htm
3. https://www.syngentagroup.cn/shouyeguanli/special/240.html
4. https://www.benagen.com/html/shichangyuzhichi/gongsizixun/855.html