New Algorithm miniQuant Tackles Gene Isoform Quantification Challenges
Scientists have introduced miniQuant, a new algorithm for quantifying gene isoforms, a long-standing challenge in genomics. The work could improve the precision of RNA sequencing analyses.
MiniQuant-H, a variant of the miniQuant algorithm, outperformed existing tools in benchmark tests. On simulated data, it achieved an average median absolute relative deviation (MARD) of 0.1249, compared with 0.1505-0.3555 for existing short-read tools and 0.2515-0.9394 for long-read tools.
For evaluation on real datasets, the research team used data from the LRGASP consortium, which provided a comprehensive set of labeled cDNA sequences. For the ERCC control transcripts, which have no isoform variation, miniQuant-H was as accurate as short-read tools, whereas long-read tools often suffered from widespread sampling errors that can distort results. For the more complex SIRV control transcripts, long-read tools generally performed better, but miniQuant-H achieved the lowest average error.
The researchers then applied miniQuant to human embryonic stem cell (ESC) differentiation and found critical isoform switching events during the process. By analyzing the differentiation of ESCs into pharyngeal endoderm (PE) and primordial germ cell-like cells (PGCs), they identified 151 genes (ESC to PE) and 161 genes (ESC to PGC) involved in isoform switching.
These discoveries carry important biological implications. For instance, the MAT2B gene kept a stable overall expression level, yet the model detected noticeable changes among its isoforms, which could influence cell proliferation and embryonic development. Notably, many of the significant isoform switching events occur among highly expressed genes (82nd to 99th percentile, with TPM values from 30.60 to 1,077.09).
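To make the idea of isoform switching under stable gene-level expression concrete, here is a minimal Python sketch. It is not miniQuant's method: the function names, the 0.3 usage-shift threshold, and the TPM values for the hypothetical two-isoform gene are all illustrative assumptions.

```python
from typing import Dict


def isoform_usage(tpm: Dict[str, float]) -> Dict[str, float]:
    """Return the fraction of a gene's total expression carried by each isoform."""
    total = sum(tpm.values())
    if total == 0:
        return {iso: 0.0 for iso in tpm}
    return {iso: value / total for iso, value in tpm.items()}


def is_switch(before: Dict[str, float], after: Dict[str, float],
              min_shift: float = 0.3) -> bool:
    """Flag a switch when the dominant isoform changes and the new dominant
    isoform gains at least `min_shift` in usage (threshold is an assumption).
    Both inputs are assumed to be non-empty."""
    top_before = max(before, key=before.get)
    top_after = max(after, key=after.get)
    if top_before == top_after:
        return False
    usage_before = isoform_usage(before)
    usage_after = isoform_usage(after)
    shift = usage_after[top_after] - usage_before.get(top_after, 0.0)
    return shift >= min_shift


# Hypothetical TPM values for one gene in two conditions (not real data).
esc = {"isoform_A": 80.0, "isoform_B": 20.0}
differentiated = {"isoform_A": 25.0, "isoform_B": 75.0}

gene_total_before = sum(esc.values())             # gene-level expression before
gene_total_after = sum(differentiated.values())   # gene-level expression after
switched = is_switch(esc, differentiated)
```

In this toy case the gene-level totals are identical in both conditions, so a gene-level analysis would see no change, while `is_switch` returns `True` because the dominant isoform changes from `isoform_A` to `isoform_B`.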
Traditional methods, particularly those relying on long-read sequencing, may miss isoform switching events if expression levels are not adequately captured at typical sequencing depths (such as 60 million cDNA-ONT reads). In contrast, miniQuant-H, by integrating short-read data, can accurately detect isoform switching even among genes with high expression levels, overcoming the limitations of sampling errors.
Compared with existing integration methods, miniQuant has clear technical advantages. StringTieMix, another popular tool, uses a relatively simple strategy of aligning each short read to the longest supporting transcript, which limits its effectiveness on complex datasets. MiniQuant-H instead uses machine learning models and probabilistic functions to integrate the data more precisely and adaptively.
This research advances RNA sequencing technologies in two key ways. First, it provides a rigorous mathematical framework for assessing the reliability of gene isoform quantification methods. Second, it offers software tools that can adaptively select the optimal alignment strategy based on the specific characteristics of the dataset and gene structure.
Qian Jijun, a reviewer, noted, "This is the first time a strict scientific approach has been used to inform researchers about which genes are complex, which are simple, and when different sequencing technologies should be chosen. Previously, decisions were based on intuition and experience; now we provide a scientific standard." Another reviewer praised the study for "answering long-standing questions about isoform switching that have puzzled the community."
The miniQuant software is currently available on GitHub (https://github.com/Augroup/miniQuant), complete with pre-trained models for various sequencing platforms and depths, including cDNA-PacBio, cDNA-ONT, and dRNA-ONT.
As long-read sequencing technologies continue to improve in both cost and precision, this integration of long and short reads aims to offer more accurate and cost-effective solutions for transcriptome studies and to support deeper research into isoform function.