Selected for AAAI 2025: Zhejiang University proposes M2OST, a many-to-one regression model that accurately predicts gene expression from digital pathology images

Digital pathology images, in the form of whole-slide images (WSIs), present tissue sections digitally at high resolution, fully capturing cell morphology, structure, and spatial distribution. Compared with traditional glass slides, WSIs are not only easier to store and analyze, but also provide more intuitive tissue views at multiple scales, and are therefore increasingly used in pathological diagnosis and biomedical research. By analyzing these images, researchers can explore the intrinsic connection between cellular spatial organization and gene expression, revealing the complex transcriptional regulation mechanisms of multicellular systems.
In recent years, spatial transcriptomics (ST), a spatial extension of single-cell RNA sequencing, has become an important tool for studying the distribution, interactions, and molecular mechanisms of cell subtypes. However, because of the high cost of its equipment and reagents, ST still faces barriers to widespread practical adoption. In contrast, WSIs are more economical and easier to obtain in clinical practice. How to reconstruct ST maps from WSIs at low cost with the help of deep learning has therefore become a research direction attracting considerable attention.
Most existing methods treat ST prediction as a conventional regression problem and train on single-level image-label pairs. As a result, they model gene expression only at the highest magnification, wasting the multi-scale information inherent in WSIs.
To address this problem, Professor Lanfen Lin's research team at Zhejiang University in China, together with Zhijiang Lab in Hangzhou, Zhejiang, and Ritsumeikan University in Japan, jointly proposed M2OST, a many-to-one regression Transformer that jointly predicts gene expression from pathology images at different levels. By integrating the visual information of the sampled spots with the multi-scale features of WSIs, the model generates more accurate ST maps. In addition, the team decoupled the many-to-one multi-level feature extraction process into intra-layer and cross-layer feature extraction, greatly reducing computational cost and improving efficiency without affecting model performance.
The work was accepted at AAAI 2025 under the title "M2OST: Many-to-one Regression for Predicting Spatial Transcriptomics from Digital Pathology Images".
Research highlights:
* Conceptualizes ST prediction as a many-to-one modeling problem, jointly predicting the ST map using the multi-scale information and inter-spot features embedded in hierarchical WSIs
* Proposes M2OST, a many-to-one regression Transformer that is robust to input sets of varying sequence lengths
* Decouples the multi-scale feature extraction in M2OST into intra-layer and cross-layer feature extraction, significantly improving computational efficiency without affecting model performance
* Demonstrates the effectiveness of M2OST through comprehensive experiments on three public ST datasets

Paper address:
https://hyper.ai/cn/sota/papers/2409.15092
Follow the HyperAI WeChat public account and reply "M2OST" in the background to get the complete PDF
The open source project "awesome-ai4s" brings together more than 200 AI4S paper interpretations and provides massive data sets and tools:
https://github.com/hyperai/awesome-ai4s
Datasets: three public ST datasets used to demonstrate effectiveness
The research team used three public ST datasets to evaluate the performance of the proposed M2OST model:
* Human Breast Cancer dataset (HBC): contains 30,612 spots across 68 WSIs, each recording up to 26,949 different genes. Spots are 100 μm in diameter and arranged in a grid with a center-to-center spacing of 200 μm.
* HER2-positive breast tumor dataset (HER2+): consists of 36 pathology images and 13,594 spots, each containing 15,045 recorded gene expression values. Spots are 100 μm in diameter with a center-to-center spacing of 200 μm.
* Human cutaneous squamous cell carcinoma dataset (cSCC): includes 12 WSIs and 8,671 spots, each profiling 16,959 genes. All spots have a diameter of 110 μm and are arranged in a centered rectangular array with a center-to-center spacing of 150 μm.
M2OST model: many-to-one regression structure, multi-level pathological images jointly predict gene expression
In recent years, predicting spatial transcriptome (ST) profiles from whole-slide images (WSIs) has become a research hotspot in digital pathology. Early methods such as ST-Net and DeepSpaCE predict ST at the image patch level using convolutional neural networks (CNNs). The recently released bimodal embedding framework BLEEP introduces a contrastive learning strategy to align WSI patch features with ST spot embeddings, and uses a k-nearest-neighbor algorithm at inference time to alleviate batch effects.
With the rise of Transformer-based models, performance has surpassed traditional CNNs. HisToGene first introduced Transformers into gene expression prediction, achieving slide-level modeling and improving efficiency, though it remained limited by computing resources. Building on this, Hist2ST combines CNNs, Transformers, and graph neural networks to further capture long-range dependencies. However, its complex structure also increases the risk of overfitting.
Departing from the mainstream focus on correlations between sampled spots, iStar, a method based on hierarchical image feature extraction, holds that gene expression within a spot depends only on its corresponding image patch region. It uses a pre-trained HIPT for feature extraction and maps features to expression values through an MLP, with excellent performance. However, since the features are not learnable, there is still room for further optimization.
Inspired by this, the research team also adopted a patch-level solution in M2OST, predicting one spot at a time to ensure the independence and accuracy of each prediction. They further extended the ideas of iStar, designing a set of learnable multi-scale feature extraction and fusion modules. Through detailed modeling of local regions and cross-scale information integration, the model's predictive ability under complex tissue structures is improved.
As shown in the figure below, three patch sequences from different levels of the whole-slide image (WSI) pyramid are fed into the model to jointly predict the gene expression of the corresponding spot.
After receiving pathology image patches from three different levels, M2OST first feeds them into the Deformable Patch Embedding (DPE) layer to achieve adaptive token generation. DPE not only extracts basic pathology patches from each image, but also introduces larger patches in the higher-level images to capture broader contextual information.
At the same time, DPE generates fine-grained intra-spot tokens and coarse-grained surrounding tokens to strengthen the model's focus on the central region of the sampled spot, highlighting intra-spot features in the many-to-one modeling process and providing a more refined, structured feature representation for subsequent expression prediction.
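The multi-scale tokenization idea can be illustrated with a simplified NumPy sketch. Note this uses fixed (not deformable) patch grids, and the image size, level count, and patch sizes are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

def patch_tokens(img, patch):
    """Split an (H, W, C) image into flattened non-overlapping patch tokens."""
    H, W, C = img.shape
    h, w = H // patch, W // patch
    x = img[:h * patch, :w * patch].reshape(h, patch, w, patch, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(h * w, patch * patch * C)

def multi_scale_tokens(levels, base_patch=16):
    """Tokenize each pyramid level with a doubled patch size per level, so
    coarser levels contribute fewer tokens, each covering more context."""
    return [patch_tokens(img, base_patch * 2 ** i) for i, img in enumerate(levels)]

rng = np.random.default_rng(0)
levels = [rng.random((224, 224, 3)) for _ in range(3)]  # 3 pyramid levels
seqs = multi_scale_tokens(levels)
print([s.shape for s in seqs])  # [(196, 768), (49, 3072), (9, 12288)]
```

The three resulting sequences have different lengths, which is exactly why the later mixing modules must be robust to varying sequence lengths.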

M2OST model diagram

DPE used in M2OST
Then, a cls token is appended to each sequence, and learnable position encodings (the PE in the figure) are added. M2OST uses the intra-layer token mixing module (ITMM) to extract intra-layer features for each sequence. ITMM is built on the Vision Transformer architecture and introduces a random-mask self-attention mechanism (Rand Mask Self-Attn) to enhance the model's generalization ability during image modeling.

ITMM's network structure
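The random-mask self-attention idea can be shown with a minimal single-head NumPy sketch. The mask ratio, weight shapes, and the choice to never mask the diagonal are illustrative assumptions, not the paper's exact settings:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def rand_mask_self_attn(x, wq, wk, wv, mask_ratio=0.2, rng=None):
    """Single-head self-attention where a random fraction of attention logits
    is masked out, discouraging reliance on any fixed token-to-token pathway."""
    rng = np.random.default_rng(0) if rng is None else rng
    q, k, v = x @ wq, x @ wk, x @ wv
    logits = q @ k.T / np.sqrt(k.shape[-1])
    mask = rng.random(logits.shape) < mask_ratio
    np.fill_diagonal(mask, False)      # assumption: a token always sees itself
    logits[mask] = -np.inf             # masked entries get zero attention weight
    return softmax(logits) @ v

d = 8
rng = np.random.default_rng(1)
x = rng.standard_normal((5, d))                      # 5 tokens of width d
wq, wk, wv = (rng.standard_normal((d, d)) for _ in range(3))
out = rand_mask_self_attn(x, wq, wk, wv)
print(out.shape)  # (5, 8)
```

Because the mask is resampled each call, the attention pattern varies across training steps, which is the regularizing effect the article attributes to Rand Mask Self-Attn.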
After intra-layer feature extraction, M2OST introduces the cross-layer token mixing module (CTMM) to promote information interaction across the multi-level sequences. Because the multi-scale input sequences differ in length, CTMM uses a fully connected cross-layer attention mechanism, avoiding the information distortion of direct fusion while keeping the parameters of each scale branch relatively independent. To further enhance cross-scale information exchange at the channel level, M2OST adds a cross-layer channel mixing module (CCMM) after CTMM.
CCMM adopts a structural design that is insensitive to sequence length, dynamically integrating cross-scale contextual information based on attention similarity and learnable weights between layers, and outputting multi-layer sequences of the same shape. First, each layer's sequence is globally average-pooled to compress its information into a single token. The tokens from different layers are then combined, and cross-layer channel attention scores are computed with a squeeze-and-excitation mechanism. These scores are mapped back to their respective input sequences to complete the channel-level cross-scale exchange.

(a) The network structure of CTMM. (b) The network structure of CCMM.
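The pool-then-gate flow described for CCMM can be sketched as a squeeze-and-excitation-style channel gate. This is a simplified NumPy sketch assuming all levels share one channel dimension and a two-layer excitation MLP; the real CCMM's wiring may differ:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ccmm_like(seqs, w1, w2):
    """Pool each level's sequence to one token, compute joint channel gates
    across levels (squeeze-and-excitation style), and rescale each level."""
    pooled = np.stack([s.mean(axis=0) for s in seqs])  # (levels, C): length-agnostic
    z = pooled.reshape(-1)                             # squeeze across all levels
    gates = sigmoid(np.tanh(z @ w1) @ w2)              # excitation -> (levels * C,)
    gates = gates.reshape(len(seqs), -1)               # one gate vector per level
    return [s * g for s, g in zip(seqs, gates)]        # channel-wise rescaling

rng = np.random.default_rng(0)
C, hidden = 16, 8
seqs = [rng.standard_normal((n, C)) for n in (196, 49, 9)]  # unequal lengths
w1 = rng.standard_normal((3 * C, hidden)) * 0.1
w2 = rng.standard_normal((hidden, 3 * C)) * 0.1
out = ccmm_like(seqs, w1, w2)
print([o.shape for o in out])  # [(196, 16), (49, 16), (9, 16)]
```

Because only the pooled tokens interact, the gate computation is independent of each sequence's length, matching the length-insensitive design the article describes.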
This multi-scale feature modeling process as a whole constitutes the encoder module of M2OST and is iterated N times throughout the network, gradually enriching the multi-level, highly expressive image representations required for spatial transcriptome prediction. Finally, the three cls tokens are concatenated and fed into a linear regression head to predict the ST spot.
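The final step, concatenating the three cls tokens and applying a linear head, reduces to a single matrix product. The embedding dimension and gene count below are placeholders, not the paper's values:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_genes = 16, 250                                      # placeholder dimensions
cls_tokens = [rng.standard_normal(d) for _ in range(3)]   # one cls token per level
w = rng.standard_normal((3 * d, n_genes)) * 0.01          # linear regression head
b = np.zeros(n_genes)
pred = np.concatenate(cls_tokens) @ w + b                 # expression vector for one spot
print(pred.shape)  # (250,)
```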
Experimental results: Multi-dimensional evaluation proves the effectiveness of the M2OST model
The research team comprehensively compared M2OST against a range of mainstream methods on multiple datasets; the results are shown in the table below. M2OST achieves superior performance with fewer parameters and fewer FLOPs. Compared with ST-Net, M2OST has 0.40M fewer parameters and 0.63G fewer FLOPs, while its Pearson correlation coefficient (PCC) on the HER2+ and cSCC datasets improves by 1.16% and 1.13%, respectively.

Comparative experimental results of M2OST and other methods
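PCC, the metric used throughout these comparisons, measures the linear correlation between predicted and measured expression and can be computed per gene as:

```python
import numpy as np

def pcc(pred, target):
    """Pearson correlation coefficient between predicted and measured values."""
    p, t = pred - pred.mean(), target - target.mean()
    return float(p @ t / (np.linalg.norm(p) * np.linalg.norm(t)))

t = np.array([1.0, 2.0, 3.0, 4.0])
print(round(pcc(2 * t + 1, t), 6))  # 1.0 for a perfect linear relationship
```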
Comparison of M2OST with one-to-one multi-scale methods:
The research team also compared M2OST with common one-to-one multi-scale methods such as CrossViT and HIPT/iStar. Compared with standard ViT, CrossViT showed stronger ST regression capabilities, confirming the significant advantages of integrating multi-scale information in this task. However, CrossViT has certain limitations in modeling intra-point information, and its overall performance is still inferior to M2OST.
In addition, iStar performs well in ST prediction accuracy, demonstrating the effectiveness of the HIPT architecture in extracting multi-scale features from WSIs. However, to save computational cost, iStar uses fixed HIPT weights to generate WSI features for ST prediction, limiting its feature extraction capability. In terms of inference efficiency, iStar's patch-by-patch, scale-by-scale extraction process also significantly increases processing time. The results show that under the same GPU memory limit, M2OST's inference is about 100 times faster than iStar's while still performing better, fully demonstrating the potential of end-to-end training in ST regression tasks and the effectiveness of the M2OST model.
Comparison of image patch-level and slide-level ST methods:
Experimental results show that slide-level methods generally underperform patch-level methods on the three datasets. Although Hist2ST outperforms HisToGene, its large parameter count and high FLOPs make this improvement less meaningful. Compared with the baseline patch-level method ST-Net, Hist2ST's PCC on the three datasets is lower by 2.78%, 2.99%, and 2.66%, respectively. This suggests that a spot's gene expression is mainly related to its corresponding tissue region, and that introducing inter-spot correlation does not significantly improve prediction accuracy. Even so, slide-level methods remain more efficient for generating complete ST maps, and optimized network designs may yet achieve competitive regression accuracy.
Visual analysis:

(a) Visualization of the spatial transcriptome (ST) map after principal component analysis (PCA). (b) Visualization of the spatial distribution of the DDX5 gene.
The research team analyzed and compared the visualization results of different methods in ST map prediction. The results show that slide-level methods (such as HisToGene and Hist2ST) usually generate smoother maps, while patch-level methods retain clearer local structural features.
Notably, M2OST consistently generates more accurate ST maps, showing higher prediction accuracy. The research team further visualized the expression of the key gene DDX5, which plays a key role in the proliferation and tumorigenesis of non-small cell lung cancer cells by activating the β-catenin signaling pathway. M2OST performed best in predicting this gene, outperforming all comparison methods and verifying the model's accuracy at the single-gene expression level.
Breakthrough progress and cross-domain applications of spatial transcriptomics
Spatial transcriptomics, as a bridge connecting cell function and tissue structure, can analyze the gene expression patterns of individual cells in time and space, and reveal the spatial location and biological characteristics of cell populations, which is driving biomedical research to a deeper level.
In this area, by April 2025, a research team from the Institute of Medical Science at The University of Tokyo, Japan, had developed STAIG, a deep learning framework for spatial transcriptomics analysis based on image-aided graph contrastive learning. The framework integrates gene expression, spatial data, and histological images without requiring data alignment, overcoming the limitations of traditional methods in eliminating batch effects and identifying spatial regions. STAIG extracts features from hematoxylin-and-eosin (H&E) stained images through self-supervised learning, without relying on large-scale datasets for pre-training.
During the training process, STAIG dynamically adjusted the graph structure and selectively excluded irrelevant negative samples through histological images, reducing bias. Finally, STAIG successfully achieved batch integration by analyzing the commonalities of gene expression through local comparison, avoiding the complexity of manual coordinate alignment and significantly reducing batch effects. Studies have shown that STAIG performs well on multiple datasets, especially in spatial region recognition, and can reveal detailed genetic and spatial information in the tumor microenvironment, showing its important potential for parsing the complexity of spatial biology.
At the same time, Wei Wu's research team at the Lingang Laboratory in Shanghai, China, has also made significant progress in spatial transcriptomics. In November 2024, the team published a research paper titled "MCGAE: unraveling tumor invasion through integrated multimodal spatial transcriptomics" in the journal Briefings in Bioinformatics. The study developed MCGAE (Multi-View Contrastive Graph Autoencoder), a deep learning framework designed specifically for spatial transcriptome data analysis. The framework creates multimodal, multi-view biological representations by combining gene expression, spatial coordinates, and image features, significantly improving the accuracy of spatial domain recognition. On tumor data, it demonstrates precise identification of tumor regions and in-depth analysis of molecular regulatory characteristics, providing a powerful tool for complex tissue research, disease mechanism studies, and drug target discovery.
Original paper:
https://academic.oup.com/bib/article-pdf/26/1/bbae608/60786360/bbae608.pdf
In addition, spatial transcriptomics also shows great potential in agriculture. In April 2025, a research team from the Institute of Modern Agriculture of Peking University published a study entitled "Spatiotemporal transcriptomics reveals key gene regulation for grain yield and quality in wheat" in Genome Biology. Using spatial transcriptome technology, the team constructed a high-resolution gene expression map of wheat grains across different stages of early development, revealing the gene expression characteristics of grain development. The study not only provides important theoretical support for molecular design breeding and yield improvement in wheat, but also offers a strong guarantee for global food security.
Original paper:
https://www.biorxiv.org/content/biorxiv/early/2024/06/03/2024.06.02.596756.full.pdf
In the future, with the continuous accumulation of spatial transcriptome data and the continuous optimization of digital pathology image acquisition methods, the deep integration of artificial intelligence and omics technology will promote the widespread application of deep learning models in various tissue types and disease backgrounds, and help the development of precision medicine. The introduction of M2OST has laid a solid foundation for building an efficient, low-cost, and high-precision spatial gene expression prediction framework, heralding the far-reaching prospects of artificial intelligence and multi-omics data fusion analysis in the biomedical field.