HyperAI

IJCAI 2025 | Validated on 7 Datasets: scSiameseClu Achieves SOTA Performance in Unsupervised Single-Cell Clustering Tasks


In the past, life science research often focused on the "population" level. Traditional bulk RNA-Seq yields the average gene expression of the cells in a population, which means the characteristics of rare cells may be obscured. Today, researchers increasingly hope to hear the voices of "single" cells.

Single-cell RNA sequencing (scRNA-seq) is a revolutionary technology that can capture the gene expression profile of a single cell amid the hustle and bustle of a cell population, thereby revealing hidden complex features. Making sense of this complex information requires a key step: cell clustering. Grouping cells based on similarities in gene expression is a challenging process.

scRNA-seq data is characterized by high noise, high sparsity, and high dimensionality. Even the most effective current methods based on graph neural networks (GNNs) suffer from "insufficient graph construction" and "representation collapse." As shown in the figure below, the cell embeddings produced by both the deep learning-based scNAME and the GNN-based scGNN become increasingly similar to one another, indicating varying degrees of representation collapse. In other words, a clustering tool that can truly preserve cellular differences is still lacking.


Similarity distribution of cell embeddings between scNAME and scGNN on the same dataset

To address this dilemma, research teams from the Chinese Academy of Sciences, Northeast Agricultural University, the University of Macau, and Jilin University jointly proposed a novel Siamese clustering framework, scSiameseClu, for interpreting single-cell RNA-seq data. It aims to capture and refine complex intercellular information while simultaneously learning discriminative and robust representations at the gene and cell levels. The framework integrates three key modules: dual augmentation, Siamese fusion, and optimal transport clustering. Through this design, scSiameseClu can effectively alleviate representation collapse, separate cell populations more clearly, and provide a powerful tool for the analysis of scRNA-seq data.

The related research, titled "scSiameseClu: A Siamese Clustering Framework for Interpreting single-cell RNA Sequencing Data," was selected for IJCAI 2025, and a preprint has been published on arXiv.

Research highlights:

* scSiameseClu captures complex information from gene expression and cell graphs to learn discriminative and robust cell embeddings, improving clustering results and downstream tasks;

* It introduces three key modules that together form a complete "augmentation-fusion-clustering" framework;

* scSiameseClu outperforms SOTA methods in clustering and other biological tasks.

Paper address:

https://go.hyper.ai/00BhP


More AI frontier papers:
https://hyper.ai/papers

7 real-world datasets covering multiple tissues and species

To comprehensively evaluate scSiameseClu, the research team conducted experiments on seven real scRNA-seq datasets. During preprocessing, genes expressed in fewer than three cells were filtered out; the data were then normalized and log-transformed (logTPM), and highly variable genes were selected based on predefined mean and dispersion thresholds. The preprocessed datasets comprise three mouse samples and four human samples, covering a variety of tissues (e.g., retina, lung, liver, kidney, and pancreas), with varying numbers of genes and cell types and varying sparsity. The following image provides an overview of the datasets used.
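The preprocessing steps described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' pipeline: the function name `preprocess`, the scaling factor `target_sum`, the threshold values, and the use of variance divided by mean as the dispersion measure are all assumptions made for the example.

```python
import numpy as np

def preprocess(counts, min_cells=3, target_sum=1e4,
               min_mean=0.0125, max_mean=3.0, min_disp=0.5):
    """Filter, normalize, log-transform, and select variable genes.

    counts: (cells x genes) array of non-negative raw counts.
    All default values are illustrative assumptions.
    """
    counts = np.asarray(counts, dtype=float)

    # 1. Drop genes expressed in fewer than `min_cells` cells.
    keep = (counts > 0).sum(axis=0) >= min_cells
    x = counts[:, keep]

    # 2. Scale every cell to the same total count, then log-transform.
    lib = x.sum(axis=1, keepdims=True)
    lib[lib == 0] = 1.0  # guard against empty cells
    x = np.log1p(x / lib * target_sum)

    # 3. Keep genes whose mean and dispersion fall inside the thresholds.
    mean = x.mean(axis=0)
    disp = x.var(axis=0) / np.maximum(mean, 1e-12)
    hvg = (mean >= min_mean) & (mean <= max_mean) & (disp >= min_disp)

    # Return the filtered matrix and the original column index of each gene.
    return x[:, hvg], np.flatnonzero(keep)[hvg]
```

In practice the thresholds would be tuned per dataset; dedicated single-cell toolkits also bin genes by mean expression before comparing dispersions, which this sketch skips.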


Overview of 7 scRNA-seq datasets

The three modules of the Siamese clustering framework

scSiameseClu is a Siamese clustering framework built on augmented graph autoencoders. It consists of three modules:

(i) Dual Augmentation Module;

(ii) Siamese Fusion Module;

(iii) Optimal Transport Clustering for Self-Supervised Learning.


scSiameseClu Architecture Overview

Dual Augmentation Module

The dual augmentation module combines gene expression augmentation with cell graph augmentation. To improve the model's robustness to noise and its generalization across datasets, the research team added Gaussian noise to the gene expression data to simulate natural fluctuations in expression, enhancing robustness at the gene level. At the cell level, they applied edge perturbation and graph diffusion to generate two augmented adjacency matrices, which view the cell graph from different but complementary perspectives and enable the model to capture diverse interactions between cells.
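The three augmentations mentioned above can be illustrated as follows. This is a hedged sketch: the function names, the noise level `sigma`, the `drop_rate`, the choice to only drop (not add) edges, and the use of personalized PageRank as the diffusion kernel with `alpha=0.2` are assumptions for illustration, not details confirmed by the article.

```python
import numpy as np

def add_gaussian_noise(x, sigma=0.1, rng=None):
    """Gene-level augmentation: add zero-mean Gaussian noise to expression."""
    rng = np.random.default_rng() if rng is None else rng
    return x + rng.normal(0.0, sigma, size=x.shape)

def perturb_edges(adj, drop_rate=0.1, rng=None):
    """Cell-graph augmentation: randomly drop a fraction of undirected edges.

    adj is assumed to be a symmetric adjacency matrix with a zero diagonal.
    """
    rng = np.random.default_rng() if rng is None else rng
    upper = np.triu(adj, k=1)              # each undirected edge once
    rows, cols = np.nonzero(upper)
    keep = rng.random(rows.size) >= drop_rate
    out = np.zeros_like(upper)
    out[rows[keep], cols[keep]] = upper[rows[keep], cols[keep]]
    return out + out.T                     # restore symmetry

def ppr_diffusion(adj, alpha=0.2):
    """Cell-graph augmentation: personalized PageRank diffusion,
    S = alpha * (I - (1 - alpha) * D^-1/2 (A + I) D^-1/2)^-1.
    """
    n = adj.shape[0]
    a = adj + np.eye(n)                    # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a.sum(axis=1))
    a_norm = d_inv_sqrt[:, None] * a * d_inv_sqrt[None, :]
    return alpha * np.linalg.inv(np.eye(n) - (1.0 - alpha) * a_norm)
```

Edge perturbation changes the graph locally, while diffusion produces a dense matrix that reflects multi-hop neighborhoods, which is why the two views are complementary. The dense inverse is only practical for small graphs.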

Siamese Fusion Module

The Siamese Fusion Module (SFM) is the core innovation of scSiameseClu. It combines two strategies: "cross-correlation refinement" and "adaptive information fusion." The former uses autoencoders to process the augmented gene expression matrices and cell graph matrices separately, then aligns and fuses them in the latent space. The latter integrates relationships between cells through embedding aggregation, self-correlation learning, and dynamic recombination. This design filters out redundant information while retaining discriminative features in the latent space, so the model learns robust and meaningful representations, improving clustering performance while avoiding representation collapse.
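To give a feel for how a cross-correlation objective can align two views while discouraging redundant features, here is a generic redundancy-reduction loss in the style of Barlow Twins. Whether scSiameseClu uses this exact form is an assumption; the function name and `off_diag_weight` are hypothetical.

```python
import numpy as np

def cross_correlation_loss(z1, z2, off_diag_weight=5e-3, eps=1e-8):
    """Push the cross-correlation matrix of two embeddings toward identity.

    z1, z2: (cells x latent_dim) embeddings of two augmented views.
    Diagonal terms near 1 align the views; off-diagonal terms near 0
    decorrelate latent dimensions, which counters representation collapse.
    """
    # Standardize each latent dimension across cells.
    z1 = (z1 - z1.mean(axis=0)) / (z1.std(axis=0) + eps)
    z2 = (z2 - z2.mean(axis=0)) / (z2.std(axis=0) + eps)

    # (latent_dim x latent_dim) cross-correlation matrix.
    c = z1.T @ z2 / z1.shape[0]

    diag = np.diag(c)
    on_diag = ((1.0 - diag) ** 2).sum()
    off_diag = (c ** 2).sum() - (diag ** 2).sum()
    return on_diag + off_diag_weight * off_diag
```

Two identical views yield a loss close to zero, while unrelated views are penalized mainly through the diagonal term.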

In addition, the framework introduces a propagation regularization term that uses the Jensen-Shannon divergence to keep the original embedding consistent with the embedding obtained after graph propagation, alleviating the over-smoothing problem of graph neural networks while maintaining information flow.
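A propagation regularizer of this kind could look like the sketch below. The Jensen-Shannon divergence itself is standard; how embeddings are converted to probability distributions (here, a softmax over latent dimensions) and the single propagation step `adj_norm @ z` are assumptions made for the example.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax, shifted for numerical stability."""
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def js_divergence(p, q, eps=1e-12):
    """Row-wise Jensen-Shannon divergence between two sets of distributions."""
    m = 0.5 * (p + q)
    kl_pm = np.sum(p * (np.log(p + eps) - np.log(m + eps)), axis=1)
    kl_qm = np.sum(q * (np.log(q + eps) - np.log(m + eps)), axis=1)
    return 0.5 * (kl_pm + kl_qm)

def propagation_regularizer(z, adj_norm):
    """Mean JS divergence between embeddings before and after propagation.

    z: (cells x latent_dim) embeddings.
    adj_norm: (cells x cells) normalized adjacency used for one propagation step.
    """
    z_prop = adj_norm @ z
    return js_divergence(softmax(z), softmax(z_prop)).mean()
```

The value is zero when propagation leaves the embeddings unchanged and is bounded above by ln 2, which makes it easy to weight against other loss terms.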

Optimal Transport Clustering

The research team first used a Student's t-distribution to calculate the similarity between cells and cluster centers, and then used the Sinkhorn algorithm to align and correct the predicted distribution. This keeps the cluster distribution balanced and avoids the collapse problem.
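The two steps can be sketched as follows: a Student's t kernel gives soft cell-to-cluster assignments, and Sinkhorn iterations rescale them so that every cluster receives roughly the same total mass. The function names, the degrees of freedom `nu`, the uniform cluster prior, and the iteration count are assumptions for illustration.

```python
import numpy as np

def student_t_assignments(z, centers, nu=1.0):
    """Soft assignments q[i, k] from a Student's t kernel on squared distance."""
    d2 = ((z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    q = (1.0 + d2 / nu) ** (-(nu + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)

def sinkhorn_balance(q, n_iters=100, eps=1e-12):
    """Alternately normalize columns and rows so clusters are equally sized.

    Returns a matrix whose rows sum to 1 (one distribution per cell) and
    whose columns sum to approximately n_cells / n_clusters.
    """
    n, k = q.shape
    p = q.copy()
    for _ in range(n_iters):
        p = p / (p.sum(axis=0, keepdims=True) + eps) / k  # columns -> 1/k
        p = p / (p.sum(axis=1, keepdims=True) + eps) / n  # rows -> 1/n
    return p * n
```

The balanced matrix can then serve as a target distribution for the raw assignments. Because no cluster is allowed to absorb all the mass, the trivial solution in which every cell lands in one cluster is ruled out.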

Multiple validations of the superior performance of scSiameseClu

The clustering performance of scSiameseClu is supported by extensive experimental validation. First, the research team carried out a comprehensive comparison with mainstream methods, selecting nine state-of-the-art baseline models that span traditional clustering methods, methods based on deep neural networks, and methods based on graph neural networks. Using the seven real-world datasets described above, the team evaluated performance with three widely recognized clustering metrics: Accuracy (ACC), Normalized Mutual Information (NMI), and Adjusted Rand Index (ARI).
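For readers unfamiliar with these metrics, the sketch below shows how ACC and ARI are commonly computed for clustering. ACC requires matching predicted cluster IDs to true labels first, done here with the Hungarian algorithm via SciPy's `linear_sum_assignment`. The helper names are hypothetical, and NMI is omitted for brevity (scikit-learn provides it as `normalized_mutual_info_score`).

```python
from math import comb

import numpy as np
from scipy.optimize import linear_sum_assignment

def contingency(y_true, y_pred):
    """Count matrix: rows are true classes, columns are predicted clusters."""
    _, t = np.unique(np.asarray(y_true).ravel(), return_inverse=True)
    _, p = np.unique(np.asarray(y_pred).ravel(), return_inverse=True)
    table = np.zeros((t.max() + 1, p.max() + 1), dtype=np.int64)
    np.add.at(table, (t, p), 1)
    return table

def clustering_accuracy(y_true, y_pred):
    """ACC: best one-to-one mapping between clusters and classes."""
    table = contingency(y_true, y_pred)
    rows, cols = linear_sum_assignment(table, maximize=True)
    return table[rows, cols].sum() / table.sum()

def adjusted_rand_index(y_true, y_pred):
    """ARI: pair-counting agreement, corrected for chance."""
    table = contingency(y_true, y_pred)
    n = int(table.sum())
    sum_ij = sum(comb(int(v), 2) for v in table.ravel())
    sum_a = sum(comb(int(v), 2) for v in table.sum(axis=1))
    sum_b = sum(comb(int(v), 2) for v in table.sum(axis=0))
    expected = sum_a * sum_b / comb(n, 2)
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:  # degenerate case, e.g. a single cluster
        return 1.0
    return (sum_ij - expected) / (max_index - expected)
```

Both metrics are invariant to how cluster IDs are numbered, which matters because an unsupervised method has no way to know the label names.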

The results show that scSiameseClu has a clear advantage on all three metrics. Not only are its overall scores higher, but its performance is also stable across datasets. As the visual comparison on the human liver cell dataset shows, scSiameseClu produces clusters with clearer boundaries and better separation than the other baseline models, and it effectively distinguishes different cell types.


Visualization results of scSiameseClu and four typical benchmark methods on human hepatocytes

Secondly, in downstream experiments, the research team performed cell type annotation. In a human pancreas dataset, they used the Seurat tool to identify differentially expressed genes and marker genes. They then compared the top 50 marker genes identified by scSiameseClu and other methods with the gold standard. The results showed that most clusters had a similarity exceeding 90%, accurately mapping to known cell types. The model also identified the marker genes for each cluster.

Further cell classification experiments also showed that scSiameseClu outperformed the baseline models on multiple metrics, such as accuracy and F1 score, confirming its advantages in revealing cell heterogeneity and discriminating cell types.


Overlap of differentially expressed genes with gold standard cell types


Classification performance comparison

Finally, in ablation experiments on the Shekhar mouse retinal cell dataset, the research team removed key components of scSiameseClu (the SFM loss, ZINB loss, and OTC loss) and compared each variant with the full model to assess the contribution of each module. Every component improved performance, and removing any one of them led to a drop. Breaking the SFM down further, removing cell correlation refinement, latent correlation refinement, propagation regularization, or the reconstruction loss each degraded performance. The full scSiameseClu, with all components included, performed best, demonstrating that it effectively integrates gene-level and cell-level information.


Ablation experiments on the Shekhar mouse retinal cell dataset

Towards a new era of flourishing computational biology

From the perspective of computational biology, scSiameseClu tackles a long-standing problem in biology, the analysis of cellular heterogeneity, by drawing on computer science methods such as dual augmentation, Siamese fusion, and optimal transport clustering. It is not only a new type of clustering tool but also one of many emerging attempts at deep integration of computational methods and the life sciences. With the rapid development of artificial intelligence algorithms and biology, new results continue to emerge.

Professor Zhang Yang's team at the National University of Singapore has proposed a high-precision deep learning-based RNA structure prediction framework, DRfold2. DRfold2 integrates a pretrained RNA composite language model (RCLM) and a denoised structure module for end-to-end RNA structure prediction. Their findings have been published on the bioRxiv preprint platform under the title "Ab initio RNA structure prediction with composite language model and denoised end-to-end learning."
Paper address:
https://www.biorxiv.org/content/10.1101/2025.03.05.641632v1

A research team from Baylor College of Medicine in the United States has proposed a deep learning-based framework for predicting protein post-translational modifications, called DeepMVP. DeepMVP integrates the high-quality PTMAtlas dataset to accurately predict PTM sites and alterations caused by missense variants. Their findings were published in Nature Methods under the title "DeepMVP: deep learning models trained on high-quality data accurately predict PTM sites and variant-induced alterations."
Paper address:
https://www.nature.com/articles/s41592-025-02797-x