HyperAI

Following Evo 2, Arc Institute Released the First Virtual Cell Model STATE, With Training Data Involving 70 Different Cell Lines


As we all know, the human body is composed of many different types of cells: immune cells trigger inflammatory responses to resist pathogens during infection; stem cells have the potential to differentiate into a variety of tissue types; and cancer cells proliferate abnormally by evading growth-regulatory signals. Although these cells vary greatly in function and morphology, they all carry almost the same genome. The uniqueness of a cell therefore comes not from differences in the DNA sequence itself, but from how it regulates and uses the same genetic information.

In other words, a cell's characteristics arise from differences in gene expression, and its gene expression pattern determines not only which cell type it belongs to but also which state it is in. Therefore, by observing changes in gene expression, it is possible to determine whether a cell is healthy, inflamed, or cancerous. On this basis, by measuring the transcriptional responses of cells under chemical or genetic interventions, AI models can learn and predict the transition trajectories of cells between states, and even predict the effects of interventions they have never seen.

This type of "virtual cell" model is expected to significantly improve the efficiency of drug development. Since every drug is a targeted intervention, such a model can help scientists screen treatment options more accurately and steer cell states from disease back to health, while reducing side effects and improving clinical success rates at the source.

Today, the virtual cell model has become a reality. The non-profit Arc Institute, which released the Evo series of models, has joined forces with research teams from UC Berkeley, Stanford, and other universities to launch the virtual cell model STATE, which can predict the responses of stem cells, cancer cells, and immune cells to drugs, cytokines, or genetic interventions. Its training data covers observational data from nearly 170 million cells and interventional data from more than 100 million cells, spanning 70 different cell lines, and integrates data from the Arc Virtual Cell Atlas. Experimental results show that STATE significantly outperforms current mainstream methods in predicting transcriptome changes after intervention: on the Tahoe-100M dataset, it improved by 50% in distinguishing intervention effects, and its accuracy in identifying differentially expressed genes is twice that of existing models.

Currently, STATE has been open-sourced for non-commercial use, and the results have been published as a preprint titled "Predicting cellular responses to perturbation across diverse contexts with State."

Paper link: https://go.hyper.ai/1UFMr

Open-source project address: https://github.com/ArcInstitute/state

Fusion of two data sources covering 70 cell lines

STATE consists of two core modules: STATE Transition (ST) and STATE Embedding (SE). Built on this multi-scale framework, it integrates two types of data sources: observational data from 167 million cells were used to train the SE model, and data from more than 100 million perturbed cells were used to train the ST model.

The single-cell intervention datasets used for ST model training are detailed in the figure below. All datasets were filtered to retain only measurements of 19,790 human protein-coding Ensembl genes and uniformly normalized to a total UMI depth of 10,000.

Dataset used for ST model training
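The uniform UMI-depth normalization described above can be sketched in a few lines of numpy. This is a minimal illustration of the preprocessing idea, not code from the STATE repository; the helper name `normalize_umi` and the dense-matrix assumption are ours:

```python
import numpy as np

def normalize_umi(counts, target_depth=10_000):
    """Rescale each cell's raw UMI counts so they sum to target_depth.

    counts: (n_cells, n_genes) array of raw UMI counts.
    Returns a float array in which every row sums to target_depth.
    """
    counts = np.asarray(counts, dtype=float)
    depths = counts.sum(axis=1, keepdims=True)
    depths[depths == 0] = 1.0            # guard against empty cells
    return counts / depths * target_depth

# Two toy cells sequenced at very different depths
raw = np.array([[100, 300, 600],
                [ 10,  20,  70]])
norm = normalize_umi(raw)
```

After this step, every cell's profile is on the same depth scale, so expression values are comparable across cells and datasets.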

Among them:

* Tahoe-100M dataset: A giga-scale single-cell atlas containing 100 million transcriptome profiles, measuring the effects of 1,100 small-molecule perturbations across 50 cancer cell lines.

Tahoe-100M Dataset download address:

https://go.hyper.ai/Wqbl0

* Parse-PBMC dataset: An open-source single-cell RNA sequencing (scRNA-seq) dataset released by the biotechnology company Parse Biosciences, which profiled 10 million cells across 1,152 samples in a single experiment. It is mainly used to study the gene expression characteristics of human peripheral blood mononuclear cells (PBMCs) under different conditions.

Parse-PBMC Dataset download address:

https://go.hyper.ai/20nBg

The SE model was trained on 167 million human cells; the data sources are shown in the figure below. To avoid data leakage in the context-generalization benchmark, the researchers used only 20 of the Tahoe cell lines during training and held out the other 5 cell lines as a test set.

Dataset used for SE model training

Among them, scBaseCount is a large-scale human single-cell expression dataset recently released by the Arc Institute, containing more than 40 million human cells covering multiple organs, cell lines, and pathological states. When processing scBaseCount data in this study, the researchers kept only cells with at least 1,000 genes showing non-zero expression and at least 2,000 UMIs per cell.
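The quality-control filter above (≥1,000 detected genes and ≥2,000 UMIs per cell) can be expressed as a simple mask over a count matrix. A minimal sketch under our own naming (`qc_filter` is illustrative; the toy example uses smaller thresholds so it fits in two rows):

```python
import numpy as np

def qc_filter(counts, min_genes=1_000, min_umis=2_000):
    """Boolean mask keeping cells with at least `min_genes` non-zero
    genes and at least `min_umis` total UMI counts.

    counts: (n_cells, n_genes) raw UMI count matrix.
    """
    counts = np.asarray(counts)
    genes_detected = (counts > 0).sum(axis=1)
    total_umis = counts.sum(axis=1)
    return (genes_detected >= min_genes) & (total_umis >= min_umis)

# Toy matrix with relaxed thresholds for demonstration
toy = np.array([[3, 0, 4],    # 2 detected genes, 7 UMIs
                [1, 0, 0]])   # 1 detected gene, 1 UMI
mask = qc_filter(toy, min_genes=2, min_umis=5)
```

Cells failing either criterion are dropped before training, which removes low-quality droplets that would otherwise add noise to the learned embeddings.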

STATE, a multi-scale framework based on the Transformer

STATE can predict the downstream transcriptome response of cells after perturbation, including changes in gene expression, differentially expressed genes, and the strength of the overall perturbation effect. The architecture integrates multiple levels of information:

* Molecular level: embeddings represent the characteristics of individual genes across experiments and species;

* Cellular level: embeddings represent the transcriptomic state of individual cells, either the cell's log-normalized expression profile or embeddings generated by the STATE Embedding (SE) model;

* Population level: the STATE Transition (ST) model learns the effects of perturbations on ensembles of cells.

Among them, ST is based on the Transformer architecture and uses self-attention to model how an intervention transforms a set of cells; each cell can be represented by its raw gene expression or by an embedding vector. The SE module is pre-trained on a variety of heterogeneous datasets: it learns expression differences between cells and produces expressive vectors that are robust to technical noise yet highly sensitive to intervention responses. With self-attention, the ST model can flexibly capture complex biological variability without explicit distributional assumptions.
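Because self-attention over a set of cells has no positional encoding, reordering the cells simply reorders the outputs, which is exactly the property needed when modeling an unordered cell population. A toy numpy sketch of a single attention layer illustrates this (toy dimensions and weight matrices of our own choosing; this is not the actual ST implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def set_self_attention(cells, Wq, Wk, Wv):
    """One scaled dot-product self-attention layer over a set of cells.

    cells: (n_cells, d) matrix; each row is one cell's representation.
    Every output row depends on the whole set, but not on its order.
    """
    Q, K, V = cells @ Wq, cells @ Wk, cells @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])
    return softmax(scores, axis=-1) @ V

rng = np.random.default_rng(0)
cells = rng.normal(size=(5, 8))          # a "set" of 5 toy cells
Wq, Wk, Wv = rng.normal(size=(3, 8, 8))  # random projection weights
out = set_self_attention(cells, Wq, Wk, Wv)

# Shuffling the cells shuffles the outputs in exactly the same way
perm = np.array([2, 0, 4, 1, 3])
out_shuffled = set_self_attention(cells[perm], Wq, Wk, Wv)
```

This permutation equivariance is what lets a set-based Transformer treat a cell population as a distribution sample rather than a sequence.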

As shown in the figure below, STATE is a multi-scale machine learning framework that operates at the gene, single-cell, and cell-population levels. The ST model learns perturbation effects by training on collections of perturbed and unperturbed cell populations grouped by shared covariates (such as perturbation type, cell context, and batch). It can operate directly on gene expression profiles or on compact cell representations from the SE model, which learns information-rich embeddings from large-scale observational data.

At the same time, this multi-scale architecture enables ST to effectively simulate Perturb-seq experiments in silico and support subsequent analysis tasks such as expression estimation, differential expression analysis, and perturbation effect size estimation.

STATE basic framework

The ST model framework is shown in the figure below. Its input is a set of unperturbed cells together with perturbation labels, and its output is the corresponding set of perturbed cells. When cells are represented by gene expression profiles, ST directly predicts the transcriptome at the single-cell level; when STATE embeddings are used as input, ST first predicts output embeddings and then decodes them into transcriptomes through a multi-layer perceptron (MLP).

ST model framework

The training objective of the ST model is to minimize the maximum mean discrepancy (MMD) loss between the predicted perturbed-cell transcriptomes and the observed data. Although ST learns perturbation effects at the level of cell distributions, it still predicts a post-perturbation expression profile for each individual cell. This property is crucial for capturing the distributional structure of the perturbed population.
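MMD compares two samples of cells as distributions: it is (numerically) zero when the predicted and observed populations match and grows as they diverge. A minimal numpy sketch of the squared MMD with an RBF kernel (kernel choice and bandwidth are illustrative assumptions, not the paper's exact configuration):

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    """RBF kernel matrix between the rows of X and the rows of Y."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def mmd2(X, Y, gamma=1.0):
    """Squared maximum mean discrepancy between two cell populations.

    X: predicted perturbed profiles, Y: observed perturbed profiles,
    both shaped (n_cells, n_genes). Zero iff the two samples are
    indistinguishable under the kernel's feature map.
    """
    return (rbf_kernel(X, X, gamma).mean()
            + rbf_kernel(Y, Y, gamma).mean()
            - 2 * rbf_kernel(X, Y, gamma).mean())

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 4))   # 30 toy cells, 4 toy genes
```

Training against a distribution-level loss like this, rather than a per-cell mean-squared error, is what allows the model to preserve heterogeneity within the perturbed population instead of collapsing it to an average cell.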

Experiments show that, up to a certain threshold, increasing the size of the cell set significantly reduces validation loss, clearly outperforming single-cell modeling. In addition, removing the self-attention mechanism degrades performance, as shown in Figure D below, further illustrating the value of flexible, set-based self-attention for modeling cell heterogeneity in perturbation responses.

Effect of cell ensemble size on perturbation prediction performance

The SE model complements the ST model: it aims to learn cellular embeddings that optimally capture cell-type-specific gene expression patterns, as shown in Figure A below. SE is particularly useful when data are scarce or experimental noise is high, and when combined with ST it provides a smoother cell-state space. Because these embeddings are learned from large observational single-cell databases, they indirectly leverage rich observational data to improve the accuracy of perturbation-response prediction, especially when intervention data are limited.

SE Model Architecture

Architecturally, the SE encoder is a dense bidirectional Transformer trained to predict log-normalized gene expression. The SE decoder is a smaller, specially designed multi-layer perceptron (MLP) that predicts gene expression from a combination of the learned cell embedding and target gene embeddings. This asymmetric design lets the model learn cell states that are biologically grounded and generalize well.
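The decoder side of this design can be sketched as a small MLP that scores one cell against every target gene by pairing the cell embedding with each gene embedding. This is a toy illustration under our own assumptions (dimensions, weights, and the name `se_decode` are all invented for the sketch; the real SE decoder is larger):

```python
import numpy as np

def se_decode(cell_emb, gene_embs, W1, b1, w2, b2):
    """Predict one expression value per gene for a single cell.

    Each MLP input row is the concatenation [cell_emb, gene_emb];
    the output is a (n_genes,) vector of predicted log-normalized
    expression values.
    """
    n_genes = gene_embs.shape[0]
    pairs = np.concatenate(
        [np.tile(cell_emb, (n_genes, 1)), gene_embs], axis=1)
    hidden = np.maximum(pairs @ W1 + b1, 0.0)   # one ReLU hidden layer
    return hidden @ w2 + b2                     # scalar readout per gene

# Toy dimensions: 16-d cell embedding, 8-d gene embeddings, 3 genes
rng = np.random.default_rng(0)
cell = rng.normal(size=16)
genes = rng.normal(size=(3, 8))
W1, b1 = rng.normal(size=(24, 32)), np.zeros(32)
w2, b2 = rng.normal(size=32), 0.0
pred = se_decode(cell, genes, W1, b1, w2, b2)
```

Keeping the decoder small pushes the representational burden onto the encoder, which is one way to encourage the cell embedding itself, rather than the readout, to carry the biological signal.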

STATE leads the way in predicting perturbation effects across cellular environments

The researchers compared STATE with a variety of baselines, including three machine learning models: CPA, scVI, and scGPT, evaluating them on chemical, signaling, and genetic perturbation datasets. The evaluation framework covers the three core outputs of Perturb-seq experiments: gene expression counts, differential expression statistics, and the overall magnitude of the perturbation effect.

To comprehensively evaluate the model along these dimensions, the researchers developed a set of evaluation metrics, Cell-Eval, as shown in Figure C below. These metrics are both expressive and biologically interpretable, providing complementary evaluation perspectives. For example, the degree of overlap in DEGs helps link predictions to specific pathways and give them biological meaning, while the perturbation discrimination score more sensitively captures fine-grained changes in perturbation effects and reflects how closely the predictions match the actual perturbation effects.

Cell-Eval, a virtual cell modeling and evaluation framework

In the evaluation of perturbation experiments, the model must effectively distinguish the effects of different perturbations. To this end, the researchers used a perturbation discrimination score adapted from Wu et al. (2024), which ranks perturbation effects by comparing the similarity between the predicted post-perturbation expression profile and the actual perturbation results. The results show that STATE improved on this metric by 54% on the Tahoe dataset and 29% on the PBMC dataset, as shown in Figure D below.
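The core of such a discrimination metric is a ranking question: among all observed perturbation profiles, how close to the top does the true one sit when sorted by distance to the prediction? A simplified sketch (the published metric normalizes ranks across perturbations and may use a different distance; the helper `discrimination_rank` is our illustrative name):

```python
import numpy as np

def discrimination_rank(pred, observed_means, true_idx):
    """Rank of the true perturbation (0 = best) when all observed mean
    profiles are sorted by Euclidean distance to the prediction.

    pred: (n_genes,) predicted pseudobulk profile for one perturbation.
    observed_means: (n_perts, n_genes) observed mean profiles.
    """
    dists = np.linalg.norm(observed_means - pred, axis=1)
    return int((dists < dists[true_idx]).sum())

# Three observed perturbation pseudobulk profiles over two toy genes
means = np.array([[ 0.0, 0.0],
                  [ 5.0, 5.0],
                  [10.0, 0.0]])
pred = np.array([4.8, 5.1])   # model's prediction for perturbation 1
```

A model that merely predicts an average response ranks poorly here, because its predictions sit equidistant from many perturbations; a model capturing perturbation-specific signal pulls the true profile to rank 0.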

To directly assess the accuracy of gene expression count predictions, the researchers calculated the Pearson correlation between observed perturbation-induced expression changes and the model's predictions. STATE outperforms the baselines by 63% on the Tahoe dataset and 47% on the PBMC dataset, as shown in Figure E below.
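Note that the correlation is computed on the expression *changes* relative to control, not on raw expression, so a model cannot score well just by reproducing the baseline profile. A minimal sketch of this delta-correlation idea (the function name `delta_pearson` is ours):

```python
import numpy as np

def delta_pearson(ctrl_mean, pred_pert_mean, obs_pert_mean):
    """Pearson r between predicted and observed perturbation-induced
    expression changes, both measured relative to the control mean."""
    pred_delta = pred_pert_mean - ctrl_mean
    obs_delta = obs_pert_mean - ctrl_mean
    return float(np.corrcoef(pred_delta, obs_delta)[0, 1])

# Toy control and observed perturbed pseudobulk over four genes
ctrl = np.array([1.0, 2.0, 3.0, 4.0])
obs = np.array([2.0, 1.0, 3.0, 6.0])
```

A perfect prediction of the perturbed profile yields r = 1, while a prediction that inverts every change yields r = -1.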

To evaluate the p-values of differentially expressed (DE) genes predicted by the model, the researchers first computed the truly significant DE genes from the experimentally observed perturbation data at an FDR threshold of 0.05. The model-predicted p-values were then compared against these true significance calls, and precision-recall (PR) curves were plotted. Measured by the area under the PR curve (AUPRC), STATE consistently outperforms all baseline models on all datasets, as shown in Figure F below.
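The AUPRC computation amounts to ranking genes by predicted significance and asking how well the truly DE genes concentrate at the top of that ranking. A compact numpy sketch using the average-precision form of the area (an approximation of the exact trapezoidal PR area; `auprc` is our illustrative helper):

```python
import numpy as np

def auprc(true_de_mask, pred_pvalues):
    """Average precision for recovering truly DE genes, ranking genes
    by predicted p-value (smallest, i.e. most significant, first).

    true_de_mask: boolean (n_genes,), DE calls from observed data at
                  an FDR threshold (0.05 in the paper's setup).
    pred_pvalues: (n_genes,) p-values predicted by the model.
    """
    order = np.argsort(pred_pvalues)          # most significant first
    hits = np.asarray(true_de_mask)[order]
    tp = np.cumsum(hits)
    precision = tp / np.arange(1, len(hits) + 1)
    # Average precision: mean precision at each recovered true positive
    return float((precision * hits).sum() / max(hits.sum(), 1))
```

A model whose smallest p-values land exactly on the true DE genes scores 1.0; mixing false positives into the top of the ranking drags the score down.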

Performance comparison of STATE and baseline models on multiple evaluation tasks

On the gene perturbation dataset, STATE's AUPRC is 184% higher than that of the second-ranked model. This result is clearly visible in each model's PR curves across the different datasets, as shown in Figure G below.

Predict differentially expressed genes under each perturbation

It is also worth mentioning that STATE supports zero-shot prediction: even in a new cellular context for which no perturbation data was seen during training, it can accurately predict perturbation effects, as shown in the figure below.

STATE enables zero-shot prediction

Furthermore, to demonstrate practical application scenarios, the researchers evaluated STATE's ability to detect cell-type-specific differential expression, focusing on five cell lines in the Tahoe-100M dataset, as shown in Figure A below.

STATE can detect cell-type-specific gene expression changes caused by perturbations

The researchers identified perturbation conditions with strong cell-type specificity by comparing the overlap in differentially expressed genes and the Spearman correlation of log fold changes between STATE's predictions and those of two baselines. Performance above the "perturbation mean" baseline indicates that STATE has learned perturbation effects specific to a given cell type; performance above the "context mean" baseline indicates that the model distinguishes the effects of different perturbations within the same cell line rather than simply predicting each cell line's average expression.

Across all perturbation conditions, STATE more accurately recovered the true ordering of log fold changes of differentially expressed genes, significantly outperforming both the context-mean and perturbation-mean baselines, as shown in Figure B above.

In summary, the research team states that STATE is the first machine learning model to surpass simple baselines (such as mean or linear models) on almost all metrics and multiple datasets in the cellular context generalization task. In addition, the embeddings produced by the cell embedding model SE enable more effective zero-shot prediction of perturbation effects in new cellular contexts.

Arc Institute, a non-profit research organization, released a series of important results

The Arc Institute was officially established in 2021 by Patrick Collison, co-founder and CEO of the well-known payments company Stripe; Silvana Konermann, assistant professor of biochemistry at Stanford University; and Patrick D. Hsu, assistant professor of bioengineering at UC Berkeley.

Patrick Collison announced his engagement to Silvana Konermann in June 2019

At its founding, Arc raised $650 million in funding, of which $500 million came from Collison. The arrangement, jokingly described as "a billionaire paying so his scientist wife never has to worry about research funding," drew wide discussion in the field that year. The funds provide up to 8 years of support for 15 core researchers and their teams of research assistants; these researchers face no restrictions and can study complex human diseases in any form they choose.

This non-profit research institute, focused on cutting-edge research and innovation in the life sciences, is named after island arcs: archipelagos formed by uplift at the junctions of tectonic plates. The founders hoped that the Arc Institute would likewise bring together researchers from many different institutions and disciplines to create something new. And so it has: since its founding, the Arc Institute has produced a series of blockbuster results in the life sciences.

In February this year, the Arc Institute released the Arc Virtual Cell Atlas, initially integrating data from over 300 million cells. The atlas debuted with two foundational datasets, open-sourced on February 25, 2025: Tahoe-100M, a new open-source perturbation dataset created by Tahoe, containing 100 million cells and 60,000 drug-cell interactions across 50 cancer cell lines; and scBaseCount, the first single-cell RNA sequencing dataset assembled from public data, for which Arc used AI agents to mine, process, and standardize more than 200 million cell observations representing 21 species from public repositories.

In April of the same year, 10x Genomics and Ultima Genomics began collaborating with the Arc Institute to accelerate development of the Arc Virtual Cell Atlas, enhancing its collection of computable single-cell measurement data with 10x and Ultima technologies. 10x's Chromium Flex technology generates perturbation data at scale at the lowest cost per cell and the highest resolution to help build biological AI models; Ultima's UG 100 sequencing system and Solaris chemistry generate more data at lower cost, and UG 100 Solaris Boost (a new high-throughput operating mode currently in early access) will further increase data output.

Looking back, in November 2024, the Arc Institute, in collaboration with Stanford University and UC Berkeley, developed Evo, the first biological foundation model trained at scale on DNA. It uses a deep learning architecture to parse the coding information in DNA and can predict and design at the DNA, RNA, and protein levels, covering biological scales from nucleotides to whole genomes. Its core value lies in deciphering patterns of DNA evolution: the research team used it to design EvoCas9-1, a functional CRISPR system unknown in nature, succeeding after testing only 11 designs. Its sequence is only 73% similar to the commonly used Cas9, yet it is quite active. The team also successfully designed IS200/IS605 transposons, a class of mobile genetic elements. Evo is regarded as a foundation model for generative AI in biology.

In February 2025, building on this foundation, the Arc Institute collaborated with NVIDIA to develop Evo 2, the largest biological AI model to date. Evo 2 is trained on 9.3 trillion nucleotides from over 100,000 species; it can identify gene sequence patterns, accurately predict human pathogenic mutations, and design new genomes comparable in length to bacterial genomes. Technically, it was trained on more than 2,000 H100 GPUs on the NVIDIA DGX Cloud platform using the StripedHyena 2 architecture; it processes 30 times more data than its predecessor Evo 1 and can analyze sequences of millions of nucleotides at once.

In addition, in July 2024, Arc's Goodarzi laboratory, in collaboration with the Gilbert laboratory, discovered that mRNA can actively control its own expression using newly discovered "RNA switches." In June 2024, Arc's Hsu laboratory discovered the first natural RNA-guided recombinase, which can programmably insert, excise, or invert any two DNA sequences of interest. This is the first DNA recombinase that uses non-coding RNA for sequence-specific targeting and donor DNA selection; because this bridge RNA is programmable, users can specify any desired genomic target sequence and any donor DNA molecule to be inserted.

References:
1.https://arcinstitute.org/news
2.https://mp.weixin.qq.com/s/THQTl2HI0mAXXwyykkQI5w