Arc Institute Launches Virtual Cell Challenge: Predict Gene Silencing Effects Using AI Models
Arc Institute recently launched the Virtual Cell Challenge, which asks participants to build a machine learning model that predicts the effect of silencing a gene in a partially unseen cell type. This task, known as context generalization, could change how drug discovery is done by simulating cellular changes without physical experimentation.

Goal
The challenge's primary objective is to train a model (most likely a neural network) that accurately simulates how a cell behaves when a specific gene is silenced using CRISPR. Such a model could significantly speed up the testing of drug candidates by reducing reliance on costly, time-consuming lab experiments.

Training Data
Participants work with a dataset of approximately 300,000 single-cell RNA sequencing profiles. Each profile is a transcriptome: a sparse row vector holding the raw count of RNA molecules for each gene. Of the 220,000 cells in the training set, about 38,000 are control cells, in which no gene has been silenced. These controls are essential for establishing a baseline and for separating the true effects of gene silencing from other sources of variability.

Modeling the Challenge
A key difficulty is that a cell's state cannot be measured both before and after gene silencing, because the measurement destroys the cell. Perturbed cells can therefore only be compared with different, unperturbed cells, and the natural heterogeneity of the cell population adds noise to that comparison. To account for this, the model uses control cells as a reference point. Formally, the observed gene expression in perturbed cells can be modeled as:

\[ \hat{X}_p \sim \hat{T}_p(\mathcal{D}_{\text{basal}}) + H(\mathcal{D}_{\text{basal}}) + \varepsilon, \quad \varepsilon \sim P_\varepsilon \]

where:
- \(\hat{X}_p\) is the predicted transcriptome of the perturbed cell.
- \(\hat{T}_p\) is the state transition function.
- \(\mathcal{D}_{\text{basal}}\) is the dataset of control cells.
- \(H\) accounts for the heterogeneity in the control cells.
- \(\varepsilon\) is noise.
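
The decomposition above can be illustrated with a toy simulation. This is only a sketch under loud assumptions: three hypothetical genes, Gaussian heterogeneity and noise instead of real count data, and a perturbation that acts as a constant shift. It shows why control cells matter: perturbed and control cells are different cells, so the effect is only recoverable at the population level, here as a difference of mean profiles.

```python
import random


def simulate(n_cells=2000, seed=0):
    """Toy version of X_p ~ T_p(D_basal) + H(D_basal) + eps (all values hypothetical)."""
    rng = random.Random(seed)
    baseline = [5.0, 3.0, 8.0]   # assumed mean expression of three genes
    shift = [-2.0, 0.0, 1.0]     # assumed effect of the perturbation (stands in for T_p)

    def basal_cell():
        # Baseline plus cell-to-cell heterogeneity (stands in for H).
        return [b + rng.gauss(0.0, 1.0) for b in baseline]

    controls = [basal_cell() for _ in range(n_cells)]
    # Perturbed cells are *different* cells: a fresh basal state, the shift, and noise.
    perturbed = [
        [x + s + rng.gauss(0.0, 0.5) for x, s in zip(basal_cell(), shift)]
        for _ in range(n_cells)
    ]
    return controls, perturbed, shift


def mean_profile(cells):
    """Average expression per gene across a list of cells."""
    n = len(cells)
    return [sum(column) / n for column in zip(*cells)]


controls, perturbed, true_shift = simulate()
estimated_shift = [
    p - c for p, c in zip(mean_profile(perturbed), mean_profile(controls))
]
print("true shift:     ", true_shift)
print("estimated shift:", [round(v, 2) for v in estimated_shift])
```

No single perturbed cell can be paired with its own unperturbed state, yet the difference of population means lands close to the assumed shift. A learned transition function plays the same role in a far less restrictive way.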

STATE: The Baseline Model
Before launching the challenge, Arc Institute released STATE, a pair of transformer-based models that serve as a strong baseline: the State Embedding Model (SE) and the State Transition Model (ST).

State Embedding Model (SE)
SE creates meaningful cell embeddings to improve generalization across cell types. It starts by building gene embeddings. Each gene is represented by the amino acid sequences of its protein isoforms, which are fed into ESM2, a 15B-parameter protein language model from Facebook AI Research. ESM2 produces an embedding for each amino acid, and these are mean pooled into a protein isoform embedding. A gene embedding \(\mathbf{g}_j\) is obtained by mean pooling the gene's protein isoform embeddings and is then transformed by a learned encoder:

\[ \tilde{\mathbf{g}}_j = \text{SiLU}(\text{LayerNorm}(\mathbf{g}_j \mathbf{W}_g + \mathbf{b}_g)) \]

To create a cell embedding, the top 2048 genes ranked by log fold expression level are selected, and their embeddings are combined into a "cell sentence" (so \(L = 2048\)):

\[ \tilde{\mathbf{c}}^{(i)} = \left[\mathbf{z}_{\text{cls}}, \tilde{\mathbf{g}}_1^{(i)}, \tilde{\mathbf{g}}_2^{(i)}, \ldots, \tilde{\mathbf{g}}_L^{(i)}, \mathbf{z}_{\text{ds}}\right] \]

The [CLS] token is used as the cell embedding, and the [DS] token helps disentangle dataset-specific effects. Positional embeddings are used to incorporate the magnitude of each gene's expression.

State Transition Model (ST)
ST, the "cell simulator," takes either the control cell transcriptome or a cell embedding from SE, together with a one-hot encoded vector representing the gene perturbation, and outputs the perturbed transcriptome. The model is a relatively simple transformer with a Llama backbone. The control set tensor and the perturbation tensor are each passed through an independent 4-layer MLP encoder with GELU activations. The model is trained with a Maximum Mean Discrepancy (MMD) loss, which minimizes the difference between the predicted and actual transcriptome distributions.
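
To make the MMD objective concrete, here is a minimal sketch of the (biased) squared-MMD estimate between two small sets of expression vectors. The Gaussian RBF kernel, the `bandwidth` parameter, and the toy vectors are assumptions chosen for illustration. The text above does not say which kernel STATE uses, and a real training loop would compute this on tensors with automatic differentiation.

```python
import math


def rbf_kernel(x, y, bandwidth=1.0):
    """Gaussian RBF kernel between two vectors (kernel choice is an assumption)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of squared MMD between samples xs and ys."""

    def mean_kernel(a, b):
        # Average kernel value over all pairs drawn from a and b.
        total = sum(rbf_kernel(u, v, bandwidth) for u in a for v in b)
        return total / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


# Hypothetical 2-gene expression vectors for three cells each.
observed = [[1.0, 0.0], [1.2, 0.1], [0.9, -0.1]]
good_prediction = [[1.1, 0.0], [1.0, 0.1], [1.0, -0.1]]
bad_prediction = [[3.0, 2.0], [3.2, 2.1], [2.9, 1.9]]

print("good:", mmd_squared(good_prediction, observed))
print("bad: ", mmd_squared(bad_prediction, observed))
```

The score compares whole sets of cells, not matched pairs, which suits a setting where the same cell is never measured both before and after perturbation.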

Evaluations
Submissions are evaluated on three metrics: Perturbation Discrimination, Differential Expression, and Mean Absolute Error.

Perturbation Discrimination
Perturbation Discrimination assesses the model's ability to distinguish the effects of different perturbations. It computes the Manhattan distance between the predicted perturbed transcriptome and all perturbed transcriptomes, then ranks the prediction's ground truth among them. The raw score lies between 0 and 1, where 0 indicates a perfect match:

\[ \text{PDisc}_t = \frac{r_t}{T} \]

Here \(r_t\) is the rank of the ground truth among all perturbed transcriptomes (0 when the ground truth is the closest), and \(T\) is the total number of transcriptomes. The per-perturbation scores are averaged into \(\text{PDisc}\), which is then normalized so that 1 is a perfect score and a random prediction scores roughly 0:

\[ \text{PDiscNorm} = 1 - 2\,\text{PDisc} \]

Differential Expression
Differential Expression evaluates the model's accuracy in identifying genes significantly affected by a perturbation. For each gene, a p-value is calculated using a Wilcoxon rank-sum test, and the Benjamini-Hochberg procedure is applied to adjust for multiple comparisons. The score is the size of the overlap between the predicted and ground truth sets of differentially expressed genes, normalized by the size of the ground truth set:

\[ DE_p = \frac{\left| G_{p,\text{pred}} \cap G_{p,\text{true}} \right|}{n_{p,\text{true}}} \]

If the predicted set is larger than the ground truth set, only the most confidently predicted genes are kept, and the overlap is computed in the same way.

Getting Started
For those interested in participating, Arc has provided a Colab notebook that walks through training the STATE model.
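
Before training anything, it can help to have a local version of the scoring rules. The following is a minimal sketch of the Perturbation Discrimination and Differential Expression overlap scores described under Evaluations, not the official scorer. It makes several assumptions: each perturbation is summarized by one expression vector, the rank counts how many other ground truth vectors are strictly closer to the prediction than its own ground truth, and an oversized predicted gene list is truncated to the size of the ground truth set. The gene and perturbation names are hypothetical.

```python
def manhattan(a, b):
    """Manhattan (L1) distance between two expression vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def perturbation_discrimination(predicted, truth):
    """Normalized score 1 - 2 * mean(r_t / T); 1.0 means every prediction is
    closest to its own ground truth. Both arguments map perturbation -> vector."""
    total = len(truth)
    scores = []
    for pert, pred_vec in predicted.items():
        d_true = manhattan(pred_vec, truth[pert])
        # Assumed rank: number of other ground truth vectors strictly closer.
        rank = sum(
            1
            for other, true_vec in truth.items()
            if other != pert and manhattan(pred_vec, true_vec) < d_true
        )
        scores.append(rank / total)
    pdisc = sum(scores) / len(scores)
    return 1.0 - 2.0 * pdisc


def de_overlap(predicted_ranked, true_genes):
    """Overlap of predicted and true DE gene sets divided by the true set size.
    predicted_ranked must be sorted with the most confident genes first."""
    true_set = set(true_genes)
    if not true_set:
        raise ValueError("ground truth set of DE genes is empty")
    # Assumption: keep at most len(true_set) of the most confident predictions.
    kept = predicted_ranked[: len(true_set)]
    return len(set(kept) & true_set) / len(true_set)


truth = {
    "GENE_A": [0.0, 5.0, 1.0],
    "GENE_B": [4.0, 0.0, 1.0],
    "GENE_C": [2.0, 2.0, 6.0],
}
predicted = {
    "GENE_A": [0.5, 4.5, 1.0],
    "GENE_B": [3.5, 0.5, 1.0],
    "GENE_C": [2.0, 2.5, 5.0],
}
print(perturbation_discrimination(predicted, truth))
print(de_overlap(["g1", "g2", "g3", "g4"], {"g1", "g3", "g9"}))
```

In this example every prediction is closest to its own ground truth, so the discrimination score is at its maximum, while the overlap score counts two of the three true genes among the three most confident predictions.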
Additionally, STATE models will soon be available in the Hugging Face Transformers library, allowing participants to start easily from pre-trained models:

```python
import torch
from transformers import StateEmbeddingModel

model_name = "arcinstitute/SE-600M"
model = StateEmbeddingModel.from_pretrained(model_name)

# Random placeholder tensor standing in for a real model input.
input_ids = torch.randn((1, 1, 5120), dtype=torch.float32)

# Boolean mask over the same shape, with the second half of the positions set to False.
mask = torch.ones((1, 1, 5120), dtype=torch.bool)
mask[:, :, 2560:] = False

outputs = model(input_ids, mask)
```

Industry Insights and Company Profile
Arc Institute’s Virtual Cell Challenge underscores the growing intersection of machine learning and biology and aims to democratize AI-driven drug discovery. By providing a robust baseline and clear evaluation metrics, Arc is encouraging engineers from various backgrounds to contribute to this research. The institute is known for its interdisciplinary approach and its commitment to advancing scientific techniques through new technology. The challenge reflects Arc’s vision of using AI to solve complex biological problems, potentially leading to faster and more efficient medical breakthroughs.