Arc Institute Launches Virtual Cell Challenge: Predict Gene Silencing Effects Using AI Models

Arc Institute recently launched the Virtual Cell Challenge, which asks participants to build a machine learning model that predicts the effects of silencing a gene in a partially unseen cell type. This task, known as context generalization, could change drug discovery by simulating cellular changes without physical experimentation. The primary objective is to train a model, most likely a neural network, that accurately simulates how a cell behaves when a specific gene is silenced using CRISPR. Such a model could significantly speed up the testing of drug candidates by reducing reliance on costly and time-consuming lab experiments.

Participants will work with a dataset of approximately 300,000 single-cell RNA sequencing profiles. Each profile is a transcriptome: a sparse row vector holding the raw count of RNA molecules for each gene. The training set contains 220,000 of these cells, of which about 38,000 are control cells, meaning no gene has been silenced. These controls establish a baseline and help separate the true effects of gene silencing from other sources of variability.

One key difficulty is that measuring a cell's transcriptome destroys the cell, so the same cell cannot be measured both before and after gene silencing. Perturbed cells must instead be compared with different, unperturbed cells, and natural heterogeneity in the cell population introduces noise into that comparison. To account for this, the model uses control cells as a reference point.

Before launching the challenge, Arc Institute released STATE, a pair of transformer-based models that serve as a strong baseline: the State Embedding Model (SE) and the State Transition Model (ST).

State Embedding Model (SE): SE creates cell embeddings intended to improve generalization across cell types. It starts by generating gene embeddings. Each gene is represented by the amino acid sequences of its protein isoforms, which are fed into ESM2, a 15B-parameter protein language model from Facebook AI Research. ESM2 produces an embedding for each amino acid, and these are mean-pooled to create a protein isoform embedding.
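The mean-pooling step described for SE can be illustrated with a small sketch. It assumes the per-residue embeddings have already been computed; the toy vectors below are made-up stand-ins for ESM2 output (real embeddings are far wider), and `mean_pool` is a hypothetical helper name, not part of STATE or ESM2. Plain Python lists are used only to keep the example dependency-free.

```python
from typing import List


def mean_pool(residue_embeddings: List[List[float]]) -> List[float]:
    """Average per-residue vectors into one fixed-size vector."""
    if not residue_embeddings:
        raise ValueError("need at least one residue embedding")
    dim = len(residue_embeddings[0])
    if any(len(vec) != dim for vec in residue_embeddings):
        raise ValueError("all residue embeddings must share one dimension")
    n = len(residue_embeddings)
    # Average each coordinate across all residues in the sequence.
    return [sum(vec[i] for vec in residue_embeddings) / n for i in range(dim)]


# Toy stand-in for ESM2 output (assumption): 3 residues, 4 dimensions each.
toy_residues = [
    [1.0, 0.0, 2.0, 4.0],
    [3.0, 0.0, 2.0, 0.0],
    [2.0, 3.0, 2.0, 2.0],
]
isoform_embedding = mean_pool(toy_residues)
print(isoform_embedding)  # [2.0, 1.0, 2.0, 2.0]
```

Whatever the length of the protein, pooling yields a vector of the same width, which is what allows sequences of different lengths to be represented uniformly.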
State Transition Model (ST): ST, the "cell simulator," takes a control cell transcriptome or a cell embedding from SE, along with a one-hot encoded vector representing the gene perturbation, and outputs the perturbed transcriptome. The model is a relatively simple transformer with a Llama backbone. The control set tensor and the perturbation tensor are each passed through an independent 4-layer MLP encoder with GELU activations. The model is trained with a Maximum Mean Discrepancy (MMD) loss, which minimizes the difference between the predicted and observed transcriptome distributions.

For those interested in participating, Arc has provided a Colab notebook that walks through training the STATE model.

Arc Institute’s Virtual Cell Challenge underscores the growing intersection of machine learning and biology and aims to democratize AI-driven drug discovery. By providing a robust baseline and clear evaluation metrics, Arc is encouraging engineers from various backgrounds to contribute to this research. The institute is known for its interdisciplinary approach and its commitment to advancing scientific techniques through technology. The challenge reflects Arc’s vision of using AI to solve complex biological problems, potentially leading to faster and more efficient medical breakthroughs.
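To make the MMD objective used to train ST more concrete, here is a minimal sketch of a squared-MMD estimate between two sets of vectors. The article does not say which kernel or estimator STATE uses, so the RBF kernel, the `bandwidth` parameter, and the biased estimator below are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def rbf_kernel(a: Vector, b: Vector, bandwidth: float = 1.0) -> float:
    """Gaussian (RBF) kernel; the kernel choice is an assumption."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mean_kernel(xs: Sequence[Vector], ys: Sequence[Vector], bandwidth: float) -> float:
    """Average kernel value over all pairs (x, y)."""
    total = sum(rbf_kernel(x, y, bandwidth) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd_squared(pred: Sequence[Vector], actual: Sequence[Vector],
                bandwidth: float = 1.0) -> float:
    """Biased estimate of squared MMD between two samples."""
    return (mean_kernel(pred, pred, bandwidth)
            + mean_kernel(actual, actual, bandwidth)
            - 2.0 * mean_kernel(pred, actual, bandwidth))


# Toy "transcriptomes" (made-up numbers): a reference set and two shifted copies.
observed = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
close_prediction = [[x + 0.1, y + 0.1] for x, y in observed]
poor_prediction = [[x + 3.0, y + 3.0] for x, y in observed]

print(mmd_squared(close_prediction, observed))
print(mmd_squared(poor_prediction, observed))
```

The estimate is zero when the two sets are identical and grows as the predicted distribution drifts from the observed one. Because it compares sets of cells rather than matched pairs, it suits a setting where the same cell cannot be measured before and after perturbation.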
