
Massive Biological Datasets Fuel AI Breakthroughs in Genome Regulation and Disease Research

In June, Google DeepMind unveiled AlphaGenome, a machine learning model designed to predict how genetic variants influence gene regulation—the processes that control when and where genes are turned on or off. Unlike AlphaFold, which focuses on protein structure, AlphaGenome tackles the regulatory layer of the genome, helping scientists understand the functional impact of noncoding DNA variants, many of which are linked to disease.

The model's training relied heavily on two foundational biological datasets created over the past decade by the Encyclopedia of DNA Elements (ENCODE) Consortium and the Genotype-Tissue Expression (GTEx) Project. ENCODE mapped more than a million regulatory elements across the human genome, revealing that much of the noncoding DNA—previously dismissed as "junk"—plays a crucial role in gene control. GTEx expanded on this by systematically measuring how genetic variants affect gene expression across diverse human and primate tissues, linking variants to real-world biological effects and providing vital context for understanding disease risk.

These resources, developed at the Broad Institute and made freely available to the scientific community, have become cornerstones of modern genomics. They laid the groundwork for large-scale initiatives such as the NIH's Impact of Genomic Variation on Function Consortium, the Human Cell Atlas, and the Broad's Gene Regulation Observatory (GRO).

Kristin Ardlie, director of GTEx and an institute scientist at the Broad, emphasized that the long-term value of these datasets lies in their open, utility-driven design. "We built them to be community resources with no restrictions," she said. "Now, more than a decade later, they're enabling advances we couldn't have imagined—like AlphaGenome." Brad Bernstein, an institute member and leader of both ENCODE and the GRO, echoed this sentiment, noting that ENCODE's original goal was to decode the "language" of the genome, challenging the notion that noncoding regions were inert.

The emergence of models like AlphaGenome highlights how these foundational datasets are now fueling the next wave of AI-driven discovery. Such models can interpret complex regulatory patterns at unprecedented resolution, much like adjusting the lenses in an eye exam to see finer details. Beyond AlphaGenome, AI is being applied to genome regulation in multiple ways: Jason Buenrostro's team uses deep learning to study how regulatory elements near genes are organized during cell development; Anders Hansen applies AI to map the genome's 3D structure, crucial for understanding long-range interactions; and Bernstein's own lab collaborated with Google to develop a general model of the genome's regulatory code that can be applied to any cell type.

Looking ahead, Ardlie stressed the need for more data on biological perturbations—such as development and disease progression—captured across the full continuum from healthy to diseased states. She also highlighted the importance of interpreting variants found in genetic testing, many of which are regulatory and currently unclassified. Bernstein pointed to a critical gap: while researchers have vast data on gene activity and transcription factor binding, they lack large-scale, systematic data on how genetic perturbations affect cells. He envisions future experiments that systematically mutate genes in specific cell types to uncover the genome's regulatory rules. The ultimate question, he said, is whether to study variants one at a time or use AI to discover overarching principles. With models like AlphaGenome, science may finally be poised to answer that.
