dnaHNet: A Scalable and Hierarchical Foundation Model for Genomic Sequence Learning
Arnav Shah Junzhe Li Parsa Idehpour Adibvafa Fallahpour Brandon Wang Sukjun Hwang Bo Wang Patrick D. Hsu Hani Goodarzi Albert Gu
Abstract
Genomic foundation models hold the potential to decode the syntax of DNA, but they face a fundamental trade-off in their input representation. Standard fixed-vocabulary tokenizers fragment biologically meaningful motifs such as codons and regulatory elements, while nucleotide-level models preserve biological coherence but incur prohibitive computational costs over long contexts. We introduce dnaHNet, a state-of-the-art, tokenizer-free autoregressive model that segments and models genomic sequences end-to-end. Through a differentiable dynamic chunking mechanism, dnaHNet adaptively compresses raw nucleotides into latent tokens, balancing compression against predictive accuracy. Pretrained on prokaryotic genomes, dnaHNet outperforms leading architectures, including StripedHyena2, in scaling and efficiency. Its recursive chunking yields a quadratic reduction in FLOPs, enabling a more than 3x inference speedup over Transformers. On zero-shot tasks, dnaHNet achieves superior performance in predicting protein variant fitness and gene essentiality, while automatically discovering hierarchical biological structure without supervision. These results establish dnaHNet as a scalable and interpretable framework for next-generation genomic modeling.
One-sentence Summary
By employing a differentiable dynamic chunking mechanism to adaptively compress raw nucleotides into latent tokens, the tokenizer-free autoregressive foundation model dnaHNet achieves superior scaling and efficiency over architectures like StripedHyena2 and Transformers, delivering over 3x inference speedup and state-of-the-art performance on zero-shot tasks such as predicting protein variant fitness and gene essentiality.
Key Contributions
- The paper introduces dnaHNet, a tokenizer-free autoregressive model that utilizes a differentiable dynamic chunking mechanism to adaptively compress raw nucleotides into latent tokens.
- This architecture achieves significant computational efficiency by employing recursive chunking to reduce FLOPs quadratically, resulting in over 3x faster inference speeds compared to Transformer models.
- Experiments on prokaryotic genomes demonstrate that the model outperforms leading architectures like StripedHyena2 in scaling and efficiency, while also showing superior zero-shot performance in predicting protein variant fitness and gene essentiality.
Introduction
Genomic foundation models are essential for decoding DNA syntax to advance fields like drug discovery and synthetic biology. However, current approaches face a fundamental tradeoff between computational efficiency and biological accuracy. Fixed-vocabulary tokenizers are efficient but often fragment critical biological motifs like codons, while nucleotide-level models preserve biological coherence but suffer from prohibitive computational costs when processing long genomic contexts. The authors leverage a hierarchical, tokenizer-free architecture called dnaHNet to resolve this tension. By using a differentiable dynamic chunking mechanism, the model adaptively compresses raw nucleotides into latent tokens, allowing it to achieve superior scaling, faster inference, and state-of-the-art performance on zero-shot tasks such as protein variant fitness prediction.
Dataset
The authors evaluate dnaHNet using three distinct datasets designed to test different biological modeling capabilities:
- Protein Variant Effects (MaveDB): This subset consists of 21,250 nucleotide-level data points compiled from 12 experimental fitness datasets for E. coli K-12. It is used to assess the model's ability to capture local coding syntax and predict protein fitness landscapes.
- Gene Essentiality (DEG): Comprising 185,226 data points, this dataset was constructed by generating binary essentiality labels for genes across 62 bacterial organisms from the Database of Essential Genes. The authors sourced base sequences and annotations from NCBI and labeled genes as essential if they matched DEG entries by name or sequence identity greater than 99 percent. This subset evaluates the model's capacity to integrate genomic context and long-range dependencies.
- Genomic Structure Interpretation (NCBI): To perform interpretability analysis, the authors used the B. subtilis genome and functional annotations from NCBI. They partitioned the genome into distinct functional regions based on these annotations to determine how the model's segmentation aligns with known biological structures.
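The DEG labeling rule described above (match by gene name, or by sequence identity above 99 percent) can be sketched as follows. This is a simplified illustration: the position-wise `sequence_identity` comparison and the `deg_entries` format are assumptions, since the paper does not specify the exact matching procedure.

```python
def sequence_identity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length sequences.
    (Position-wise comparison is an assumption; an alignment tool may be used in practice.)"""
    if len(a) != len(b) or not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)

def label_essential(gene_name, gene_seq, deg_entries, identity_threshold=0.99):
    """Label a gene essential (1) if it matches a DEG entry by name
    or by sequence identity above the threshold, else non-essential (0)."""
    for entry_name, entry_seq in deg_entries:
        if gene_name == entry_name:
            return 1
        if sequence_identity(gene_seq, entry_seq) > identity_threshold:
            return 1
    return 0
```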
Method
The authors introduce dnaHNet, a scalable, tokenizer-free foundation model designed for genomic sequence learning. The model formulates genomic learning as an autoregressive sequence modeling problem: given a nucleotide sequence $X = (x_1, \ldots, x_L)$ with $x_t \in \{A, C, G, T\}$, the objective is to model the probability distribution $P(X) = \prod_{t=1}^{L} P(x_t \mid x_{<t})$. To handle long genomic contexts efficiently, dnaHNet utilizes a recursive hierarchical architecture consisting of three primary differentiable modules: an Encoder (E), a Main Network (M), and a Decoder (D).
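The autoregressive objective above can be made concrete as a sequence negative log-likelihood. In this minimal sketch, the `uniform` toy model stands in for dnaHNet's learned next-nucleotide distribution (an illustrative assumption, not the paper's model):

```python
import math

VOCAB = "ACGT"

def sequence_nll(seq, next_prob):
    """Negative log-likelihood of a nucleotide sequence under an
    autoregressive model: -sum_t log P(x_t | x_<t)."""
    nll = 0.0
    for t, x in enumerate(seq):
        probs = next_prob(seq[:t])  # distribution over A, C, G, T given the prefix
        nll -= math.log(probs[VOCAB.index(x)])
    return nll

# Toy stand-in model: uniform over the four nucleotides, for illustration only.
uniform = lambda prefix: [0.25] * 4
```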
[Figure: dnaHNet framework diagram.]
The Encoder is responsible for compressing nucleotide-level inputs into latent chunks through a dynamic segmentation mechanism. It employs a hybrid backbone consisting of four Mamba layers and one Transformer layer. The Encoder transforms input embeddings into hidden states $h_{1:L} \in \mathbb{R}^{L \times D}$. To determine segmentation boundaries, a boundary prediction module computes probabilities $p_t \in [0, 1]$ using the following formulation:

$$p_t = \frac{1}{2}\left(1 - \mathrm{CosineSim}(W_q h_t, W_k h_{t-1})\right)$$

where $W_q$ and $W_k$ are learnable projection matrices. High boundary probabilities are assigned to nucleotides with dissimilar representations, which encourages the model to segment at biologically significant transitions, such as codon boundaries. The Chunking layer then downsamples the output by selecting representations at these predicted boundaries, resulting in a compressed sequence $E = (e_1, \ldots, e_{L'})$ where $L' \le L$.
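The boundary predictor might be sketched as follows with plain NumPy; the projections, the convention that the first position is always a boundary, and the epsilon for numerical stability are assumptions, since the actual module sits inside the trained Mamba/Transformer encoder:

```python
import numpy as np

def boundary_probs(h, Wq, Wk, eps=1e-8):
    """p_t = 0.5 * (1 - CosineSim(Wq h_t, Wk h_{t-1})) for t >= 1.
    By convention here, the first position is always a boundary (p_0 = 1)."""
    q = h @ Wq.T                          # (L, D') query projections
    k = h @ Wk.T                          # (L, D') key projections
    num = np.sum(q[1:] * k[:-1], axis=-1)
    den = np.linalg.norm(q[1:], axis=-1) * np.linalg.norm(k[:-1], axis=-1) + eps
    p = 0.5 * (1.0 - num / den)           # dissimilar neighbors -> p near 1
    return np.concatenate([[1.0], p])     # shape (L,)
```

Identical neighboring states yield p near 0 (no boundary), while orthogonal states yield p = 0.5 and anti-aligned states yield p near 1, matching the intuition that the model cuts at representational transitions.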
[Figure omitted.]
The compressed sequence $E$ is then processed by the Main Network M. This module can be a standard Transformer or another H-Net module, which allows for recursive chunking to capture multiple levels of abstraction. In a two-stage hierarchy, the first stage captures high-frequency local patterns like codon periodicity, while the second stage models long-range dependencies across functional regions. The Main Network outputs processed latent states $\hat{E} = (\hat{e}_1, \ldots, \hat{e}_{L'}) \in \mathbb{R}^{L' \times D}$.
The Decoder D maps these latent states back to the original nucleotide resolution through a two-step process. First, a smoothing module refines the latent states into smoothed representations $\bar{E} = (\bar{e}_1, \ldots, \bar{e}_{L'})$ using a recurrence that interpolates discrete chunks:

$$\bar{e}_j = P_j \hat{e}_j + (1 - P_j)\,\bar{e}_{j-1}$$

where $P_j$ is the boundary probability for the $j$-th chunk. Second, an upsampler expands these smoothed latents to the original length $L$ by copying the vector $\bar{e}_{c(t)}$ to every nucleotide position $t$ corresponding to the chunk index $c(t)$. The Decoder then utilizes four Mamba layers and one Transformer layer to model autoregressive dependencies, with a linear head projecting the output to the nucleotide vocabulary logits to produce the next-nucleotide distribution.
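The smoothing recurrence and upsampling step can be sketched as below. This is a minimal NumPy version under the assumption that the recurrence starts from a zero state; the real decoder operates on batched, learned representations:

```python
import numpy as np

def smooth_and_upsample(e_hat, P, chunk_index):
    """Decoder sketch: interpolating smoothing over chunks, then
    copy-upsampling back to nucleotide resolution.

    e_hat:       (L', D) processed chunk states from the Main Network
    P:           (L',)   boundary probability per chunk
    chunk_index: (L,)    maps each nucleotide position t to its chunk c(t)
    """
    Lp, D = e_hat.shape
    e_bar = np.zeros_like(e_hat)
    prev = np.zeros(D)                    # assumed zero initial state
    for j in range(Lp):
        # e_bar_j = P_j * e_hat_j + (1 - P_j) * e_bar_{j-1}
        e_bar[j] = P[j] * e_hat[j] + (1.0 - P[j]) * prev
        prev = e_bar[j]
    return e_bar[chunk_index]             # (L, D): one vector per nucleotide
```

Confident chunks (P near 1) pass through unchanged, while uncertain chunks blend toward the previous smoothed state, which keeps the decode differentiable through the soft boundaries.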
The training process is conducted end-to-end using a composite objective. The primary component is the autoregressive next-token prediction loss

$$\mathcal{L}_{\mathrm{NLL}} = -\sum_{t=1}^{L} \log P_\theta(x_t \mid x_{<t}).$$

To prevent degenerate segmentation during training, the authors incorporate a ratio loss that regularizes the dynamic chunking toward a target downsampling ratio $R_s$ for each stage $s$:

$$\mathcal{L}_{\mathrm{rate}}^{(s)} = \frac{R_s}{R_s - 1}\left((R_s - 1)\, F_s G_s + (1 - F_s)(1 - G_s)\right)$$

where $F_s$ is the actual fraction of selected chunks and $G_s$ is the average boundary probability. The total loss is defined as $\mathcal{L} = \mathcal{L}_{\mathrm{NLL}} + \alpha \sum_s \mathcal{L}_{\mathrm{rate}}^{(s)}$, where $\alpha$ is a regularization coefficient.
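As a sanity check on the ratio loss, a direct transcription of the formula shows it evaluates to 1 when the selected-chunk fraction and the average boundary probability both hit the target $F_s = G_s = 1/R_s$, and grows as segmentation drifts away from that ratio (a minimal sketch, not the training code):

```python
def ratio_loss(F, G, R):
    """L_rate = R/(R-1) * ((R-1)*F*G + (1-F)*(1-G)).
    F: actual fraction of selected chunks, G: average boundary
    probability, R: target downsampling ratio (R > 1)."""
    return R / (R - 1.0) * ((R - 1.0) * F * G + (1.0 - F) * (1.0 - G))
```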
Experiment
The experiments compare dnaHNet against StripedHyena2 and Transformer++ architectures through scaling law analyses, zero-shot biological prediction tasks, and structural interpretability studies. Results demonstrate that dnaHNet achieves superior compute and inference efficiency, consistently outperforming baselines in perplexity scaling and downstream performance on protein variant effect and gene essentiality predictions. Furthermore, the model demonstrates an emergent ability to learn biological hierarchies without supervision, effectively discovering codon structures and functional genomic regions through its hierarchical compression mechanism.
The authors analyze the learned hierarchical chunking boundaries of the dnaHNet model across different stages. The results demonstrate that the first stage captures local triplet codon structure, while the second stage identifies broader functional genomic organization. The first stage shows strong periodicity within coding regions, with selection rates varying significantly by codon position. The second stage exhibits higher selection rates for functional regions like promoters and intergenic areas compared to coding regions. The overall global selection rate increases from the first stage to the second stage.
The authors evaluate zero-shot protein variant effect prediction for dnaHNet against StripedHyena2 and Transformer baselines across training compute budgets. dnaHNet consistently achieves higher Spearman correlation with experimental fitness than either baseline, its predictive accuracy improves steadily as training FLOPs increase, and the gap over the baselines widens at higher compute scales.
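A typical zero-shot protocol for this evaluation, which we assume here since the paper summary does not spell it out, scores each variant by its model likelihood and reports the Spearman rank correlation against measured fitness. A dependency-free rank-correlation sketch:

```python
def ranks(values):
    """1-based average ranks, with ties sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(model_scores, fitness):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = ranks(model_scores), ranks(fitness)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

In practice one would use `scipy.stats.spearmanr`; the explicit version just makes the rank-based metric in the evaluation transparent.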
The authors compare gene essentiality prediction for two dnaHNet hierarchical configurations against a StripedHyena2 baseline across compute budgets. Both dnaHNet hierarchies improve in AUROC as training FLOPs increase and maintain a consistent advantage over StripedHyena2 across the tested compute range, with the (3,2) hierarchy achieving higher AUROC than the (2,2) hierarchy at matched compute.
The researchers evaluate the dnaHNet model by analyzing its hierarchical chunking boundaries and testing its predictive capabilities for protein variant effects and gene essentiality. The findings reveal that the model successfully captures both local codon structures and broader functional genomic organization through its multi-stage hierarchy. Across various compute budgets, dnaHNet consistently outperforms StripedHyena2 and Transformer baselines, demonstrating superior scaling and predictive accuracy in both protein fitness and gene essentiality tasks.