MolHIT: Advancing Molecular Graph Generation with Hierarchical Discrete Diffusion Models

Hojung Jung Rodrigo Hormazabal Jaehyeong Jo Youngrok Park Kyunggeun Roh Se-Young Yun Sehui Han Dae-Woong Jeong

Abstract

Molecular generation with diffusion models has emerged as a promising direction for AI-driven drug discovery and materials science. Although graph diffusion models are widely adopted because of the discrete nature of 2D molecular graphs, existing methods suffer from low chemical validity and struggle to reach desired properties compared with 1D approaches. In this work, we introduce MolHIT, a powerful framework for molecular graph generation that overcomes the long-standing performance limitations of existing methods. MolHIT is built on a Hierarchical Discrete Diffusion Model, which generalizes discrete diffusion to additional categories encoding chemical priors, and on Decoupled Atom Encoding, which splits atom types according to their chemical roles. Overall, MolHIT achieves new state-of-the-art performance on the MOSES dataset, reaching near-perfect validity for the first time in graph diffusion and surpassing strong 1D baselines on several metrics. We also demonstrate strong performance on downstream tasks, including multi-property guided generation and scaffold extension.

One-sentence Summary

Researchers from KAIST AI, LG AI Research, and Seoul National University propose MolHIT, a hierarchical discrete diffusion model that enhances molecular graph generation by encoding chemical priors and decoupling atom roles, achieving near-perfect validity and SOTA performance on MOSES, with strong downstream applications in property-guided design and scaffold extension.

Key Contributions

  • MolHIT introduces a Hierarchical Discrete Diffusion Model that encodes chemical priors through coarse-to-fine state transitions, addressing the low validity of prior graph diffusion models while preserving structural novelty.
  • It proposes Decoupled Atom Encoding to split atom types by chemical roles like aromaticity and charge, resolving information loss in naive encodings and improving reconstruction and generation reliability.
  • Evaluated on MOSES and other benchmarks, MolHIT achieves near-perfect validity and state-of-the-art performance across unconditional and conditional tasks, outperforming both 1D and existing 2D baselines.

Introduction

The authors combine hierarchical discrete diffusion with chemically aware atom encoding to tackle molecular graph generation, where prior models struggle to balance validity and novelty. Existing graph diffusion approaches treat atoms as independent categories and use naive encodings that ignore chemical roles such as aromaticity or charge, which leads to invalid or unrealistic structures. MolHIT introduces a two-stage diffusion process that first learns coarse chemical identities and then refines them, paired with Decoupled Atom Encoding, which explicitly separates atom types by their chemical roles. The result is near-perfect validity on MOSES and performance that exceeds both 1D and 2D baselines across benchmarks and downstream tasks such as scaffold extension and multi-property generation.

Dataset

  • The authors use two large molecular datasets: MOSES (1.9M molecules, 7 heavy atom types) and GuacaMol (12 heavy atom types).
  • Both datasets are processed with Decoupled Atom Encoding (DAE): the MOSES vocabulary expands to 12 tokens and the GuacaMol vocabulary to 56 tokens.
  • The model architecture follows DiGress, using the same graph transformer size.
  • All results are averaged over three independent runs, with standard deviations in Appendix D.3.

Method

The authors leverage a Hierarchical Discrete Diffusion Model (HDDM) to generalize the standard discrete diffusion framework into a multi-stage corruption process, enabling more structured and chemically meaningful denoising for molecular graph generation. The core innovation lies in augmenting the state space with mid-level semantic categories that bridge the clean atom types and a final masked state, allowing the model to progressively refine predictions from broad chemical classes to specific atomic identities.

As shown in the framework diagram, the HDDM operates over a three-tiered state space: the clean state set $\mathcal{S}_0$, a set of mid-level states $\mathcal{S}_1$, and a single masked state $\mathcal{S}_2 = \{ \mathbf{m} \}$. The forward diffusion process is governed by a sequence of transition matrices that progressively map clean atoms into their semantic groups (e.g., halogens, aromatics, chalcogens, aliphatics) before ultimately transitioning to the masked state. This hierarchical structure is encoded via a block-structured transition kernel $Q^{(1)}$ that maps clean states to mid-level states, and $Q^{(2)}$ that maps all non-masked states to the mask. The cumulative forward transition at timestep $t$ is defined as:

$$
Q_t = \alpha_t \mathbf{I} + (\beta_t - \alpha_t) Q^{(1)} + (1 - \beta_t) Q^{(2)},
$$

where $\alpha_t$ and $\beta_t$ are monotonically decreasing diffusion schedules satisfying $\alpha_0 = \beta_0 = 1$ and $\alpha_T = \beta_T = 0$, with $\alpha_t \leq \beta_t$. This formulation ensures Chapman–Kolmogorov consistency, enabling tractable multi-step transitions.
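The cumulative kernel can be made concrete on a toy state space. The following is a minimal sketch, not the authors' implementation: the four atoms, the two groups, the names `Q1`, `Q2`, and `cumulative_kernel`, and the row-vector convention (row i is the distribution over next states given state i) are all illustrative assumptions. It also assumes that $Q^{(1)}$ leaves mid-level and masked states where they are and that $Q^{(2)}$ sends every state, including the mask itself, to the mask.

```python
import numpy as np

# Toy state space (illustrative only, not the paper's vocabulary):
# clean atoms -> mid-level groups -> a single mask state.
CLEAN = ["C", "N", "F", "Cl"]
GROUPS = ["organic", "halogen"]
GROUP_OF = {"C": "organic", "N": "organic", "F": "halogen", "Cl": "halogen"}
STATES = CLEAN + GROUPS + ["MASK"]
IDX = {s: i for i, s in enumerate(STATES)}
K = len(STATES)

# Q1: each clean state jumps to its group; group and mask rows stay put
# (assumption made so that the kernels compose cleanly).
Q1 = np.zeros((K, K))
for s in STATES:
    Q1[IDX[s], IDX[GROUP_OF.get(s, s)]] = 1.0

# Q2: every state jumps to the mask.
Q2 = np.zeros((K, K))
Q2[:, IDX["MASK"]] = 1.0


def cumulative_kernel(alpha: float, beta: float) -> np.ndarray:
    """Q_t = alpha * I + (beta - alpha) * Q1 + (1 - beta) * Q2."""
    assert 0.0 <= alpha <= beta <= 1.0, "schedules must satisfy alpha <= beta"
    return alpha * np.eye(K) + (beta - alpha) * Q1 + (1.0 - beta) * Q2


Q_t = cumulative_kernel(alpha=0.4, beta=0.63)
# Row for fluorine: stay F with prob. alpha, sit in "halogen" with
# prob. beta - alpha, be masked with prob. 1 - beta.
print(dict(zip(STATES, Q_t[IDX["F"]])))
```

Under these assumptions every row sums to one, and multiplying two such kernels gives another kernel of the same form whose schedules are the products of the individual $\alpha$ and $\beta$ values. That is one way to read the consistency property mentioned above.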

For molecular graph generation, the authors decouple the diffusion process for atoms and bonds. Atoms are perturbed via the HDDM process, while bonds follow a uniform transition kernel to ensure structural diversity. The forward dynamics are thus:

$$
\begin{aligned}
Q_{X,t} &= \alpha_{X,t} \mathbf{I} + (\beta_{X,t} - \alpha_{X,t}) Q_{X,t}^{(1)} + (1 - \beta_{X,t}) Q_{X,t}^{(2)}, \\
Q_{E,t} &= \alpha_{E,t} \mathbf{I} + (1 - \alpha_{E,t}) \mathbf{1}_{d_E} \mathbf{1}_{d_E}^T.
\end{aligned}
$$
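As a rough illustration of the bond side of the forward process, the sketch below builds a uniform edge kernel and samples a noisy, symmetric bond matrix. It is a sketch under stated assumptions rather than the paper's code: the function names, the four bond types, and the value of `alpha` are hypothetical, and the all-ones term is divided by $d_E$ here so that each row is a valid probability distribution, a normalization the formula above does not show explicitly.

```python
import numpy as np


def uniform_edge_kernel(alpha: float, d_e: int) -> np.ndarray:
    """alpha * I + (1 - alpha) * (1 1^T) / d_e, with rows summing to one."""
    return alpha * np.eye(d_e) + (1.0 - alpha) * np.ones((d_e, d_e)) / d_e


def noise_edges(E0: np.ndarray, alpha: float, d_e: int, rng) -> np.ndarray:
    """Sample noisy bond types for the upper triangle and mirror them."""
    Q = uniform_edge_kernel(alpha, d_e)
    n = E0.shape[0]
    Et = np.zeros_like(E0)
    for i in range(n):
        for j in range(i + 1, n):
            # Row E0[i, j] of Q is the distribution over noisy bond types.
            Et[i, j] = Et[j, i] = rng.choice(d_e, p=Q[E0[i, j]])
    return Et


rng = np.random.default_rng(0)
# Hypothetical bond vocabulary: 0 = no bond, 1 = single, 2 = double, 3 = triple.
E0 = np.array([[0, 1, 0],
               [1, 0, 2],
               [0, 2, 0]])
Et = noise_edges(E0, alpha=0.7, d_e=4, rng=rng)
```

Sampling only the upper triangle keeps the noisy graph undirected, which matches the sum over pairs $i < j$ in the training loss.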

The model is trained to predict the clean graph $G_0 = (X_0, E_0)$ from a noisy graph $G_t$, using a cross-entropy loss that independently optimizes atom and bond predictions:

$$
\mathcal{L}_\theta = \mathbb{E}_{t, G_t \sim q(\cdot|G_0)} \left[ \sum_{i=1}^n -\log p_\theta^X(X_{0,i} | G_t, t) + \lambda \sum_{1 \leq i < j \leq n} -\log p_\theta^E(E_{0,ij} | G_t, t) \right],
$$

where $\lambda$ balances node and edge contributions.
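For a single sampled pair $(t, G_t)$, the bracketed term of the loss reduces to two summed negative log-likelihoods. The snippet below is a minimal sketch of that term, assuming the network outputs are already log-probabilities; the function name, the array layout, and the default `lam=1.0` are assumptions for illustration, not values from the paper.

```python
import numpy as np


def graph_cross_entropy(logp_x, x0, logp_e, e0, lam=1.0):
    """Node NLL plus lam times edge NLL over the pairs i < j.

    logp_x: (n, d_X) log-probabilities for each atom's clean type
    x0:     (n,)     integer clean atom types
    logp_e: (n, n, d_E) log-probabilities for each bond's clean type
    e0:     (n, n)   integer clean bond types (symmetric)
    """
    n = x0.shape[0]
    node_nll = -logp_x[np.arange(n), x0].sum()
    iu, ju = np.triu_indices(n, k=1)  # each undirected pair counted once
    edge_nll = -logp_e[iu, ju, e0[iu, ju]].sum()
    return node_nll + lam * edge_nll


# Sanity check with uniform predictions: 3 atoms over 4 types, 3 pairs over 2 types.
n, d_x, d_e = 3, 4, 2
logp_x = np.full((n, d_x), -np.log(d_x))
logp_e = np.full((n, n, d_e), -np.log(d_e))
x0 = np.array([0, 1, 2])
e0 = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
loss = graph_cross_entropy(logp_x, x0, logp_e, e0)  # 3*log(4) + 3*log(2)
```

In training, this quantity would be averaged over sampled timesteps and noisy graphs to estimate the expectation.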

To enhance chemical fidelity, the authors introduce Decoupled Atom Encoding (DAE), which expands the atom vocabulary by explicitly encoding aromaticity, hydrogen saturation, and formal charge as distinct token states. This resolves structural ambiguities present in coarse encodings (e.g., distinguishing [n] from [nH]) and enables near-perfect reconstruction of complex motifs like heteroaromatics and zwitterions. As illustrated in the figure, DAE allows MolHIT to generate molecules with formal charges at proportions matching the training distribution, a capability absent in prior models.
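The ambiguity that DAE removes can be shown with a small tokenization sketch. This is an illustration only: the `Atom` fields and the bracketed label format are hypothetical, and the paper's actual token vocabularies (12 tokens for MOSES, 56 for GuacaMol) are not reproduced here.

```python
from typing import NamedTuple


class Atom(NamedTuple):
    element: str           # element symbol, e.g. "N"
    aromatic: bool = False
    explicit_h: int = 0    # hydrogens written on the atom, as in [nH]
    charge: int = 0        # formal charge


def coarse_token(atom: Atom) -> str:
    """Element-only encoding: aromaticity, hydrogens and charge are lost."""
    return atom.element


def decoupled_token(atom: Atom) -> str:
    """One token per (element, aromaticity, hydrogen count, charge)."""
    symbol = atom.element.lower() if atom.aromatic else atom.element
    if atom.explicit_h == 0:
        h = ""
    elif atom.explicit_h == 1:
        h = "H"
    else:
        h = f"H{atom.explicit_h}"
    if atom.charge == 0:
        q = ""
    else:
        sign = "+" if atom.charge > 0 else "-"
        q = sign if abs(atom.charge) == 1 else f"{sign}{abs(atom.charge)}"
    return f"[{symbol}{h}{q}]"


pyridine_n = Atom("N", aromatic=True)                # ring nitrogen without H
pyrrole_n = Atom("N", aromatic=True, explicit_h=1)   # ring nitrogen carrying H
ammonium_n = Atom("N", explicit_h=3, charge=1)       # charged nitrogen

for atom in (pyridine_n, pyrrole_n, ammonium_n):
    print(coarse_token(atom), "->", decoupled_token(atom))
# N -> [n]
# N -> [nH]
# N -> [NH3+]
```

All three nitrogens collapse to the same coarse token, so a decoder working from the coarse encoding cannot tell them apart; the decoupled tokens keep them distinct.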

Sampling is performed via a Project-and-Noise (PN) sampler, which projects the model’s denoised prediction onto the discrete manifold via categorical sampling and then re-noises it to the previous timestep using the forward kernel. This bypasses posterior constraints and encourages structural diversity. Temperature and top-$p$ sampling are applied selectively to atom predictions to control the quality-diversity trade-off. The overall generation process, depicted in the figure, shows a molecule evolving from a fully masked state at $t=T$, through mid-level semantic states at $t=T/2$, to a fully reconstructed structure at $t=0$, guided by the hierarchical transition probabilities.
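One step of the project-then-re-noise idea, applied to atoms, might look like the sketch below. It is a hedged reading of the description above, not the authors' sampler: `top_p_filter`, `project_and_noise_step`, the argument `Q_prev` (the cumulative forward kernel at the previous timestep, with rows as source states), and the default values of `p` and `temperature` are all assumptions.

```python
import numpy as np


def top_p_filter(probs, p=0.9, temperature=1.0):
    """Temperature-scale, then keep the smallest set of states with mass >= p."""
    logits = np.log(np.clip(probs, 1e-12, None)) / temperature
    scaled = np.exp(logits - logits.max())
    scaled /= scaled.sum()
    order = np.argsort(-scaled)                   # most probable first
    cum = np.cumsum(scaled[order])
    keep = order[: np.searchsorted(cum, p) + 1]   # first prefix reaching mass p
    out = np.zeros_like(scaled)
    out[keep] = scaled[keep]
    return out / out.sum()


def project_and_noise_step(pred_probs, Q_prev, rng, p=0.9, temperature=1.0):
    """pred_probs: (n, K) predicted clean-state distributions per atom."""
    n, K = pred_probs.shape
    x_prev = np.empty(n, dtype=int)
    for i in range(n):
        # Project: draw a discrete clean state from the filtered prediction.
        x0_hat = rng.choice(K, p=top_p_filter(pred_probs[i], p, temperature))
        # Noise: push that state through the forward kernel of the previous step.
        x_prev[i] = rng.choice(K, p=Q_prev[x0_hat])
    return x_prev


rng = np.random.default_rng(0)
filtered = top_p_filter(np.array([0.5, 0.3, 0.15, 0.05]), p=0.7)
```

Because the re-noising uses only the forward kernel, the step never needs the posterior $q(x_{t-1} \mid x_t, x_0)$, which is one way to understand the remark that the sampler bypasses posterior constraints.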

Experiment

  • MolHIT achieves state-of-the-art performance on MOSES across key metrics including Quality, Validity, FCD, and Scaffold Novelty, demonstrating strong navigation of the drug-like chemical manifold while exploring novel structures.
  • On GuacaMol, MolHIT outperforms baselines across most metrics despite using the full unfiltered dataset, showing robustness to charged and complex molecules; performance gaps in FCD are attributed to modeling challenges with extended atom vocabularies.
  • In multi-property guided generation, MolHIT significantly improves conditioning precision (52.4% lower MAE) and reliability (Pearson r up to 0.950) without sacrificing validity, confirming effective control over chemical properties like QED, SA, MW, and logP.
  • For scaffold extension, MolHIT surpasses DiGress in validity, diversity, and hit rates, indicating superior ability to generate chemically plausible and structurally diverse extensions while preserving fixed scaffolds.
  • Ablation studies confirm that DAE, PN Sampler, and HDDM each contribute meaningfully to overall performance, while temperature sampling reveals a trade-off between quality and novelty, with optimal settings yielding near-perfect validity and high quality.

The authors use MolHIT to generate molecules on the MOSES benchmark and compare it against both 1D sequence and 2D graph baselines. Results show MolHIT achieves the highest Quality and Scaffold Novelty while maintaining near-perfect validity, outperforming prior models in balancing structural innovation with chemical feasibility. The model also demonstrates strong distributional fidelity, as reflected in high Scaffold Retrieval and SNN scores, indicating it effectively captures the underlying drug-like chemical space without overfitting.

The authors evaluate MolHIT on the full, unfiltered GuacaMol dataset, whereas prior models were trained on filtered subsets. Results show MolHIT achieves the highest validity and scaffold novelty while maintaining strong distributional fidelity, outperforming DiGress variants across most metrics despite using fewer training epochs. The model demonstrates robustness in handling charged atoms and broader chemical space without sacrificing structural quality.

The authors use the MOSES dataset to evaluate multi-property guided generation, conditioning models on QED, SA, logP, and MW. Results show MolHIT achieves high precision in matching target properties with low MAE and strong Pearson correlation, while maintaining validity above 95%. This indicates the model effectively balances property control with structural feasibility.
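The two conditioning metrics named here can be computed as in the sketch below: mean absolute error between requested and achieved property values, and the Pearson correlation between them. The helper name and the numbers are made-up toy values for illustration only; they are not results from the paper.

```python
import numpy as np


def conditioning_metrics(target, achieved):
    """Return (MAE, Pearson r) between requested and achieved property values."""
    target = np.asarray(target, dtype=float)
    achieved = np.asarray(achieved, dtype=float)
    mae = np.abs(target - achieved).mean()
    r = np.corrcoef(target, achieved)[0, 1]
    return mae, r


# Toy example: requested QED values vs. values measured on generated molecules.
target_qed = [0.2, 0.4, 0.6, 0.8]
achieved_qed = [0.25, 0.35, 0.65, 0.75]
mae, r = conditioning_metrics(target_qed, achieved_qed)
```

A low MAE indicates the generated molecules land close to the requested values, while a high Pearson r indicates that raising the target reliably raises the achieved property.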

The authors use an ablation study to show that integrating decoupled atom encoding, the PN sampler, and HDDM into DiGress progressively improves molecular generation quality, validity, and distributional fidelity. Results show that MolHIT achieves the highest Quality and near-perfect Validity while maintaining competitive FCD, indicating effective navigation of the drug-like chemical space. Each component contributes meaningfully, with the full MolHIT configuration outperforming all intermediate variants.

The authors evaluate MolHIT on scaffold extension tasks using the MOSES dataset, comparing it against DiGress and a marginal transition baseline with decoupled atom encoding. Results show MolHIT achieves significantly higher validity and Hit@1 and Hit@5 scores, indicating stronger capability to recover ground-truth molecular extensions while maintaining structural diversity. The improvements suggest MolHIT better balances fidelity to fixed scaffolds with exploration of valid chemical space.

