MADGEN: Mass-Spec attends to De Novo Molecular generation

The annotation (assigning structural chemical identities) of MS/MS spectra remains a significant challenge due to the enormous molecular diversity in biological samples and the limited scope of reference databases. Currently, the vast majority of spectral measurements remain in the "dark chemical space" without structural annotations. To improve annotation, we propose MADGEN (Mass-spec Attends to De Novo Molecular GENeration), a scaffold-based method for de novo molecular structure generation guided by mass spectrometry data. MADGEN operates in two stages: scaffold retrieval, followed by spectra-conditioned molecular generation starting from the scaffold. In the first stage, given an MS/MS spectrum, we formulate scaffold retrieval as a ranking problem and employ contrastive learning to align mass spectra with candidate molecular scaffolds. In the second stage, starting from the retrieved scaffold, we employ the MS/MS spectrum to guide an attention-based generative model to generate the final molecule. Our approach constrains the molecular generation search space, reducing its complexity and improving generation accuracy. We evaluate MADGEN on three datasets (NIST23, CANOPUS, and MassSpecGym), both with a predictive scaffold retriever and with an oracle retriever. We demonstrate the effectiveness of using attention to integrate spectral information throughout the generation process to achieve strong results with the oracle retriever.
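
As a minimal illustration of the first stage, scaffold retrieval posed as ranking with a contrastive objective can be sketched as follows. This is not the MADGEN implementation: the embeddings are hand-written toy vectors rather than outputs of learned spectrum and scaffold encoders, the function names are hypothetical, and the InfoNCE-style loss with cosine similarity and a temperature of 0.1 is an assumed, commonly used contrastive formulation rather than the one stated in the abstract.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_scaffolds(spectrum_emb, scaffold_embs):
    """Return candidate indices sorted by decreasing similarity, plus the scores."""
    scores = [cosine(spectrum_emb, s) for s in scaffold_embs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order, scores


def info_nce_loss(spectrum_emb, scaffold_embs, positive_index, temperature=0.1):
    """InfoNCE-style loss (assumed form): cross-entropy over similarity logits,
    with the true scaffold as the positive and the other candidates as negatives."""
    logits = [cosine(spectrum_emb, s) / temperature for s in scaffold_embs]
    max_logit = max(logits)  # subtract the max for numerical stability
    log_z = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_z - logits[positive_index]


# Toy embeddings (hypothetical values standing in for encoder outputs).
spectrum = [0.9, 0.1, 0.0]
candidates = [
    [0.0, 1.0, 0.0],  # scaffold 0
    [1.0, 0.0, 0.0],  # scaffold 1, closest to the spectrum embedding
    [0.5, 0.5, 0.7],  # scaffold 2
]

order, scores = rank_scaffolds(spectrum, candidates)
loss_correct = info_nce_loss(spectrum, candidates, positive_index=1)
loss_wrong = info_nce_loss(spectrum, candidates, positive_index=0)
print("ranking:", order)
print("loss if scaffold 1 is the true one:", loss_correct)
print("loss if scaffold 0 is the true one:", loss_wrong)
```

In this toy setting the loss is small when the true scaffold is already the top-ranked candidate and large otherwise, which is the signal a contrastive objective of this kind would use to pull matching spectrum and scaffold embeddings together.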