Property Enhanced Instruction Tuning for Multi-task Molecule Generation with Large Language Models

Lin, Xuan; Chen, Long; Wang, Yile; Zeng, Xiangxiang; Yu, Philip S.
Abstract

Large language models (LLMs) are widely applied to natural language processing tasks such as question answering and machine translation. However, due to the lack of labeled data and the difficulty of manually annotating biochemical properties, their performance on molecule generation tasks remains limited, especially for tasks involving multi-property constraints. In this work, we present a two-step framework, PEIT (Property Enhanced Instruction Tuning), to improve LLMs on molecule-related tasks. In the first step, we use textual descriptions, SMILES, and biochemical properties as multimodal inputs to pre-train a model called PEIT-GEN, aligning the multimodal representations in order to synthesize instruction data. In the second step, we fine-tune existing open-source LLMs with the synthesized data; the resulting PEIT-LLM can handle molecule captioning, text-based molecule generation, molecular property prediction, and our newly proposed multi-constraint molecule generation task. Experimental results show that the pre-trained PEIT-GEN outperforms MolT5 and BioT5 in molecule captioning, demonstrating that textual descriptions, structures, and biochemical properties are well aligned across modalities. Furthermore, PEIT-LLM shows promising improvements in multi-task molecule generation, demonstrating the scalability of the PEIT framework to various molecular tasks. We release the code, constructed instruction data, and model checkpoints at https://github.com/chenlong164/PEIT.
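To make the second step concrete, the Python sketch below shows what one synthesized instruction record for the multi-constraint molecule generation task could look like. It is a minimal illustration only: the field names, property values, and target SMILES are assumptions chosen for exposition, not the schema of the released instruction data.

import json

# Hypothetical instruction record for multi-constraint molecule generation;
# field names and values are illustrative, not the released PEIT format.
record = {
    "instruction": (
        "Generate a SMILES string for a molecule satisfying the "
        "following biochemical property constraints."
    ),
    "constraints": {
        "molecular_weight": 180.16,  # g/mol
        "logP": 1.2,                 # octanol-water partition coefficient
        "QED": 0.55,                 # quantitative estimate of drug-likeness
    },
    # Target SMILES (aspirin), used here as a placeholder answer.
    "output": "CC(=O)Oc1ccccc1C(=O)O",
}

print(json.dumps(record, indent=2))

Records of this shape, paired with the inverse tasks (captioning and property prediction), would form the instruction-tuning corpus fed to the open-source LLM in step two.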
