AIR Flash News | BioMedGPT-10B: The World's First Open-Source, Commercially Usable 10-Billion-Parameter Multimodal Biomedical Model - Tsinghua University Institute for AI Industry Research
**Abstract:** The Institute for AI Industry Research (AIR) at Tsinghua University, in collaboration with Water Molecular, has released BioMedGPT-10B, the world's first open-source, commercially usable large multimodal biomedical model, with 10 billion parameters. The model targets professional biomedical question answering, where it achieves state-of-the-art (SOTA) performance approaching that of human experts. BioMedGPT-LM-7B, a version of the Llama 2 large language model specialized for the biomedical domain, has also been made freely available for commercial use. The collaboration further includes the open-source release of DrugFM, a foundation model for small molecule drugs intended to support research and applications in drug discovery and development.

**Motivation and Framework:** The primary motivation behind BioMedGPT is to bridge the gap between natural language and the coding languages of chemistry and biology. According to Professor Zaiqing Nie, Chief Researcher at AIR and Chief Scientist at Water Molecular, life phenomena can in essence be viewed as a form of naturally evolved code. By integrating human knowledge with amino acid, molecular, and protein data in a unified large-model framework, the team aims to understand the mechanisms of biological coding and to drive life-science research and applications from the ground up. The BioMedGPT framework uses a pre-trained large language model, BioMedGPT-LM, as a bridge connecting natural language, biological coding languages, and chemical molecular languages. BioMedGPT-LM is fine-tuned on extensive biomedical data, which allows it to perform well on tasks in the biomedical domain.
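
The "bridge" role of the language model can be pictured with a small sketch. The article does not specify how BioMedGPT wires its encoders to the language model, so everything here is an assumption made for illustration: the linear projection, the tiny dimensions, the random weights standing in for learned ones, and the hypothetical helper names `make_projection`, `project`, and `build_llm_input`. The idea shown is only that features from a molecule (or protein) encoder are mapped to the same width as the language model's token embeddings and placed alongside the text tokens.

```python
import random


def make_projection(in_dim, out_dim, seed=0):
    """Build a random in_dim x out_dim matrix (stand-in for learned weights)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(out_dim)] for _ in range(in_dim)]


def project(features, weights):
    """Map each feature vector to the language model's embedding width."""
    out_dim = len(weights[0])
    return [
        [sum(vec[i] * weights[i][j] for i in range(len(vec))) for j in range(out_dim)]
        for vec in features
    ]


def build_llm_input(text_embeddings, modality_features, weights):
    """Prepend projected modality vectors to the text token embeddings."""
    return project(modality_features, weights) + text_embeddings


# Hypothetical sizes: 4-dim molecule features, 6-dim language model embeddings.
MOL_DIM, LLM_DIM = 4, 6
mol_features = [[0.5, -1.0, 0.25, 2.0], [1.0, 0.0, -0.5, 0.75]]  # 2 molecule vectors
text_embeddings = [[0.0] * LLM_DIM for _ in range(3)]            # 3 text tokens

weights = make_projection(MOL_DIM, LLM_DIM)
llm_input = build_llm_input(text_embeddings, mol_features, weights)
print(len(llm_input), len(llm_input[0]))  # prints: 5 6
```

In this toy setup the language model would receive five vectors of equal width, two derived from the molecule and three from the text, which is one simple way to let a single model attend over both kinds of input.
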
BioMedGPT-LM integrates multiple modalities of biological encoding, including molecular, protein, cellular, and gene expression data, along with knowledge graphs, documents, numerical experimental results, and other specialized knowledge. A cross-modal feature fusion module maps the different biological and chemical coding languages, together with natural language, into a shared feature space.

**BioMedGPT-10B:** BioMedGPT-10B, a concrete implementation of the BioMedGPT framework, establishes a unified feature space for text, molecules, and proteins. It supports interactive question answering across these modalities, which makes it applicable to tasks such as drug target exploration, lead compound design and optimization, and protein design. The model has been tested on several benchmark datasets and shows strong language understanding in the biomedical field. On the PubMedQA dataset, for instance, BioMedGPT-10B achieved an accuracy of 76.1%, just 1.9 percentage points below the human expert level. In out-of-domain (OOD) settings, BioMedGPT-10B outperformed the human baseline with an accuracy of 50.4%, the only model to do so apart from ChatGPT, which has over 17 times as many parameters.

**Cross-Modal QA Tasks:**

1. **Molecular Natural Language Cross-Modal QA:** This task involves generating a natural language description of a given molecule and supporting follow-up interactive questions to explore related information. BioMedGPT-10B was evaluated on the ChEBI-20 dataset, a standard benchmark for molecule-to-text generation, where it outperformed general-purpose language models.
2. **Protein Natural Language Cross-Modal QA:** This task involves generating a natural language description of a given protein sequence and supporting follow-up questions to aid drug target discovery and research.
Experiments on the UniProt QA dataset showed that BioMedGPT-10B significantly outperformed other models, including the unaligned Llama2-7B-chat, which could not interpret the input protein data or give accurate, informative answers. BioMedGPT-10B, by contrast, gave precise and comprehensive responses, for example describing the role of protein P52341 in thymidine nucleotide biosynthesis.

**BioMedGPT-LM:** BioMedGPT-LM, the language model component of BioMedGPT, is trained on a large corpus of biomedical literature. It performs strongly on biomedical language tasks and has achieved leading results on three benchmark datasets: USMLE, MedMCQA, and PubMedQA. Its proficiency in professional biomedical question answering is comparable to that of medical experts, and it has passed the United States Medical Licensing Examination (USMLE).

**MolFM/DrugFM:** Alongside BioMedGPT-10B, the team has released MolFM and DrugFM, foundation models for small molecule drugs. MolFM, developed by Professor Nie's team at AIR, is the first model to represent molecular structures, biomedical literature, and knowledge bases in a unified way. It uses a cross-modal attention mechanism that connects atoms and molecular entities with their neighbors and with related semantic text. By minimizing the feature-space distance between different modalities of the same molecule, and between molecules with similar structures or functions, MolFM captures both local and global molecular knowledge and improves cross-modal understanding. Its effectiveness has been validated on a range of downstream tasks, including cross-modal retrieval, molecule captioning, text-based molecule generation, and molecular property prediction.

DrugFM, developed in collaboration with the Tsinghua AIR-Zhiyuan Joint Research Center for Health Computing, builds on the UniMAP pre-training model for small molecule drugs.
This model is designed to better capture the underlying organizational principles of small molecule drugs and to improve how they are represented as data. DrugFM combines the UniMAP pre-training model with the existing multimodal foundation model MolFM and achieves SOTA performance on cross-modal retrieval tasks. As a foundation model for small molecule drug research, DrugFM will continue to evolve, supporting and improving downstream tasks such as drug screening, design, and optimization.

**Water Molecular and Tsinghua AIR-Zhiyuan Joint Research Center:** Water Molecular, a spin-off from the Institute for AI Industry Research (AIR) at Tsinghua University, is led by Professor Guoqiang Zhang and Professor Zaiqing Nie. The company focuses on developing foundation models for the biomedical industry and on building next-generation conversational biomedical research assistants. Its products are intended to serve multiple stages of drug development, including early research planning, target discovery, molecular design and optimization, clinical trial design, and drug repositioning.

The Tsinghua AIR-Zhiyuan Joint Research Center for Health Computing, established in August 2021, is a partnership between Tsinghua University's Institute for AI Industry Research and the Beijing Academy of Artificial Intelligence. The center is dedicated to using AI to transform the life and health sector from isolated, open-loop systems into collaborative, closed-loop ones. Its research collaborations focus on areas such as intelligent biocomputation, intelligent new drug development, and proactive health management.

**Conclusion:** The open-source release of BioMedGPT-10B and DrugFM is a significant step in the integration of AI and biomedicine. With their cross-modal capabilities, these models are positioned to accelerate research and development in drug discovery, protein design, and other biomedical applications.
By providing a robust, commercially usable foundation, the models are intended to help the biomedical community gain more accurate and comprehensive insights, ultimately driving innovation and improving patient outcomes.
