
HuatuoGPT-Vision, Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale

Abstract

The rapid development of multimodal large language models (MLLMs), such as GPT-4V, has led to significant advancements. However, these models still face challenges in medical multimodal capabilities due to limitations in the quantity and quality of medical vision-text data, stemming from data privacy concerns and high annotation costs. While pioneering approaches utilize PubMed's large-scale, de-identified medical image-text pairs to address these limitations, they still fall short due to inherent data noise. To tackle this, we refined medical image-text pairs from PubMed and employed MLLMs (GPT-4V) in an 'unblinded' capacity to denoise and reformat the data, resulting in the creation of the PubMedVision dataset with 1.3 million medical VQA samples. Our validation demonstrates that: (1) PubMedVision can significantly enhance the medical multimodal capabilities of current MLLMs, showing significant improvement in benchmarks including the MMMU Health & Medicine track; (2) manual checks by medical experts and empirical results validate the superior data quality of our dataset compared to other data construction methods. Using PubMedVision, we train a 34B medical MLLM, HuatuoGPT-Vision, which shows superior performance in medical multimodal scenarios among open-source MLLMs.
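
The core data-construction step is that the reformatting model is "unblinded": it sees the image itself alongside the raw PubMed caption and context, so it can drop or correct text the image does not support while rewriting everything into VQA form. The sketch below illustrates that idea with a GPT-4V-class API; the prompt wording, model name, and output schema are illustrative assumptions, not the paper's exact pipeline.

```python
# Minimal sketch of "unblinded" MLLM denoising/reformatting: the model receives
# both the image and its noisy caption/context and rewrites them as VQA pairs.
# Prompt text, model name, and JSON schema are assumptions for illustration only.
import base64
import json

from openai import OpenAI

client = OpenAI()

REFORMAT_PROMPT = (
    "You are given a medical image and its raw PubMed caption and context. "
    "Rewrite them into well-formed medical VQA pairs. Return JSON of the form "
    '[{"question": "...", "answer": "..."}]. Discard any claim the image does not support.'
)

def image_caption_to_vqa(image_path: str, caption: str, context: str) -> list[dict]:
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()

    response = client.chat.completions.create(
        model="gpt-4o",  # stand-in for a GPT-4V-class reformatter
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"{REFORMAT_PROMPT}\n\nCaption: {caption}\nContext: {context}"},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    # A production pipeline would validate/repair the model output; this sketch
    # assumes the reply is clean JSON.
    return json.loads(response.choices[0].message.content)
```

Running this over each refined PubMed image-text pair would yield the kind of question-answer samples that PubMedVision aggregates at scale.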

Code Repositories

freedomintelligence/huatuogpt-vision (Official, PyTorch)
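
To inspect the released data directly, a minimal sketch using the Hugging Face `datasets` library is shown below. The repository ID `FreedomIntelligence/PubMedVision` and the field layout are assumptions based on the authors' organization name, not details stated in the abstract.

```python
# Minimal sketch for browsing PubMedVision samples with Hugging Face datasets.
# Dataset ID and field names are assumptions; if the dataset defines multiple
# configs, a config name must also be passed to load_dataset.
from datasets import load_dataset

ds = load_dataset("FreedomIntelligence/PubMedVision", split="train", streaming=True)

for i, sample in enumerate(ds):
    print(sample)  # expect image reference(s) plus question/answer-style conversations
    if i >= 2:
        break
```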
