HyperAI

Selected for ECCV 2024! Zhejiang University and Microsoft Research Asia Jointly Proposed the Unified Medical Image Pre-training Framework UniMedI, Breaking the Barriers of Medical Data Heterogeneity


AI researchers are constantly striving to give AI human-like perception and decision-making abilities under specific conditions, so that it can take over well-defined tasks efficiently. At the intersection of medical imaging and artificial intelligence, for example, deep models based on vision-language pre-training (VLP) can be pre-trained on large collections of images and their corresponding text, and learn to automatically extract relevant features from new images, greatly reducing the need for time-consuming and labor-intensive manual annotation.

However, although VLP has achieved a degree of success in the medical field, it still faces many challenges when scaling up to larger and more diverse data.

First, most existing models are trained on single-modality data (mainly 2D images such as X-rays), which is inconsistent with real clinical scenarios that involve multimodal images (both 2D and 3D, such as CT and MRI). Second, the inherent heterogeneity of medical images across modalities hinders their effective collaboration and integration. In addition, images of different modalities differ in dimensionality and lack paired data. How to build a unified model that effectively maps these different modalities into a common space for joint learning has therefore become an extremely challenging problem.

To address these problems, Hu Haoji's team at Zhejiang University and Qiu Lili's team at Microsoft Research Asia proposed a new unified medical image pre-training framework, UniMedI. It uses diagnostic reports as a common semantic space to create unified representations for medical images of different modalities. It also introduces a technique for creating "pseudo-pairs": under the guidance of text, UniMedI selects 2D slices relevant to the report from complex 3D volumes, and these slices act as pseudo-pairs bridging 2D and 3D data, enhancing the consistency between imaging modalities and effectively integrating multimodal medical images.

The research, titled "Unified Medical Image Pre-training in Language-Guided Common Semantic Space", was accepted to ECCV 2024, a top conference in computer vision and machine learning.

For more information about the conference, please click the link below:

https://go.hyper.ai/0wtVi

Research highlights:
* In experiments, UniMedI demonstrates excellent performance on both 2D and 3D images across multiple datasets, and excels in a wide range of medical tasks such as image classification, segmentation, and retrieval

* UniMedI handles 2D and 3D images in a unified way, helping alleviate the data scarcity problem in the medical field


Paper address:
https://eccv.ecva.net/virtual/2024/poster/1165

The open source project "awesome-ai4s" brings together more than 100 AI4S paper interpretations and provides massive data sets and tools:

https://github.com/hyperai/awesome-ai4s

Real medical data, effective verification framework

The data used for pre-training the UniMedI framework comes from the JPG version of the 2D X-ray dataset MIMIC-CXR 2.0.0 and the 3D CT scan dataset BIMCV.

The researchers preprocessed the 2D dataset by removing all lateral views, to align with downstream tasks that use only frontal views. In addition, to maintain data quality, reports with fewer than three sentences were excluded from both the 2D and 3D datasets.

In terms of images, the size of 2D images is 224 × 224, and the size of 3D images is 128 × 128 × 32.
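As a rough illustration of this preprocessing step, the sketch below resizes a 2D X-ray to 224 × 224 and a 3D CT volume to 128 × 128 × 32 with PyTorch interpolation; the tensor layouts and interpolation modes are assumptions for illustration, not the authors' exact pipeline.

```python
# A minimal preprocessing sketch (not the authors' exact pipeline), assuming
# 2D inputs of shape (C, H, W) and 3D inputs of shape (C, D, H, W).
import torch
import torch.nn.functional as F

def preprocess_2d(xray: torch.Tensor) -> torch.Tensor:
    """Resize a 2D X-ray (C, H, W) to 224 x 224."""
    return F.interpolate(xray.unsqueeze(0), size=(224, 224),
                         mode="bilinear", align_corners=False).squeeze(0)

def preprocess_3d(ct: torch.Tensor) -> torch.Tensor:
    """Resize a 3D CT volume (C, D, H, W) to 32 x 128 x 128."""
    return F.interpolate(ct.unsqueeze(0), size=(32, 128, 128),
                         mode="trilinear", align_corners=False).squeeze(0)

if __name__ == "__main__":
    print(preprocess_2d(torch.rand(1, 1024, 1024)).shape)    # (1, 224, 224)
    print(preprocess_3d(torch.rand(1, 64, 256, 256)).shape)  # (1, 32, 128, 128)
```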

The research team pre-trained the UniMedI framework for 50 epochs on 8 Tesla V100 GPUs with a batch size of 144.

In the experimental evaluation, the team first performed medical image classification on 2D and 3D datasets. Three representative 2D datasets were used: CheXpert, which contains 191,229 frontal chest radiographs; the stage-2 version of the RSNA Pneumonia dataset, which contains approximately 29,700 frontal chest radiographs; and a COVID dataset containing 16,490 COVID-19-positive images from more than 2,800 patients.

The team then performed classification on two representative 3D datasets: CC-CCII and LUNA 16. For CC-CCII, the Clean-CC-CCII version was used, which contains 340,190 slices from 3,993 scans of 2,698 patients; LUNA 16, built on LIDC-IDRI, contains 888 annotated CT scans. CT scans with a slice thickness greater than 3 mm in the LIDC-IDRI database were excluded from the experiment.

Layered collaboration mechanism breaks down data barriers

UniMedI, proposed in this study, is a vision-language pre-training framework. Medical images and their text reports are encoded by a vision encoder and a text encoder, respectively, and then jointly learned through vision-language (VL) contrastive learning. What makes UniMedI unique is that it can effectively leverage 2D and 3D images in a unified way, alleviating the data scarcity problem in the medical field. The overall framework of UniMedI is shown on the left side of the figure below:

UniMedI overall framework: the left side is the overall process, the right side is the key design

In the experiments, the vision encoder is ViT-B/16, which extracts representations in the common feature space of 2D and 3D visual data, and the text encoder is BioClinicalBERT, which encodes text features. Both encoders are shared across 2D and 3D data.
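As a hedged sketch (not the authors' code), the snippet below shows how one might instantiate comparable backbones: a ViT-B/16 vision encoder via timm and BioClinicalBERT via Hugging Face Transformers. The specific model identifiers and the pooling of the text [CLS] token are assumptions.

```python
# Dual-encoder sketch: shared ViT-B/16 vision encoder and BioClinicalBERT text encoder.
import timm
import torch
from transformers import AutoModel, AutoTokenizer

vision_encoder = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=0)
text_tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
text_encoder = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")

images = torch.rand(2, 3, 224, 224)                          # a toy batch of 2D images
reports = text_tokenizer(["No acute cardiopulmonary process."] * 2,
                         return_tensors="pt", padding=True)

img_feat = vision_encoder(images)                            # (2, 768) pooled visual features
txt_feat = text_encoder(**reports).last_hidden_state[:, 0]   # (2, 768) text [CLS] features
```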

To overcome the challenge that paired 2D and 3D image data do not exist, the research team introduced a method for creating "pseudo-pairs" in UniMedI, designed around a novel language-guided attention-based slice selection strategy.

For example, when the input is a 3D image, the 2D slices most relevant to the report are extracted from it; the selected slices are then treated as 2D images, forming a pseudo-pairing relationship between 2D and 3D data. The selected 2D slices are fed into the network together with the original 3D image, so that their relationships with the report can be learned jointly, eventually forming a unified feature space. When the input is a 2D image, the slice selection step is skipped.
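The following is a conceptual sketch of such attention-guided slice selection, assuming the attention weights of the visual [CLS] token over the 3D volume's patch tokens are already available; the aggregation scheme and the number of selected slices are illustrative assumptions, not the paper's implementation.

```python
# Conceptual sketch: pick the slices that the language-supervised [CLS] token attends to most.
import torch

def select_slices(volume: torch.Tensor,
                  cls_attn: torch.Tensor,
                  patches_per_slice: int,
                  k: int = 4) -> torch.Tensor:
    """volume: (D, H, W) CT volume; cls_attn: (D * patches_per_slice,) attention of the
    visual [CLS] token over patch tokens, ordered slice by slice."""
    # Aggregate patch-level attention into one score per slice.
    slice_scores = cls_attn.view(volume.shape[0], patches_per_slice).mean(dim=1)
    # Keep the k highest-scoring slices; these act as the "pseudo-paired" 2D images.
    topk = torch.topk(slice_scores, k=min(k, volume.shape[0])).indices
    return volume[topk]                                     # (k, H, W) selected slices

selected = select_slices(torch.rand(32, 128, 128), torch.rand(32 * 64), patches_per_slice=64)
print(selected.shape)                                       # torch.Size([4, 128, 128])
```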

Afterwards, the vision encoder maps all multimodal images (the original 2D and 3D images and the selected 2D slices) into the representation space. The vision encoder consists of separate tokenizers T₂D and T₃D for 2D and 3D images and a shared backbone Eᵥ for better integration. The model, composed of the vision encoder Eᵥ and the text encoder Eₗ, is learned end-to-end through vision-language pre-training with a contrastive loss Lᵥₗ. In this process, both 2D and 3D images are encoded into a common semantic space supervised by the language information in the reports.
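A minimal sketch of such a vision-language contrastive objective, in the spirit of CLIP-style symmetric InfoNCE, is shown below; the temperature value and the symmetric formulation are assumptions rather than the paper's exact definition of Lᵥₗ.

```python
# Symmetric image-report contrastive loss over a batch of paired embeddings.
import torch
import torch.nn.functional as F

def vl_contrastive_loss(img_feat: torch.Tensor,
                        txt_feat: torch.Tensor,
                        temperature: float = 0.07) -> torch.Tensor:
    """img_feat, txt_feat: (N, d) embeddings of paired images and reports."""
    img = F.normalize(img_feat, dim=-1)
    txt = F.normalize(txt_feat, dim=-1)
    logits = img @ txt.t() / temperature            # (N, N) similarity matrix
    targets = torch.arange(img.size(0), device=img.device)
    # Cross-entropy in both directions: match each image to its report and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = vl_contrastive_loss(torch.randn(8, 768), torch.randn(8, 768))
```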

To make full use of the multimodal medical images themselves and the information shared between them, the study also introduces an auxiliary task, masking and restoration, which is completed with a self-distillation method. This allows tokens of 2D and 3D images to communicate with each other and enhances cross-dimensional interaction and integration of multimodal images.
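The sketch below illustrates one plausible form of masking-and-restoration with self-distillation: a momentum (EMA) teacher sees the complete tokens while the student restores features from a masked view. The masking ratio, loss choice, and EMA momentum are illustrative assumptions, not the paper's exact design.

```python
# Masking-and-restoration with self-distillation (EMA teacher, masked student).
import copy
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher, student, m: float = 0.996):
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(m).add_(ps, alpha=1.0 - m)

def distillation_step(student, teacher, tokens: torch.Tensor, mask_ratio: float = 0.5):
    """tokens: (N, L, d) visual tokens from 2D and 3D images."""
    mask = torch.rand(tokens.shape[:2], device=tokens.device) < mask_ratio
    masked = tokens.masked_fill(mask.unsqueeze(-1), 0.0)    # zero out masked tokens
    with torch.no_grad():
        target = teacher(tokens)                            # teacher sees the full input
    pred = student(masked)                                  # student restores from the rest
    return F.mse_loss(pred[mask], target[mask])

student = torch.nn.Linear(768, 768)                         # stand-in for the real heads
teacher = copy.deepcopy(student).requires_grad_(False)
loss = distillation_step(student, teacher, torch.randn(4, 196, 768))
ema_update(teacher, student)
```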

It is worth noting that one of the highlights of UniMedI is the synergistic effect of the attention slice selection strategy and VL contrastive learning.

* On the one hand, VL contrastive learning provides language supervision, which is applied directly to the visual [CLS] token. This token carries important information from the report, so the attention weights of the visual [CLS] token, used as the basis for 2D slice selection, carry the supervision signal from the report and help construct a joint feature space together with the 3D features.

* On the other hand, careful slice selection makes the 2D and 3D feature spaces better integrated, even without paired data. This common space amplifies the detailed correspondence between medical images and reports and thereby promotes image-report alignment. Together, the two designs bring the representations of multimodal images closer to the report representation space, so that the whole is greater than the sum of its parts in building a common semantic space.

Multi-angle experimental evaluation shows that the performance surpasses UniMiss

To evaluate UniMedI comprehensively, the study examined it from multiple angles and verified its performance and effectiveness through comparative analysis against a variety of medical VLP methods.

First, the research team compared UniMedI with methods including ConVIRT, GLoRIA, MGCA, LOVT, PRIOR, etc., which are tailored for X-rays and their corresponding medical reports; then, the research team compared UniMedI with several 2D and 3D joint learning methods, including UniMiss and Joint.

The linear classification results show that, for 2D medical image classification (see the figure below), UniMedI performs best on all three 2D datasets under different fractions of training data (1%, 10%, 100%), compared with the state-of-the-art MGCA (ViT-B/16) method, which also uses ViT as its vision encoder.

* Linear classification experiment: used to evaluate the representation ability of UniMedI; a rough sketch of the protocol appears below
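For reference, a generic linear-probe sketch of this evaluation protocol (freeze the pre-trained vision encoder and train only a linear classifier on 1%, 10%, or 100% of the labels) might look as follows; the data loader, class count, and optimizer settings are placeholders, not the paper's configuration.

```python
# Generic linear probe: frozen backbone, trainable linear head.
import torch
import torch.nn as nn

def linear_probe(encoder: nn.Module, loader, num_classes: int, feat_dim: int = 768,
                 epochs: int = 10, lr: float = 1e-3) -> nn.Linear:
    encoder.eval()                                   # backbone stays frozen
    head = nn.Linear(feat_dim, num_classes)
    opt = torch.optim.Adam(head.parameters(), lr=lr)
    for _ in range(epochs):
        for images, labels in loader:
            with torch.no_grad():
                feats = encoder(images)              # (B, feat_dim) frozen representations
            loss = nn.functional.cross_entropy(head(feats), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return head
```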

Compared with MGCA, UniMedI improves AUROC on the CheXpert dataset by +0.6%, +0.6%, and +0.8%; on the RSNA dataset by +0.9%, +0.5%, and +0.7%; and on the COVID dataset by +5.5%, +7.6%, and +2.3%, respectively. These results demonstrate the effectiveness of the proposed method.

2D linear classification results on CheXpert, RSNA and COVID datasets with 1%, 10% and 100% training data

For 3D medical image classification (see the figure below), compared with the state-of-the-art UniMiss, UniMedI achieves ACC gains of +22.6%, +2.0%, and +0.8% on the CC-CCII dataset, respectively. These results verify the data efficiency and effectiveness of UniMedI.

3D linear classification results on CC-CCII with 1%, 10%, and 100% training data

At the same time, when the full visual encoder is fine-tuned with the complete training data, UniMedI outperforms other methods on multiple 3D medical image datasets including CC-CCII and LUNA.

As shown in the figure below, UniMedI achieves an ACC of 93.8% on the CC-CCII dataset and 95.9% on the LUNA2016-v2 dataset, showing significant generalization ability in 2D and 3D medical image classification tasks and indicating that the framework can extract universal features from 3D CT images.

3D fine-tuning results on CC-CCII and RICORD datasets with full training data

The medical semantic segmentation results show that, for 2D segmentation, UniMedI significantly outperforms the current state-of-the-art MGCA, reaching a Dice of 67.8% with only 1% of the training data. For 3D segmentation, compared with UniMiss on the BCV dataset, UniMedI improves accuracy by 0.6% and 0.4% when 40% and 100% of the labels are available, respectively, as shown in the figure below.

* Medical semantic segmentation experiment: used to evaluate segmentation performance on RSNA pneumonia frontal chest radiographs and the BCV dataset (50 CT scans)

These results validate UniMedI’s strong superiority in extracting meaningful features and effectively utilizing limited annotated data, demonstrating its higher proficiency in leveraging local representations for semantic segmentation tasks.

Technology helps deepen the bond between VLP and medical imaging

Vision-language pre-training models are becoming an important bridge between computer vision and natural language processing, especially in medical imaging. Through pre-training on large-scale visual and language data, they can capture the complex relationships between medical images and text, assisting doctors in image-based diagnosis, helping companies with drug development, and enabling intelligent medical image management.

The selection of this research by a top international conference also demonstrates, from another perspective, the huge potential of VLP at the intersection of artificial intelligence and medical imaging. In fact, beyond the collaboration between Zhejiang University and Microsoft Research Asia, many laboratories have already made breakthroughs in this field.

For example, UniMiss, one of the advanced methods mentioned above, was published at ECCV 2022 by a team from the University of Adelaide and the School of Computer Science at Northwestern Polytechnical University, under the title "UniMiss: Universal Medical Self-Supervised Learning via Breaking Dimensionality Barrier".

Paper address:
https://dl.acm.org/doi/abs/10.1007/978-3-031-19803-8_33

In that study, the authors advocate using a large number of 2D images to compensate for the lack of 3D data, aiming to establish a general medical self-supervised representation learning framework named UniMiss. Experimental results show that UniMiss offers great advantages over ImageNet pre-training and other advanced SSL (self-supervised learning) counterparts, delivering satisfactory results on both segmentation and classification in 2D/3D medical image analysis tasks.

In July this year, the team conducted a new round of research on UniMiss and proposed UniMiss+. The results were published in the well-known international journal IEEE Transactions on Pattern Analysis and Machine Intelligence under the title "UniMiSS+: Universal Medical Self-Supervised Learning From Cross-Dimensional Unpaired Data".

Paper address:
https://ieeexplore.ieee.org/document/10617802

In the latest research, the team introduced digitally reconstructed radiograph technology into UniMiss+ to simulate X-ray images from CT scans, thereby obtaining paired CT and X-ray data, a substantial improvement over the previous generation of UniMiss.

In short, research integrating artificial intelligence and medical imaging is in full swing. In time, these achievements will be turned into applications deployed in real medical scenarios, becoming new tools that benefit medical staff, patients, and enterprises.