Towards All-in-one Pre-training via Maximizing Multi-modal Mutual Information

To effectively exploit the potential of large-scale models, various pre-training strategies supported by massive data from different sources have been proposed, including supervised pre-training, weakly-supervised pre-training, and self-supervised pre-training. It has been shown that combining multiple pre-training strategies and data from various modalities/sources can greatly boost the training of large-scale models. However, current works adopt a multi-stage pre-training system, where the complex pipeline may increase the uncertainty and instability of the pre-training. It is thus desirable that these strategies can be integrated in a single-stage manner. In this paper, we first propose a general multi-modal mutual information formula as a unified optimization target and demonstrate that all existing approaches are special cases of our framework. Under this unified perspective, we propose an all-in-one single-stage pre-training approach, named Maximizing Multi-modal Mutual Information Pre-training (M3I Pre-training). Our approach achieves better performance than previous pre-training methods on various vision benchmarks, including ImageNet classification, COCO object detection, LVIS long-tailed object detection, and ADE20k semantic segmentation. Notably, we successfully pre-train a billion-level parameter image backbone and achieve state-of-the-art performance on various benchmarks. Code shall be released at https://github.com/OpenGVLab/M3I-Pretraining.
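As a rough illustration of the kind of objective the abstract refers to (not the paper's exact formulation), mutual information between representations of two modalities or views is commonly maximized through a tractable lower bound such as InfoNCE. In the sketch below, the encoders f_theta and g_phi, the similarity function sim, the temperature tau, and the batch size N are illustrative assumptions rather than the paper's definitions:

% A minimal sketch, assuming an InfoNCE-style lower bound on the mutual
% information I(y_1; y_2) between encoded representations y_1 = f_\theta(x_1)
% and y_2 = g_\phi(x_2) of two modalities/views; sim, \tau, and N are
% illustrative choices, not taken from the paper.
\begin{equation*}
  I(y_1; y_2) \;\ge\;
  \mathbb{E}\!\left[
    \log \frac{\exp\!\big(\mathrm{sim}(y_1^{(i)}, y_2^{(i)})/\tau\big)}
              {\tfrac{1}{N}\sum_{j=1}^{N} \exp\!\big(\mathrm{sim}(y_1^{(i)}, y_2^{(j)})/\tau\big)}
  \right].
\end{equation*}

Under this reading, supervised, weakly-supervised, and self-supervised pre-training would differ mainly in which pairs (x_1, x_2) are drawn (image/label, image/caption, or two augmented views), which is consistent with, though not identical to, the unified view the abstract describes.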