
Could Giant Pretrained Image Models Extract Universal Representations?

Yutong Lin, Ze Liu, Zheng Zhang, Han Hu, Nanning Zheng, Stephen Lin, Yue Cao
Abstract

Frozen pretrained models have become a viable alternative to the pretraining-then-finetuning paradigm for transfer learning. However, with frozen models there are relatively few parameters available for adapting to downstream tasks, which is problematic in computer vision where tasks vary significantly in input/output format and the type of information that is of value. In this paper, we present a study of frozen pretrained models when applied to diverse and representative computer vision tasks, including object detection, semantic segmentation and video action recognition. From this empirical analysis, our work answers the questions of what pretraining task fits best with this frozen setting, how to make the frozen setting more flexible to various downstream tasks, and the effect of larger model sizes. We additionally examine the upper bound of performance using a giant frozen pretrained model with 3 billion parameters (SwinV2-G) and find that it reaches competitive performance on a varied set of major benchmarks with only one shared frozen base network: 60.0 box mAP and 52.2 mask mAP on COCO object detection test-dev, 57.6 val mIoU on ADE20K semantic segmentation, and 81.7 top-1 accuracy on Kinetics-400 action recognition. With this work, we hope to bring greater attention to this promising path of freezing pretrained image models.
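To make the frozen setting concrete, the sketch below shows the basic recipe the abstract refers to: one pretrained backbone is kept frozen while only a lightweight task-specific head is trained per downstream task. This is an illustrative assumption of the setup, not the paper's actual method; a torchvision ResNet-50 stands in for the SwinV2 backbone, and a plain linear classification head stands in for the paper's detection, segmentation, and recognition heads.

```python
# Hedged sketch: freeze a pretrained backbone, train only a per-task head.
# ResNet-50 and the linear head are stand-ins, not the paper's components.
import torch
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights

backbone = resnet50(weights=ResNet50_Weights.IMAGENET1K_V2)
backbone.fc = nn.Identity()           # expose 2048-d pooled features
for p in backbone.parameters():
    p.requires_grad = False           # freeze: backbone weights never update
backbone.eval()                       # keep normalization statistics fixed

# Only the head is trainable; a new head would be attached per downstream task.
head = nn.Linear(2048, 400)           # e.g. 400 classes, Kinetics-400-style
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)

images = torch.randn(4, 3, 224, 224)  # dummy batch for illustration
labels = torch.randint(0, 400, (4,))

with torch.no_grad():                 # frozen backbone needs no gradient graph
    feats = backbone(images)
logits = head(feats)
loss = nn.functional.cross_entropy(logits, labels)
loss.backward()                       # gradients flow only into the head
optimizer.step()
```

The design point is that the expensive shared computation (the backbone forward pass) can be reused across tasks, while each task pays only for its own small head.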