VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking

Scale is the primary factor for building a powerful foundation model that generalizes well to a variety of downstream tasks. However, it is still challenging to train video foundation models with billions of parameters. This paper shows that the video masked autoencoder (VideoMAE) is a scalable and general self-supervised pre-trainer for building video foundation models. We scale VideoMAE in both model and data with a core design. Specifically, we present a dual masking strategy for efficient pre-training, with an encoder operating on a subset of video tokens and a decoder processing another subset of video tokens. Although VideoMAE is already efficient due to the high masking ratio in the encoder, masking the decoder further reduces the overall computational cost. This enables the efficient pre-training of billion-parameter models on video. We also use a progressive training paradigm that involves initial pre-training on a diverse multi-sourced unlabeled dataset, followed by post-pre-training on a mixed labeled dataset. Finally, we successfully train a video ViT model with a billion parameters, which achieves new state-of-the-art performance on the Kinetics (90.0% on K400 and 89.9% on K600) and Something-Something (68.7% on V1 and 77.0% on V2) datasets. In addition, we extensively verify the pre-trained video ViT models on a variety of downstream tasks, demonstrating their effectiveness as general video representation learners. The code and models are available at \url{https://github.com/OpenGVLab/VideoMAEv2}.
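To make the dual masking idea concrete, below is a minimal sketch of how two token subsets per clip could be sampled: a small visible subset that the encoder processes, and a subset of the remaining masked tokens that the decoder reconstructs. The function name, mask-sampling scheme (random here), and ratios are illustrative assumptions, not the authors' exact implementation.

\begin{verbatim}
# Illustrative sketch of dual masking (not the official implementation).
import torch

def dual_masking(num_tokens: int, encoder_mask_ratio: float = 0.9,
                 decoder_keep_ratio: float = 0.5, batch_size: int = 1):
    """Sample two boolean masks of shape (batch_size, num_tokens):
    - visible_mask: tokens the encoder operates on,
    - target_mask: subset of masked tokens the decoder reconstructs."""
    num_visible = int(num_tokens * (1 - encoder_mask_ratio))
    visible_mask = torch.zeros(batch_size, num_tokens, dtype=torch.bool)
    target_mask = torch.zeros(batch_size, num_tokens, dtype=torch.bool)
    for b in range(batch_size):
        perm = torch.randperm(num_tokens)
        visible_idx = perm[:num_visible]      # tokens the encoder sees
        masked_idx = perm[num_visible:]       # tokens hidden from the encoder
        # The decoder reconstructs only part of the masked tokens,
        # which cuts decoder compute relative to reconstructing all of them.
        num_targets = int(len(masked_idx) * decoder_keep_ratio)
        visible_mask[b, visible_idx] = True
        target_mask[b, masked_idx[:num_targets]] = True
    return visible_mask, target_mask

# Example: 8 temporal slices x 14 x 14 spatial patches = 1568 tokens per clip.
vis, tgt = dual_masking(num_tokens=1568)
print(vis.sum().item(), "visible tokens;", tgt.sum().item(), "decoder targets")
\end{verbatim}

In this sketch the decoder loss would be computed only on the tokens selected by target_mask, which is how masking the decoder can reduce overall pre-training cost on top of the encoder's high masking ratio.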