HyperAIHyperAI

Command Palette

Search for a command to run...

SimVTP: Simple Video Text Pre-training with Masked Autoencoders

Yue Ma Tianyu Yang Yin Shan Xiu Li

Abstract

This paper presents SimVTP: a Simple Video-Text Pretraining framework via masked autoencoders. We randomly mask out the spatial-temporal tubes of input video and the word tokens of input text and then feed them into a unified autencoder to reconstruct the missing pixels and words. Our SimVTP has several properties: 1) Thanks to the unified autoencoder, SimVTP reconstructs the masked signal of one modality with the help from another modality, which implicitly learns the cross-modal alignment between video tubes and text tokens. 2) SimVTP not only benefits from a high video masking ratio (e.g. 90%) due to the temporal redundancy of video, but also needs a high text masking ratio (e.g. 75%), which is much higher than BERT (e.g. 15%), to achieve optimal performance. This is because the aid of video modality makes text reconstruction less challenging, which thus needs a higher mask ratio to make the pretext harder for useful feature learning. 3) Equipping SimVTP with video-text contrastive learning (VTC) and video-text matching (VTM), which are two commonly used cross-modal training strategies, could further improve the transferable performance significantly. 4) SimVTP is dataefficent, e.g., pre-training only on 10% data of WebVid-2M, SimVTP achieves surprisingly good results (43.8 R@1) on MSRVTT, which is far above recent state-of-the-art methods pre-trained on both CC3M and WebVid-2M. We transfer our pre-trained model to various downstream tasks and achieve superior performance. The codes and models will be released at https://github.com/mayuelala/SimVTP.


Build AI with AI

From idea to launch — accelerate your AI development with free AI co-coding, out-of-the-box environment and best price of GPUs.

AI Co-coding
Ready-to-use GPUs
Best Pricing

HyperAI Newsletters

Subscribe to our latest updates
We will deliver the latest updates of the week to your inbox at nine o'clock every Monday morning
Powered by MailChimp
SimVTP: Simple Video Text Pre-training with Masked Autoencoders | Papers | HyperAI