
HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips

Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, Josef Sivic
Abstract

Learning text-video embeddings usually requires a dataset of video clips with manually provided captions. However, such datasets are expensive and time-consuming to create and therefore difficult to obtain at large scale. In this work, we propose instead to learn such embeddings from video data with readily available natural language annotations in the form of automatically transcribed narrations. The contributions of this work are three-fold. First, we introduce HowTo100M: a large-scale dataset of 136 million video clips sourced from 1.22M narrated instructional web videos depicting humans performing and describing over 23k different visual tasks. Our data collection procedure is fast, scalable and does not require any additional manual annotation. Second, we demonstrate that a text-video embedding trained on this data leads to state-of-the-art results for text-to-video retrieval and action localization on instructional video datasets such as YouCook2 and CrossTask. Finally, we show that this embedding transfers well to other domains: fine-tuning on generic YouTube videos (MSR-VTT dataset) and movies (LSMDC dataset) outperforms models trained on these datasets alone. Our dataset, code and models will be publicly available at: www.di.ens.fr/willow/research/howto100m/.
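A joint text-video embedding of the kind described above projects clip features and caption features into a shared space and trains the projections so that matching (clip, narration) pairs score higher than mismatched ones, commonly with a bidirectional max-margin ranking loss. The following NumPy sketch illustrates that idea only; the feature dimensions, margin value, and random projections are illustrative assumptions, not the paper's actual architecture or hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: video features, text features, shared space.
D_VIDEO, D_TEXT, D_EMBED = 4096, 300, 256

# Random matrices stand in for learned linear projection layers.
W_v = rng.normal(0, 0.01, (D_VIDEO, D_EMBED))
W_t = rng.normal(0, 0.01, (D_TEXT, D_EMBED))

def embed(x, W):
    """Project features into the shared space and L2-normalize,
    so dot products between rows are cosine similarities."""
    z = x @ W
    return z / np.linalg.norm(z, axis=1, keepdims=True)

def max_margin_loss(v, t, margin=0.1):
    """Bidirectional max-margin ranking loss over a batch of aligned
    (video, caption) pairs: each matching pair should outscore every
    mismatched pair by at least `margin`, in both retrieval directions."""
    sims = v @ t.T                     # pairwise similarity matrix
    pos = np.diag(sims)                # matching-pair scores
    cost_t = np.maximum(0, margin + sims - pos[:, None])  # video -> text
    cost_v = np.maximum(0, margin + sims - pos[None, :])  # text -> video
    np.fill_diagonal(cost_t, 0)        # do not penalize the positives
    np.fill_diagonal(cost_v, 0)
    return (cost_t.sum() + cost_v.sum()) / sims.shape[0]

# Toy batch of 8 aligned clip/narration feature pairs.
batch_v = embed(rng.normal(size=(8, D_VIDEO)), W_v)
batch_t = embed(rng.normal(size=(8, D_TEXT)), W_t)
loss = max_margin_loss(batch_v, batch_t)
print(float(loss))
```

In practice the projections are trained by gradient descent to drive this loss down; at scale, narrations transcribed by ASR serve as the captions, which is what removes the need for manual annotation.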