YouTube-VOS: Sequence-to-Sequence Video Object Segmentation

Learning long-term spatial-temporal features is critical for many video analysis tasks. However, existing video segmentation methods predominantly rely on static image segmentation techniques, and methods that do capture temporal dependency for segmentation depend on pretrained optical flow models, leading to suboptimal solutions for the problem. End-to-end sequential learning to explore spatial-temporal features for video segmentation is largely limited by the scale of available video segmentation datasets, i.e., even the largest video segmentation dataset contains only 90 short video clips. To solve this problem, we build a new large-scale video object segmentation dataset called the YouTube Video Object Segmentation dataset (YouTube-VOS). Our dataset contains 3,252 YouTube video clips and 78 categories covering common objects and human activities. To our knowledge, this is by far the largest video object segmentation dataset, and we have released it at https://youtube-vos.org. Based on this dataset, we propose a novel sequence-to-sequence network to fully exploit long-term spatial-temporal information in videos for segmentation. We demonstrate that our method achieves the best results on our YouTube-VOS test set and results comparable to current state-of-the-art methods on DAVIS 2016. Experiments show that the large-scale dataset is indeed a key factor in the success of our model.
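The abstract does not detail the sequence-to-sequence architecture, but a common way to carry long-term spatial-temporal context across frames without optical flow is a convolutional LSTM, where the recurrent state is itself a feature map and all gates are computed by convolutions. The sketch below is illustrative only, not the paper's implementation: the layer sizes, helper names (`conv2d`, `ConvLSTMCell`), and random inputs are assumptions made for a minimal, self-contained NumPy demo.

```python
import numpy as np

def conv2d(x, w):
    # 'same' convolution of a (C_in, H, W) tensor with (C_out, C_in, k, k) kernels
    c_out, c_in, k, _ = w.shape
    pad = k // 2
    _, h, wd = x.shape
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros((c_out, h, wd))
    for o in range(c_out):
        for i in range(c_in):
            for dy in range(k):
                for dx in range(k):
                    out[o] += w[o, i, dy, dx] * xp[i, dy:dy + h, dx:dx + wd]
    return out

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class ConvLSTMCell:
    """One step of a convolutional LSTM: the hidden and cell states are
    feature maps, and the four gates are convolutions over the
    concatenated input frame features and previous hidden state."""
    def __init__(self, c_in, c_hid, k=3, seed=0):
        rng = np.random.default_rng(seed)
        # a single weight tensor producing all four gates (i, f, o, g) at once
        self.w = rng.standard_normal((4 * c_hid, c_in + c_hid, k, k)) * 0.01

    def step(self, x, h, c):
        z = conv2d(np.concatenate([x, h], axis=0), self.w)
        i, f, o, g = np.split(z, 4, axis=0)
        c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)  # update memory
        h_new = sigmoid(o) * np.tanh(c_new)               # emit hidden map
        return h_new, c_new

# Run a short synthetic "video" through the cell: the hidden state
# accumulates temporal context from frame to frame, which a decoder
# could then turn into a per-frame segmentation mask.
cell = ConvLSTMCell(c_in=1, c_hid=2)
h = np.zeros((2, 8, 8))
c = np.zeros((2, 8, 8))
for t in range(5):
    frame = np.random.default_rng(t).standard_normal((1, 8, 8))
    h, c = cell.step(frame, h, c)
print(h.shape)  # (2, 8, 8): hidden feature map, same spatial size as input
```

In a full model, `x` would be encoder features of each frame rather than raw pixels, and `h` would feed a decoder that upsamples to the segmentation mask; the recurrence is what lets training be end-to-end over whole clips instead of relying on a pretrained flow model.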