Video Object Segmentation using Space-Time Memory Networks

We propose a novel solution for semi-supervised video object segmentation. By the nature of the problem, the available cues (e.g., video frames with object masks) become richer as intermediate predictions accumulate. However, existing methods are unable to fully exploit this rich source of information. We resolve the issue by leveraging memory networks and learning to read relevant information from all available sources. In our framework, the past frames with object masks form an external memory, and the current frame, as the query, is segmented using the mask information in the memory. Specifically, the query and the memory are densely matched in the feature space, covering all space-time pixel locations in a feed-forward fashion. In contrast to previous approaches, this abundant use of guidance information allows us to better handle challenges such as appearance changes and occlusions. We validate our method on the latest benchmarks and achieve state-of-the-art performance (overall score of 79.4 on the YouTube-VOS val set, and J of 88.7 and 79.2 on the DAVIS 2016/2017 val sets, respectively) while maintaining a fast runtime (0.16 seconds per frame on the DAVIS 2016 val set).
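To make the dense space-time matching concrete, below is a minimal sketch of a memory read step, assuming key/value embeddings for the query frame and the memorized frames and softmax-weighted aggregation over all memory locations. The function name, tensor shapes, and NumPy implementation are illustrative, not the paper's actual code.

```python
import numpy as np

def memory_read(query_key, memory_key, memory_value):
    """Illustrative space-time memory read.

    query_key:    (C_k, H*W)    key features of the current (query) frame
    memory_key:   (C_k, T*H*W)  key features of all memorized past frames
    memory_value: (C_v, T*H*W)  value features carrying the mask information

    Every query pixel is matched against every space-time memory pixel,
    and memory values are aggregated with softmax-normalized weights.
    """
    # Similarity between each memory location and each query location
    sim = memory_key.T @ query_key                 # (T*H*W, H*W)
    sim = sim - sim.max(axis=0, keepdims=True)     # numerical stability
    weights = np.exp(sim)
    weights /= weights.sum(axis=0, keepdims=True)  # softmax over memory locations

    # Weighted sum of memory values for each query pixel
    return memory_value @ weights                  # (C_v, H*W)

# Hypothetical usage with small random features
C_k, C_v, T, H, W = 8, 16, 3, 4, 4
q_key = np.random.randn(C_k, H * W)
m_key = np.random.randn(C_k, T * H * W)
m_val = np.random.randn(C_v, T * H * W)
print(memory_read(q_key, m_key, m_val).shape)      # (16, 16)
```

In this sketch, the read-out for each query pixel would be concatenated with the query's own features and passed to a decoder to produce the mask; as new frames are segmented, their keys and values are appended to the memory, which is how the intermediate predictions enrich the available cues.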