Efficient Regional Memory Network for Video Object Segmentation

Recently, several Space-Time Memory based networks have shown that the object cues (e.g., video frames as well as the segmented object masks) from past frames are useful for segmenting objects in the current frame. However, these methods exploit the information in the memory by global-to-global matching between the current and past frames, which leads to mismatches with similar objects and high computational complexity. To address these problems, we propose a novel local-to-local matching solution for semi-supervised video object segmentation (VOS), namely the Regional Memory Network (RMNet). In RMNet, a precise regional memory is constructed by memorizing the local regions where the target objects appear in past frames. For the current query frame, the query regions are tracked and predicted based on the optical flow estimated from the previous frame. The proposed local-to-local matching effectively alleviates the ambiguity caused by similar objects in both memory and query frames, which allows information to be passed from the regional memory to the query region efficiently and effectively. Experimental results indicate that the proposed RMNet performs favorably against state-of-the-art methods on the DAVIS and YouTube-VOS datasets.
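To make the local-to-local idea concrete, here is a minimal pure-Python sketch. It is not RMNet's implementation. Several assumptions are made for illustration. A "region" is taken to be the mask's bounding box expanded by a small margin. The optical flow is given as integer per-pixel displacements used to forward-warp the previous mask. Matching is plain dot-product attention with a softmax, computed only between query-region and memory-region positions. The names `warp_mask`, `region_positions`, and `local_to_local_read` are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Pos = Tuple[int, int]
Mask = List[List[int]]
FeatureMap = Dict[Pos, Sequence[float]]


def warp_mask(mask: Mask, flow: List[List[Tuple[int, int]]]) -> Mask:
    """Forward-warp a binary mask with integer per-pixel flow (dy, dx).

    Hypothetical stand-in for predicting the query region from the
    optical flow estimated between the previous and current frames.
    """
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if mask[y][x]:
                dy, dx = flow[y][x]
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w:
                    out[ny][nx] = 1
    return out


def region_positions(mask: Mask, margin: int = 1) -> List[Pos]:
    """Positions inside the mask's bounding box, expanded by `margin` (assumed region rule)."""
    coords = [(y, x) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    if not coords:
        return []
    h, w = len(mask), len(mask[0])
    y0 = max(min(y for y, _ in coords) - margin, 0)
    y1 = min(max(y for y, _ in coords) + margin, h - 1)
    x0 = max(min(x for _, x in coords) - margin, 0)
    x1 = min(max(x for _, x in coords) + margin, w - 1)
    return [(y, x) for y in range(y0, y1 + 1) for x in range(x0, x1 + 1)]


def local_to_local_read(
    query_keys: FeatureMap,
    mem_keys: FeatureMap,
    mem_values: FeatureMap,
    query_region: List[Pos],
    mem_region: List[Pos],
) -> Dict[Pos, List[float]]:
    """Softmax attention restricted to query-region x memory-region pairs.

    Positions outside the query region get no readout, and memory positions
    outside the memory region are never compared against, which is what
    suppresses distant look-alike objects and reduces the number of matches.
    """
    if not query_region or not mem_region:
        return {}
    dim = len(mem_values[mem_region[0]])
    out: Dict[Pos, List[float]] = {}
    for q in query_region:
        scores = [sum(a * b for a, b in zip(query_keys[q], mem_keys[m])) for m in mem_region]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out[q] = [
            sum(wt * mem_values[m][d] for wt, m in zip(weights, mem_region)) / total
            for d in range(dim)
        ]
    return out
```

On a 6x6 feature map, global-to-global matching against one memory frame compares all 36 x 36 = 1296 position pairs. With a 2x2 object and a margin of 1, each region covers 16 positions, so the local readout compares only 16 x 16 = 256 pairs. The saving grows with frame resolution and with the number of memory frames.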