Spatiotemporal Graph Neural Network based Mask Reconstruction for Video Object Segmentation

This paper addresses the task of segmenting class-agnostic objects in a semi-supervised setting. Although previous detection-based methods achieve relatively good performance, these approaches extract the best proposal with a greedy strategy, which may lose the local patch details outside the chosen candidate. In this paper, we propose a novel spatiotemporal graph neural network (STG-Net) to reconstruct more accurate masks for video object segmentation, capturing local contexts by utilizing all proposals. In the spatial graph, we treat the object proposals of a frame as nodes and represent their correlations with an edge-weight strategy for mask context aggregation. To capture temporal information from previous frames, we use a memory network to refine the mask of the current frame by retrieving historic masks in a temporal graph. The joint use of local patch details and temporal relationships allows us to better address challenges such as object occlusion and disappearance. Without online learning or fine-tuning, our STG-Net achieves state-of-the-art performance on four large benchmarks (DAVIS, YouTube-VOS, SegTrack-v2, and YouTube-Objects), demonstrating the effectiveness of the proposed approach.
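To make the spatial-graph idea concrete, here is a minimal illustrative sketch (not the paper's actual architecture) of aggregating all proposal masks of one frame over a fully connected graph. The function name, the cosine-similarity edge weights, and the 0.5 blending coefficient are all assumptions introduced for illustration; the paper's learned edge-weight strategy would replace them.

```python
import numpy as np

def aggregate_proposal_masks(masks, feats, tau=1.0):
    """Hypothetical sketch of spatial-graph mask aggregation.

    masks: (N, H, W) soft masks for N proposals of one frame.
    feats: (N, D) proposal feature embeddings.
    Each proposal is a graph node; edge weights come from feature
    similarity, and every node's mask is refined by a weighted sum
    of all other proposals' masks.
    """
    # Cosine similarity between proposal embeddings -> edge logits.
    f = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)
    sim = f @ f.T                      # (N, N) pairwise similarity
    np.fill_diagonal(sim, -np.inf)     # drop self-edges
    # Softmax over neighbours yields normalised edge weights.
    w = np.exp(sim / tau)
    w /= w.sum(axis=1, keepdims=True)
    # Aggregate neighbour masks along graph edges.
    agg = np.einsum('ij,jhw->ihw', w, masks)
    # Blend each proposal's own mask with its aggregated context
    # (fixed 0.5/0.5 mix here; a learned gate in a real model).
    return 0.5 * masks + 0.5 * agg
```

In this toy version the refined mask of each proposal keeps its own evidence while borrowing context from correlated proposals, which is the intuition behind using all proposals instead of a single greedily chosen one.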