8 months ago

Abstract

This paper strives for motion expressions guided video segmentation, whichfocuses on segmenting objects in video content based on a sentence describingthe motion of the objects. Existing referring video object datasets typicallyfocus on salient objects and use language expressions that contain excessivestatic attributes that could potentially enable the target object to beidentified in a single frame. These datasets downplay the importance of motionin video content for language-guided video object segmentation. To investigatethe feasibility of using motion expressions to ground and segment objects invideos, we propose a large-scale dataset called MeViS, which contains numerousmotion expressions to indicate target objects in complex environments. Webenchmarked 5 existing referring video object segmentation (RVOS) methods andconducted a comprehensive comparison on the MeViS dataset. The results showthat current RVOS methods cannot effectively address motion expression-guidedvideo segmentation. We further analyze the challenges and propose a baselineapproach for the proposed MeViS dataset. The goal of our benchmark is toprovide a platform that enables the development of effective language-guidedvideo segmentation algorithms that leverage motion expressions as a primary cuefor object segmentation in complex video scenes. The proposed MeViS dataset hasbeen released at https://henghuiding.github.io/MeViS.

Source PDF View Code