Multimodal Referring Segmentation: A Survey

Multimodal referring segmentation aims to segment target objects in visual scenes, such as images, videos, and 3D scenes, based on referring expressions in text or audio format. This task plays a crucial role in practical applications requiring accurate object perception based on user instructions. Over the past decade, it has gained significant attention in the multimodal community, driven by advances in convolutional neural networks, transformers, and large language models, all of which have substantially improved multimodal perception capabilities. This paper provides a comprehensive survey of multimodal referring segmentation. We begin by introducing this field's background, including problem definitions and commonly used datasets. Next, we summarize a unified meta architecture for referring segmentation and review representative methods across three primary visual scenes: images, videos, and 3D scenes. We further discuss Generalized Referring Expression (GREx) methods that address the challenges of real-world complexity, along with related tasks and practical applications. Extensive performance comparisons on standard benchmarks are also provided. We continually track related works at https://github.com/henghuiding/Awesome-Multimodal-Referring-Segmentation.