
EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything Model

Yuxuan Zhang, Tianheng Cheng, Rui Hu, Lei Liu, Heng Liu, Longjin Ran, Xiaoxin Chen, Wenyu Liu, Xinggang Wang
Abstract

Segment Anything Model (SAM) has attracted widespread attention for its superior interactive segmentation capabilities with visual prompts while lacking further exploration of text prompts. In this paper, we empirically investigate what text prompt encoders (e.g., CLIP or LLM) are good for adapting SAM for referring expression segmentation and introduce the Early Vision-language Fusion-based SAM (EVF-SAM). EVF-SAM is a simple yet effective referring segmentation method which exploits multimodal prompts (i.e., image and text) and comprises a pre-trained vision-language model to generate referring prompts and a SAM model for segmentation. Surprisingly, we observe that: (1) multimodal prompts and (2) vision-language models with early fusion (e.g., BEIT-3) are beneficial for prompting SAM for accurate referring segmentation. Our experiments show that the proposed EVF-SAM based on BEIT-3 can obtain state-of-the-art performance on RefCOCO/+/g for referring expression segmentation and demonstrate the superiority of prompting SAM with early vision-language fusion. In addition, the proposed EVF-SAM with 1.32B parameters achieves remarkably higher performance while reducing nearly 82% of parameters compared to previous SAM methods based on large multimodal models.
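The abstract describes a pipeline in which an early-fusion vision-language model (e.g., BEIT-3) jointly encodes the image and the referring text, and the fused feature is used as a prompt for SAM's mask decoder. The sketch below illustrates that flow only at a structural level; all module and argument names (`vlm_encoder`, `prompt_proj`, `sparse_prompt`, etc.) are hypothetical placeholders, not the authors' actual API.

```python
import torch.nn as nn


class EVFSAMSketch(nn.Module):
    """Minimal structural sketch of the pipeline described in the abstract:
    early vision-language fusion produces a referring prompt that conditions
    SAM's mask decoder. Names and signatures are illustrative assumptions."""

    def __init__(self, vlm_encoder, sam_image_encoder, sam_mask_decoder,
                 vlm_dim, sam_prompt_dim):
        super().__init__()
        self.vlm_encoder = vlm_encoder              # early-fusion VLM (image + text)
        self.sam_image_encoder = sam_image_encoder  # SAM image encoder (ViT)
        self.sam_mask_decoder = sam_mask_decoder    # SAM mask decoder
        # project the fused multimodal feature into SAM's prompt-embedding space
        self.prompt_proj = nn.Linear(vlm_dim, sam_prompt_dim)

    def forward(self, image, low_res_image, text_tokens):
        # 1) early fusion: image and text are encoded jointly by the VLM
        multimodal_feat = self.vlm_encoder(low_res_image, text_tokens)   # (B, vlm_dim)
        # 2) turn the fused feature into a prompt embedding for SAM
        prompt_embed = self.prompt_proj(multimodal_feat).unsqueeze(1)    # (B, 1, prompt_dim)
        # 3) standard SAM path: encode the image, decode a mask from the prompt
        image_embed = self.sam_image_encoder(image)                      # (B, C, H', W')
        masks = self.sam_mask_decoder(image_embed, sparse_prompt=prompt_embed)
        return masks
```

The key design point the abstract highlights is step 1: the text is not encoded in isolation (as with a plain CLIP text encoder) but fused with the image early, and that joint representation is what prompts SAM.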
