Abstract

Open-vocabulary image segmentation aims to partition an image into semantic regions according to arbitrary text descriptions. However, complex visual scenes can be naturally decomposed into simpler parts and abstracted at multiple levels of granularity, introducing inherent segmentation ambiguity. Unlike existing methods that typically sidestep this ambiguity and treat it as an external factor, our approach actively incorporates a hierarchical representation encompassing different semantic-levels into the learning process. We propose a decoupled text-image fusion mechanism and representation learning modules for both "things" and "stuff". Additionally, we systematically examine the differences that exist in the textual and visual features between these types of categories. Our resulting model, named HIPIE, tackles HIerarchical, oPen-vocabulary, and unIvErsal segmentation tasks within a unified framework. Benchmarked on over 40 datasets, e.g., ADE20K, COCO, Pascal-VOC Part, RefCOCO/RefCOCOg, ODinW and SeginW, HIPIE achieves the state-of-the-art results at various levels of image comprehension, including semantic-level (e.g., semantic segmentation), instance-level (e.g., panoptic/referring segmentation and object detection), as well as part-level (e.g., part/subpart segmentation) tasks. Our code is released at https://github.com/berkeley-hipie/HIPIE.
