PSALM: Pixelwise SegmentAtion with Large Multi-Modal Model

PSALM is a powerful extension of the Large Multi-modal Model (LMM) that addresses the challenges of segmentation tasks. To overcome the LMM's restriction to textual output, PSALM incorporates a mask decoder and a well-designed input schema to handle a variety of segmentation tasks. This schema includes images, task instructions, conditional prompts, and mask tokens, which enable the model to generate and classify segmentation masks effectively. The flexible design of PSALM supports joint training across multiple datasets and tasks, leading to improved performance and task generalization. PSALM achieves superior results on several benchmarks, such as RefCOCO/RefCOCO+/RefCOCOg, COCO Panoptic Segmentation, and COCO-Interactive, and further exhibits zero-shot capabilities on unseen tasks, such as open-vocabulary segmentation, generalized referring expression segmentation, and video object segmentation, making a significant step towards a GPT moment in computer vision. Through extensive experiments, PSALM demonstrates its potential to transform the domain of image segmentation, leveraging the robust visual understanding capabilities of LMMs as seen in natural language processing. Code and models are available at https://github.com/zamling/PSALM.
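The input schema described above can be sketched as follows. This is a minimal, illustrative mock-up assuming the four segments (image tokens, task instruction, conditional prompt, mask tokens) are simply concatenated into one sequence, with the mask-token positions recorded so a mask decoder knows which output embeddings to read; all function and token names here are hypothetical, not the repository's actual API.

```python
# Hypothetical sketch of PSALM's input schema. The segment ordering and
# token names are assumptions for illustration, not the real implementation.

def compose_psalm_input(image_tokens, instruction_tokens,
                        condition_tokens, num_mask_tokens):
    """Concatenate the four schema segments into one input sequence.

    Returns the sequence plus the index span of the mask tokens, so the
    mask decoder can pick out the embeddings it should turn into masks.
    """
    mask_tokens = [f"<MASK_{i}>" for i in range(num_mask_tokens)]
    sequence = (image_tokens + instruction_tokens
                + condition_tokens + mask_tokens)
    mask_span = (len(sequence) - num_mask_tokens, len(sequence))
    return sequence, mask_span

# Example: a referring-expression query with 3 mask proposals.
seq, span = compose_psalm_input(
    image_tokens=["<IMG_0>", "<IMG_1>"],
    instruction_tokens=["segment", "the", "referred", "object"],
    condition_tokens=["the", "red", "car"],
    num_mask_tokens=3,
)
```

Varying the conditional-prompt segment (a phrase for referring segmentation, a click or box for interactive segmentation, empty for panoptic) is what lets one schema cover the different tasks listed above.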