Hi-SAM: Marrying Segment Anything Model for Hierarchical Text Segmentation

The Segment Anything Model (SAM), a powerful vision foundation model pretrained on a large-scale dataset, breaks the boundaries of general segmentation and has sparked various downstream applications. This paper introduces Hi-SAM, a unified model leveraging SAM for hierarchical text segmentation. Hi-SAM excels in segmentation across four hierarchies (pixel-level text, word, text-line, and paragraph) while also performing layout analysis. Specifically, we first turn SAM into a high-quality pixel-level text segmentation (TS) model through a parameter-efficient fine-tuning approach. We use this TS model to iteratively generate pixel-level text labels in a semi-automatic manner, unifying labels across the four text hierarchies in the HierText dataset. Subsequently, with these complete labels, we build the end-to-end trainable Hi-SAM on the TS architecture with a customized hierarchical mask decoder. During inference, Hi-SAM offers both an automatic mask generation (AMG) mode and a promptable segmentation (PS) mode. In the AMG mode, Hi-SAM first segments pixel-level text foreground masks, then samples foreground points for hierarchical text mask generation, and achieves layout analysis in passing. In the PS mode, Hi-SAM provides word, text-line, and paragraph masks with a single point click. Experimental results show the state-of-the-art performance of our TS model: 84.86% fgIOU on Total-Text and 88.96% fgIOU on TextSeg for pixel-level text segmentation. Moreover, compared to the previous specialist for joint hierarchical detection and layout analysis on HierText, Hi-SAM achieves significant improvements: 4.73% PQ and 5.39% F1 on the text-line level, and 5.49% PQ and 7.39% F1 on paragraph-level layout analysis, while requiring $20\times$ fewer training epochs. The code is available at https://github.com/ymy-k/Hi-SAM.
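
To make two ideas from the abstract concrete, the following is a minimal sketch, not the released Hi-SAM code. It illustrates (1) sampling point prompts from a pixel-level text foreground mask, as the AMG mode is described as doing, and (2) a foreground intersection-over-union score of the kind reported as fgIOU. The function names `sample_foreground_points` and `foreground_iou`, the uniform random sampling strategy, the nested-list mask representation, and the convention of returning 1.0 when both masks are empty are all assumptions made for illustration.

```python
import random


def sample_foreground_points(mask, num_points, seed=0):
    """Return up to num_points (x, y) coordinates drawn from foreground pixels.

    mask is a list of rows holding 0/1 (or False/True) values.
    Uniform random sampling is an assumption; the actual model may
    use a different sampling scheme.
    """
    foreground = [
        (x, y)
        for y, row in enumerate(mask)
        for x, value in enumerate(row)
        if value
    ]
    rng = random.Random(seed)
    count = min(num_points, len(foreground))
    return rng.sample(foreground, count)


def foreground_iou(pred, gt):
    """Intersection over union of the foreground pixels of two binary masks."""
    intersection = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            intersection += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return intersection / union if union else 1.0


if __name__ == "__main__":
    # Toy 4x6 mask with a 2x3 block of "text" pixels.
    gt_mask = [
        [0, 0, 0, 0, 0, 0],
        [0, 1, 1, 1, 0, 0],
        [0, 1, 1, 1, 0, 0],
        [0, 0, 0, 0, 0, 0],
    ]
    points = sample_foreground_points(gt_mask, num_points=3)
    print(points)  # three (x, y) prompts, all inside the text block
    print(foreground_iou(gt_mask, gt_mask))
```

In a pipeline like the one the abstract outlines, each sampled point would then serve as a prompt for generating word, text-line, and paragraph masks; that decoding step depends on the model itself and is not sketched here.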