SegViTv2: Exploring Efficient and Continual Semantic Segmentation with Plain Vision Transformers

This paper investigates the capability of plain Vision Transformers (ViTs) for semantic segmentation using the encoder-decoder framework and introduces \textbf{SegViTv2}. In this study, we introduce a novel Attention-to-Mask (\atm) module to design a lightweight decoder that is effective for plain ViTs. The proposed ATM converts the global attention map into semantic masks, yielding high-quality segmentation results. Our decoder outperforms the popular UPerNet decoder with various ViT backbones while consuming only about $5\%$ of its computational cost. For the encoder, we address the relatively high computational cost of ViT-based encoders and propose a \emph{Shrunk++} structure that incorporates edge-aware query-based down-sampling (EQD) and query-based upsampling (QU) modules. The Shrunk++ structure reduces the computational cost of the encoder by up to $50\%$ while maintaining competitive performance. Furthermore, we adapt SegViT for continual semantic segmentation, demonstrating nearly zero forgetting of previously learned knowledge. Experiments show that the proposed SegViTv2 surpasses recent segmentation methods on three popular benchmarks: ADE20k, COCO-Stuff-10k, and PASCAL-Context. The code is available at the following link: \url{https://github.com/zbwxp/SegVit}.
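To make the central mechanism concrete, the following is a minimal, self-contained sketch of the Attention-to-Mask idea, not the authors' implementation: learnable class queries cross-attend to the ViT patch tokens, and the same query-key similarity map that drives the attention is passed through a sigmoid to produce per-class masks. Names such as `AttentionToMask` and `class_queries` are illustrative assumptions, and the single-head formulation is a simplification.

```python
import torch
import torch.nn as nn

class AttentionToMask(nn.Module):
    """Illustrative sketch: reuse the query-key similarity map of a
    cross-attention layer as per-class segmentation masks (via sigmoid)."""

    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.scale = dim ** -0.5
        # One learnable query per semantic class.
        self.class_queries = nn.Parameter(torch.randn(num_classes, dim))

    def forward(self, tokens: torch.Tensor):
        # tokens: (B, N, dim) patch tokens from a plain ViT encoder.
        B = tokens.shape[0]
        q = self.q(self.class_queries).unsqueeze(0).expand(B, -1, -1)  # (B, C, dim)
        k = self.k(tokens)                                             # (B, N, dim)
        v = self.v(tokens)                                             # (B, N, dim)

        sim = q @ k.transpose(-2, -1) * self.scale   # (B, C, N) similarity map
        attn = sim.softmax(dim=-1)                   # attention over patches
        updated_queries = attn @ v                   # (B, C, dim) refined class queries
        masks = sim.sigmoid()                        # (B, C, N) per-class masks

        return updated_queries, masks
```

In this sketch the masks come essentially for free from the attention computation; reshaping the `(B, C, N)` mask tensor back to the patch grid and upsampling it gives dense per-class predictions.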