Prediction Calibration for Generalized Few-shot Semantic Segmentation

Generalized Few-shot Semantic Segmentation (GFSS) aims to segment each image pixel into either base classes with abundant training examples or novel classes with only a handful of (e.g., 1-5) training images per class. Compared to the widely studied Few-shot Semantic Segmentation (FSS), which is limited to segmenting novel classes only, GFSS is much under-studied despite being more practical. The existing approach to GFSS is based on classifier parameter fusion, whereby a newly trained novel class classifier and a pre-trained base class classifier are combined to form a new classifier. As the training data is dominated by base classes, this approach is inevitably biased towards the base classes. In this work, we propose a novel Prediction Calibration Network (PCN) to address this problem. Instead of fusing the classifier parameters, we fuse the scores produced separately by the base and novel classifiers. To ensure that the fused scores are not biased towards either the base or novel classes, a new Transformer-based calibration module is introduced. It is known that lower-level features are more useful for detecting edge information in an input image than higher-level features. We thus build a cross-attention module that guides the classifier's final prediction using the fused multi-level features. However, Transformers are computationally demanding. Crucially, to make training the proposed cross-attention module tractable at the pixel level, the module is designed based on feature-score cross-covariance and episodically trained to be generalizable at inference time. Extensive experiments on PASCAL-$5^{i}$ and COCO-$20^{i}$ show that our PCN outperforms the state-of-the-art alternatives by large margins.
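
To make the cross-covariance idea concrete, the PyTorch sketch below illustrates one plausible form of feature-score cross-attention: the attention map is a (class-channel x feature-channel) cross-covariance matrix, so the cost grows linearly with the number of pixels rather than quadratically. The class name FeatureScoreXCA, the single-head design, the projection layers, and the residual connection are illustrative assumptions, not the paper's exact architecture.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureScoreXCA(nn.Module):
    # A minimal sketch (assumed design, not the authors' exact module):
    # cross-covariance attention between fused class scores and fused
    # multi-level features, computed over channels instead of pixels.
    def __init__(self, num_classes: int, feat_dim: int):
        super().__init__()
        self.q_proj = nn.Linear(num_classes, num_classes)  # queries from fused scores
        self.k_proj = nn.Linear(feat_dim, feat_dim)        # keys from fused features
        self.v_proj = nn.Linear(feat_dim, feat_dim)        # values from fused features
        self.temperature = nn.Parameter(torch.ones(1))     # learnable softmax scale

    def forward(self, scores: torch.Tensor, feats: torch.Tensor) -> torch.Tensor:
        # scores: (B, C, H, W) fused base/novel class scores
        # feats:  (B, D, H, W) fused multi-level features
        B, C, H, W = scores.shape
        s = scores.flatten(2).transpose(1, 2)   # (B, N, C), N = H*W pixel tokens
        f = feats.flatten(2).transpose(1, 2)    # (B, N, D)
        q = F.normalize(self.q_proj(s), dim=1)  # L2-normalise along the pixel axis
        k = F.normalize(self.k_proj(f), dim=1)
        v = self.v_proj(f)                      # (B, N, D)
        # (C x D) cross-covariance between score and feature channels;
        # cost is O(N*C*D), i.e. linear in the number of pixels N.
        attn = (q.transpose(1, 2) @ k) * self.temperature  # (B, C, D)
        attn = attn.softmax(dim=-1)
        out = v @ attn.transpose(1, 2)          # (B, N, C) per-pixel calibration offsets
        out = out.transpose(1, 2).reshape(B, C, H, W)
        return scores + out                     # residual calibration of the fused scores

Because the softmax is taken over a C x D channel matrix rather than an N x N pixel matrix, the module stays tractable even for dense, pixel-level training, which is consistent with the tractability claim above.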