TEC (ViT-B, UPerNet) | - | - | 51.0 | Towards Sustainable Self-supervised Learning | |
M3I Pre-training (InternImage-H) | - | 1310 | 62.9 | Towards All-in-one Pre-training via Maximizing Multi-modal Mutual Information | |
SwinV2-G-HTC++ (Liu et al., 2021a) | - | - | 53.7 | Swin Transformer V2: Scaling Up Capacity and Resolution | |
Sequential Ensemble (DeepLabv3+) | - | - | 46.8 | Sequential Ensembling for Semantic Segmentation | - |
ACNet (ResNet-101) | - | - | 45.90 | Adaptive Context Network for Scene Parsing | - |
SeMask (SeMask Swin-L MSFaPN-Mask2Former) | - | - | 58.2 | SeMask: Semantically Masked Transformers for Semantic Segmentation | |