| Method | mIoU | Paper | |
| --- | --- | --- | --- |
| SeMask (SeMask Swin-L FaPN-Mask2Former) | 58.2 | SeMask: Semantically Masked Transformers for Semantic Segmentation | |
| BEiT-L (ViT+UperNet, ImageNet-22k pretrain) | 57.0 | BEiT: BERT Pre-Training of Image Transformers | |
| Swin-L (UperNet, ImageNet-22k pretrain) | 53.5 | Swin Transformer: Hierarchical Vision Transformer using Shifted Windows | |
| SeMask (SeMask Swin-L MSFaPN-Mask2Former, single-scale) | 57.0 | SeMask: Semantically Masked Transformers for Semantic Segmentation | |
| Mask2Former (Swin-L-FaPN, multi-scale) | 57.7 | Masked-attention Mask Transformer for Universal Image Segmentation | |
| Twins-SVT-L (UperNet, ImageNet-1k pretrain) | 50.2 | Twins: Revisiting the Design of Spatial Attention in Vision Transformers | |
| OneFormer (InternImage-H, emb_dim=256, multi-scale, 896x896) | 60.8 | OneFormer: One Transformer to Rule Universal Image Segmentation | |
| ViT-Adapter-L (UperNet, BEiT pretrain) | 58.4 | Vision Transformer Adapter for Dense Predictions | |
| Light-Ham (VAN-Large, 46M, IN-1k, MS) | 51.0 | Is Attention Better Than Matrix Decomposition? | |