Mask2Former (Swin-L) | 34.2 | 48.1 | 54.5 | Masked-attention Mask Transformer for Universal Image Segmentation | |
OneFormer (ConvNeXt-L, single-scale, 640x640) | 36.2 | 50.0 | 56.6 | OneFormer: One Transformer to Rule Universal Image Segmentation | |
OpenSeeD (Swin-L, single-scale, 1280x1280) | - | 53.7 | - | A Simple Framework for Open-Vocabulary Segmentation and Detection | |
OneFormer (DiNAT-L, single-scale, 640x640) | 36.0 | 50.5 | 58.3 | OneFormer: One Transformer to Rule Universal Image Segmentation | |
X-Decoder (DaViT-d5, Deform, single-scale, 1280x1280) | 38.7 | 52.4 | 59.1 | Generalized Decoding for Pixel, Image, and Language | - |
DiNAT-L (Mask2Former, 640x640) | 35.0 | 49.4 | 56.3 | Dilated Neighborhood Attention Transformer | |
X-Decoder (L) | 35.8 | 49.6 | 58.1 | Generalized Decoding for Pixel, Image, and Language | - |
Mask2Former (ResNet-50, 640x640) | 26.5 | 39.7 | 46.1 | Masked-attention Mask Transformer for Universal Image Segmentation | |
kMaX-DeepLab (ResNet-50, single-scale, 1281x1281) | - | 42.3 | 45.3 | kMaX-DeepLab: k-means Mask Transformer | |
kMaX-DeepLab (ConvNeXt-L, single-scale, 1281x1281) | - | 50.9 | 55.2 | kMaX-DeepLab: k-means Mask Transformer | |
Mask2Former (Swin-L + FAPN, 640x640) | 33.2 | 46.2 | 55.4 | Masked-attention Mask Transformer for Universal Image Segmentation | |
kMaX-DeepLab (ConvNeXt-L, single-scale, 641x641) | - | 48.7 | 54.8 | kMaX-DeepLab: k-means Mask Transformer | |
OneFormer (InternImage-H, emb_dim=256, single-scale, 896x896) | 40.2 | 54.5 | 60.4 | OneFormer: One Transformer to Rule Universal Image Segmentation | |
OneFormer (DiNAT-L, single-scale, 1280x1280, COCO-Pretrain) | - | 53.4 | 58.9 | OneFormer: One Transformer to Rule Universal Image Segmentation | |
MaskFormer (R101 + 6 Enc) | - | 35.7 | - | Per-Pixel Classification is Not All You Need for Semantic Segmentation | |
OneFormer (ConvNeXt-XL, single-scale, 640x640) | 36.3 | 50.1 | 57.4 | OneFormer: One Transformer to Rule Universal Image Segmentation | |
OneFormer (DiNAT-L, single-scale, 1280x1280) | 37.1 | 51.5 | 58.3 | OneFormer: One Transformer to Rule Universal Image Segmentation | |
OneFormer (Swin-L, single-scale, 1280x1280) | 37.8 | 51.4 | 57.0 | OneFormer: One Transformer to Rule Universal Image Segmentation | |
kMaX-DeepLab (ResNet-50, single-scale, 641x641) | - | 41.5 | 45.0 | kMaX-DeepLab: k-means Mask Transformer | |