Mask2Former (Swin-L, single-scale) | 34.9 | 54.7 | 40.0 | 16.3 | Masked-attention Mask Transformer for Universal Image Segmentation | |
DiNAT-L (Mask2Former, single-scale) | 35.4 | 55.5 | 39.0 | 16.3 | Dilated Neighborhood Attention Transformer | |
X-Decoder (DaViT-d5, Deform, single-scale, 1280x1280) | 38.7 | 59.6 | 43.3 | 18.9 | Generalized Decoding for Pixel, Image, and Language | - |
OneFormer (DiNAT-L, single-scale) | 36.0 | - | - | - | OneFormer: One Transformer to Rule Universal Image Segmentation | |
OneFormer (Swin-L, single-scale) | 35.9 | - | - | - | OneFormer: One Transformer to Rule Universal Image Segmentation | |
OneFormer (DiNAT-L, single-scale, 1280x1280, COCO-pretrained) | 40.2 | 59.7 | 44.4 | 19.2 | OneFormer: One Transformer to Rule Universal Image Segmentation | |
OneFormer (InternImage-H, emb_dim=1024, single-scale, 896x896, COCO-pretrained) | 44.2 | 64.3 | 49.9 | 23.7 | OneFormer: One Transformer to Rule Universal Image Segmentation | |
Mask2Former (Swin-L + FaPN) | 33.4 | 54.6 | 37.6 | 14.6 | Masked-attention Mask Transformer for Universal Image Segmentation | |