Mask2Former (Swin-L, single-scale) | 43.7 | Masked-attention Mask Transformer for Universal Image Segmentation | |
OneFormer (ConvNeXt-L, single-scale, Mapillary-pretrained) | 48.7 | OneFormer: One Transformer to Rule Universal Image Segmentation | |
DiNAT-L (single-scale, Mask2Former) | 45.1 | Dilated Neighborhood Attention Transformer | |
OpenSeeD (Swin-L, single-scale) | 49.3 | A Simple Framework for Open-Vocabulary Segmentation and Detection | |
AFF-Base (single-scale, point-based Mask2Former) | 46.2 | AutoFocusFormer: Image Segmentation off the Grid | |
OneFormer (Swin-L, single-scale) | 45.6 | OneFormer: One Transformer to Rule Universal Image Segmentation | |
AFF-Small (single-scale, point-based Mask2Former) | 44.0 | AutoFocusFormer: Image Segmentation off the Grid | |
OneFormer (DiNAT-L, single-scale) | 45.6 | OneFormer: One Transformer to Rule Universal Image Segmentation | |