AFF-Base (single-scale, point-based Mask2Former) | 46.2 | 67.7 | 71.5 | 62.5 | 83.0 | AutoFocusFormer: Image Segmentation off the Grid | |
AdaptIS (ResNeXt-101) | 36.3 | 62.0 | 64.4 | 58.7 | 79.2 | AdaptIS: Adaptive Instance Selection Network | |
AUNet (ResNet-101-FPN) | 34.4 | 59.0 | 62.1 | 54.8 | 75.6 | Attention-guided Unified Network for Panoptic Segmentation | - |
DiNAT-L (Mask2Former) | 44.5 | 67.2 | - | - | 83.4 | Dilated Neighborhood Attention Transformer | |
TASCNet (ResNet-50, multi-scale) | 39.0 | 60.4 | 63.3 | 56.1 | 78.0 | Learning to Fuse Things and Stuff | - |
Panoptic-DeepLab (X71) | 38.5 | 64.1 | - | - | 81.5 | Panoptic-DeepLab: A Simple, Strong, and Fast Baseline for Bottom-Up Panoptic Segmentation | |
AdaptIS (ResNet-101) | 33.9 | 60.6 | 62.9 | 57.5 | 77.2 | AdaptIS: Adaptive Instance Selection Network | |
Panoptic FCN* (ResNet-50-FPN) | - | - | 66.6 | - | - | Fully Convolutional Networks for Panoptic Segmentation | |
CMT-DeepLab (MaX-S, single-scale, IN-1K) | - | 64.6 | - | - | 81.4 | CMT-DeepLab: Clustering Mask Transformers for Panoptic Segmentation | |
OneFormer (ConvNeXt-XL, single-scale) | 46.7 | 68.4 | - | - | 83.6 | OneFormer: One Transformer to Rule Universal Image Segmentation | |
Dynamically Instantiated Network (ResNet-101) | 28.6 | 53.8 | 62.1 | 42.5 | 79.8 | Weakly- and Semi-Supervised Panoptic Segmentation | |
DeeperLab (Xception-71) | - | 56.5 | - | - | - | DeeperLab: Single-Shot Image Parser | - |
Axial-DeepLab-XL (Mapillary Vistas, multi-scale) | 44.2 | 68.5 | - | - | 84.6 | Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation | |
COPS (ResNet-50) | 34.1 | 62.1 | 67.2 | 55.1 | 79.3 | Combinatorial Optimization for Panoptic Segmentation: A Fully Differentiable Approach | |
Panoptic FCN* (Swin-L, Cityscapes-fine) | - | - | 70.6 | 59.5 | - | Fully Convolutional Networks for Panoptic Segmentation | |
AFF-Small (single-scale, point-based Mask2Former) | 44.2 | 66.9 | 70.8 | 61.5 | 82.2 | AutoFocusFormer: Image Segmentation off the Grid | |
Panoptic FPN (ResNet-101) | 33.0 | 58.1 | 62.5 | 52.0 | 75.7 | Panoptic Feature Pyramid Networks | |
OneFormer (Swin-L, single-scale) | 45.6 | 67.2 | - | - | 83.0 | OneFormer: One Transformer to Rule Universal Image Segmentation | |
TASCNet (ResNet-50) | 37.6 | 59.2 | 61.5 | 56.0 | 77.8 | Learning to Fuse Things and Stuff | - |
OneFormer (DiNAT-L, single-scale) | 45.6 | 67.6 | - | - | 83.1 | OneFormer: One Transformer to Rule Universal Image Segmentation | |