Model | AP | PQ | Paper | Code |
--- | --- | --- | --- | --- |
FocalNet-L (Mask2Former, 200 queries) | 48.4 | 57.9 | Focal Modulation Networks | |
OpenSeeD (Swin-L, single-scale) | 53.2 | 59.5 | A Simple Framework for Open-Vocabulary Segmentation and Detection | |
Visual Attention Network (VAN-B6 + Mask2Former) | - | 58.2 | Visual Attention Network | |
Panoptic FCN* (ResNet-50-FPN) | - | 44.3 | Fully Convolutional Networks for Panoptic Segmentation | |
MaskFormer (single-scale) | - | 52.7 | Per-Pixel Classification is Not All You Need for Semantic Segmentation | |
kMaX-DeepLab (single-scale, drop query with 256 queries) | - | 58.0 | kMaX-DeepLab: k-means Mask Transformer | |
DETR-R101 (ResNet-101) | 33.0 | 45.1 | End-to-End Object Detection with Transformers | |
Axial-DeepLab-L (multi-scale) | - | 43.9 | Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation | |
PanopticFPN++ | 39.7 | 44.1 | End-to-End Object Detection with Transformers | |
kMaX-DeepLab (single-scale, pseudo-labels) | - | 58.1 | kMaX-DeepLab: k-means Mask Transformer | |
Mask DINO (Swin-L, single-scale) | 50.9 | 59.4 | Mask DINO: Towards A Unified Transformer-based Framework for Object Detection and Segmentation | |
OneFormer (InternImage-H, single-scale) | 52.0 | 60.0 | OneFormer: One Transformer to Rule Universal Image Segmentation | |
kMaX-DeepLab (single-scale) | - | 57.9 | kMaX-DeepLab: k-means Mask Transformer | |
DiNAT-L (single-scale, Mask2Former) | 49.2 | 58.5 | Dilated Neighborhood Attention Transformer | |
OneFormer (Swin-L, single-scale) | 49.0 | 57.9 | OneFormer: One Transformer to Rule Universal Image Segmentation | |
HyperSeg (Swin-B) | - | 61.2 | HyperSeg: Towards Universal Visual Segmentation with Large Language Model | |
Axial-DeepLab-L (single-scale) | - | 43.4 | Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation | |
OneFormer (DiNAT-L, single-scale) | 49.2 | 58.0 | OneFormer: One Transformer to Rule Universal Image Segmentation | |
ViT-Adapter-L (single-scale, BEiTv2 pretrain, Mask2Former) | 48.9 | 58.4 | Vision Transformer Adapter for Dense Predictions | |
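
For reference, the PQ column reports panoptic quality as defined in Kirillov et al., "Panoptic Segmentation" (CVPR 2019): predicted and ground-truth segments are matched one-to-one, with a pair counting as a true positive when its IoU exceeds 0.5, giving

\[
\mathrm{PQ}
\;=\; \frac{\sum_{(p,g)\in \mathit{TP}} \mathrm{IoU}(p,g)}{|\mathit{TP}| + \tfrac{1}{2}|\mathit{FP}| + \tfrac{1}{2}|\mathit{FN}|}
\;=\; \underbrace{\frac{\sum_{(p,g)\in \mathit{TP}} \mathrm{IoU}(p,g)}{|\mathit{TP}|}}_{\text{SQ (segmentation quality)}}
\;\times\;
\underbrace{\frac{|\mathit{TP}|}{|\mathit{TP}| + \tfrac{1}{2}|\mathit{FP}| + \tfrac{1}{2}|\mathit{FN}|}}_{\text{RQ (recognition quality)}}
\]

The AP column is taken here to be the mask AP each paper reports alongside PQ; a "-" indicates the metric is not reported for that entry.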