CenterMask + VoVNetV2-99 (single-scale) | 62.3 | 44.1 | 57.0 | 42.8 | 20.1 | 40.6 | CenterMask : Real-Time Anchor-Free Instance Segmentation | |
EmbedMask(R-101-FPN) | 59.1 | 40.3 | - | 40.4 | 17.9 | 37.7 | EmbedMask: Embedding Coupling for One-stage Instance Segmentation | |
VirTex Mask R-CNN (ResNet-50-FPN) | 58.4 | 39.7 | - | - | - | 36.9 | VirTex: Learning Visual Representations from Textual Annotations | |
DiffusionInst-ResNet101 | - | - | - | - | - | 41.5 | DiffusionInst: Diffusion Model for Instance Segmentation | |
Cascade Mask R-CNN (ResNeXt152, CBNet) | - | - | - | - | - | 43.3 | CBNet: A Novel Composite Backbone Network Architecture for Object Detection | |
ViT-Adapter-L (HTC++, BEiTv2 pretrain, multi-scale) | - | - | - | - | - | 53.0 | Vision Transformer Adapter for Dense Predictions | |
PolarMask (ResNet-101-FPN) | 51.9 | 31 | 42.8 | 32.4 | 13.4 | 30.4 | PolarMask: Single Shot Instance Segmentation with Polar Representation | |
Co-DETR | 80.2 | 63.4 | 72.0 | 60.1 | 41.6 | 57.1 | DETRs with Collaborative Hybrid Assignments Training | |
MogaNet-B (Cascade Mask R-CNN) | - | - | - | - | - | 46 | MogaNet: Multi-order Gated Aggregation Network | |
DetectoRS (ResNeXt-101-32x4d, multi-scale) | 71.1 | 51.6 | 59.6 | 49.5 | 30.3 | 47.1 | DetectoRS: Detecting Objects with Recursive Feature Pyramid and Switchable Atrous Convolution | |
GCNet (ResNeXt-101 + DCN + cascade + GC r16) | - | - | - | - | - | 41.5 | GCNet: Non-local Networks Meet Squeeze-Excitation Networks and Beyond | |
SOLQ (ResNet50, single scale) | - | - | - | - | - | 39.7 | SOLQ: Segmenting Objects by Learning Queries | |