Mask R-CNN (FPN, X-volution, SA) | 53.1 | 40 | 19.2 | 37.2 | X-volution: On the unification of convolution and self-attention | - |
MViTv2-L (Cascade Mask R-CNN, multi-scale, IN21k pre-train) | - | - | - | 50.5 | MViTv2: Improved Multiscale Vision Transformers for Classification and Detection | |
BoTNet 50 (72 epochs) | - | - | - | 40.7 | Bottleneck Transformers for Visual Recognition | |
CBNetV2 (Dual-Swin-L HTC, multi-scale) | - | - | - | 51.8 | CBNet: A Composite Backbone Network Architecture for Object Detection | |
Faster R-CNN (Res2Net-50) | 53.7 | 37.9 | 15.7 | 35.6 | Res2Net: A New Multi-scale Backbone Architecture | |
Swin-L (HTC++, multi-scale) | - | - | - | 50.4 | Swin Transformer: Hierarchical Vision Transformer using Shifted Windows | |
ResNeSt-200 (multi-scale) | - | - | - | 46.25 | ResNeSt: Split-Attention Networks | |
ViT-Adapter-L (HTC++, BEiT pretrain, multi-scale) | - | - | - | 52.2 | Vision Transformer Adapter for Dense Predictions | |
ELSA-S (Cascade Mask R-CNN) | - | - | - | 44.4 | ELSA: Enhanced Local Self-Attention for Vision Transformer | |
Mask R-CNN (ResNet-50-FPN, GRoIE) | 48.7 | 39 | 19.1 | 35.8 | A novel Region of Interest Extraction Layer for Instance Segmentation | |
CenterNet2 (Swin-L w/ X-Paste + Copy-Paste) | - | - | - | 48.8 | X-Paste: Revisiting Scalable Copy-Paste for Instance Segmentation using CLIP and StableDiffusion | |
MViT-L (Mask R-CNN, single-scale) | - | - | - | 46.2 | MViTv2: Improved Multiscale Vision Transformers for Classification and Detection | |