ResNeSt-200 (multi-scale) | 71.00 | 57.07 | 66.29 | 56.36 | 36.80 | 52.47 | ResNeSt: Split-Attention Networks | |
ExtremeNet (Hourglass-104, single-scale) | 55.1 | 43.7 | 56.1 | 44.0 | 21.6 | 40.3 | Bottom-up Object Detection by Grouping Extreme and Center Points | |
Mask R-CNN (ResNeXt-101-FPN) | 59.5 | 38.9 | - | - | - | 36.7 | Mask R-CNN | |
FPN+ | 61.3 | 43.3 | 52.6 | 43.3 | 22.9 | 39.8 | Feature Pyramid Networks for Object Detection | |
Hiera-L | - | - | - | - | - | 55 | Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles | |
MOAT-2 (IN-22K pretraining, single-scale) | - | - | - | - | - | 58.5 | MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models | |
Focal-Stable-DINO (Focal-Huge, no TTA) | 81.5 | 71.4 | 78.5 | 68.5 | 50.4 | 64.6 | A Strong and Reproducible Object Detector with Only Public Datasets | |
BoTNet 152 (Mask R-CNN, single-scale, 72 epochs) | 71 | 54.2 | - | - | - | 49.5 | Bottleneck Transformers for Visual Recognition | |
ResNeSt-200-DCN (single-scale) | 69.53 | 55.40 | 65.83 | 54.66 | 32.67 | 50.91 | ResNeSt: Split-Attention Networks | |
Cascade R-CNN (ResNet-101-FPN+, cascade) | 61.6 | 46.6 | 57.4 | 46.2 | 23.8 | 42.7 | Cascade R-CNN: Delving into High Quality Object Detection | |
Mask R-CNN (ResNet-101 + 1 NL) | 63.1 | 44.5 | - | - | - | 40.8 | Non-local Neural Networks | |
DETR-DC5 (ResNet-101) | 64.7 | 47.7 | 62.3 | 49.5 | 23.7 | 44.9 | End-to-End Object Detection with Transformers | |
XCiT-S24/8 | - | - | - | - | - | 48.1 | XCiT: Cross-Covariance Image Transformers | |
DETR-ResNet50 with iRPE-K (150 epochs) | - | - | - | - | - | 40.8 | Rethinking and Improving Relative Position Encoding for Vision Transformer | |
SwinV2-G (HTC++) | - | - | - | - | - | 62.5 | Swin Transformer V2: Scaling Up Capacity and Resolution | |
Mask R-CNN (ResNet-101, DCNv2) | - | - | - | - | - | 43.1 | Deformable ConvNets v2: More Deformable, Better Results | |
GCNet (ResNet-50-FPN, GRoIE) | 62.4 | 44 | 52.5 | 44.4 | 24.2 | 40.3 | GCNet: Non-local Networks Meet Squeeze-Excitation Networks and Beyond | |
RPDet (ResNeXt-101-DCN, multi-scale) | - | - | - | - | - | 46.8 | RepPoints: Point Set Representation for Object Detection | |
FoveaBox+aLRP Loss (ResNet-50, 500 scale) | 58.8 | 41.5 | - | - | - | 39.7 | A Ranking-based, Balanced Loss Function Unifying Classification and Localisation in Object Detection | |
Faster R-CNN+aLRP Loss (ResNet-50, 500 scale) | 60.7 | 43.3 | - | - | - | 40.7 | A Ranking-based, Balanced Loss Function Unifying Classification and Localisation in Object Detection | |