CrossVIS (ResNet-101) | 57.3 | 39.7 | 36 | 42 | 36.6 | Crossover Learning for Fast Online Video Instance Segmentation | |
IDOL (ResNet-50) | 74 | 52.9 | 47.7 | 58.7 | 49.5 | In Defense of Online Models for Video Instance Segmentation | |
VisTR (ResNet-50) | 59.8 | 36.9 | 37.2 | 42.4 | 36.2 | End-to-End Video Instance Segmentation with Transformers | |
VisTR (ResNet-101) | 64.0 | 45.0 | 38.3 | 44.9 | 40.1 | End-to-End Video Instance Segmentation with Transformers | |
SipMask (ResNet-50, single-scale test) | 53 | 33.3 | 33.5 | 38.9 | 32.5 | SipMask: Spatial Information Preservation for Fast Image and Video Instance Segmentation | |
SeqFormer (ResNet-101) | 71.1 | 55.7 | 46.8 | 56.9 | 49.0 | SeqFormer: Sequential Transformer for Video Instance Segmentation | |
STEm-Seg (ResNet-101) | 55.8 | 37.9 | 34.4 | 41.6 | 34.6 | STEm-Seg: Spatio-temporal Embeddings for Instance Segmentation in Videos | |
SeqFormer (ResNet-50) | 69.8 | 51.8 | 45.5 | 54.8 | 47.4 | SeqFormer: Sequential Transformer for Video Instance Segmentation | |
InstanceFormer (Swin-L) | 78.0 | 64.2 | 50.9 | 61.6 | 56.3 | InstanceFormer: An Online Video Instance Segmentation Framework | |
MaskTrack R-CNN (ResNet-50, single-scale training and test) | 51.1 | 32.6 | 31 | 35.5 | 30.3 | Video Instance Segmentation | |