Video Instance Segmentation on YouTube-VIS 1
Metrics
AP50
AP75
AR1
AR10
mask AP
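The metric names above appear to follow the COCO-style convention used by the standard YouTube-VIS evaluation: AP50 and AP75 would be average precision at mask IoU thresholds of 0.5 and 0.75, AR1 and AR10 average recall with at most 1 or 10 predictions per video, and mask AP the average over IoU thresholds from 0.5 to 0.95. Under that assumption, the IoU between a predicted and a ground-truth instance is accumulated over all frames of the video rather than computed per frame. The sketch below illustrates that spatio-temporal IoU; the function name and the set-of-pixels mask representation are hypothetical choices for illustration, not part of any official toolkit.

```python
def video_mask_iou(pred_frames, gt_frames):
    """IoU of two instance tracks, accumulated over all frames.

    Each track is a list with one set of (row, col) pixels per frame.
    An empty set means the instance is absent in that frame.
    This representation is an assumption made for the example.
    """
    if len(pred_frames) != len(gt_frames):
        raise ValueError("tracks must cover the same frames")
    # Sum intersections and unions across frames before dividing.
    intersection = sum(len(p & g) for p, g in zip(pred_frames, gt_frames))
    union = sum(len(p | g) for p, g in zip(pred_frames, gt_frames))
    return intersection / union if union else 0.0


# Three-frame toy example: the prediction misses the object in the last frame.
pred = [{(0, 0), (0, 1)}, {(1, 1)}, set()]
gt = [{(0, 1), (0, 2)}, {(1, 1)}, {(2, 2)}]
iou = video_mask_iou(pred, gt)

# A prediction counts as a match at a given threshold if its IoU reaches it.
matches_at_50 = iou >= 0.5
matches_at_75 = iou >= 0.75
```

Because intersections and unions are summed over frames first, a track that segments each frame well but drops the object for part of the video is penalized, which a per-frame average would hide.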
Results
Performance results of various models on this benchmark. A dash (-) marks a metric with no value listed for that entry.
Comparison Table
Model Name | AP50 | AP75 | AR1 | AR10 | mask AP |
---|---|---|---|---|---|
crossover-learning-for-fast-online-video | 57.3 | 39.7 | 36 | 42 | 36.6 |
mdqe-mining-discriminative-query-embeddings | 84.9 | 67.3 | 53.5 | 65.0 | 59.9 |
in-defense-of-online-models-for-video | 74 | 52.9 | 47.7 | 58.7 | 49.5 |
msn-efficient-online-mask-selection-network | 69.4 | 54.9 | 40.1 | 55.0 | 48.8 |
video-instance-segmentation-using-inter-frame | 65.8 | 46.8 | 43.8 | 51.2 | 42.8 |
spatial-feature-calibration-and-temporal | 56.8 | 38.0 | 34.8 | 41.8 | 36.8 |
end-to-end-video-instance-segmentation-with | 59.8 | 36.9 | 37.2 | 42.4 | 36.2 |
efficient-video-object-segmentation-via | 28.6 | 33.1 | - | - | 29.1 |
simple-online-and-realtime-tracking-with-a | 31.3 | - | - | - | 27.8 |
end-to-end-video-instance-segmentation-with | 64.0 | 45.0 | 38.3 | 44.9 | 40.1 |
sipmask-spatial-information-preservation-for | 53 | 33.3 | 33.5 | 38.9 | 32.5 |
video-sparse-transformer-with-attention | - | - | - | - | 39.0 |
object-propagation-via-inter-frame-attentions | 59.4 | 39.2 | 39.1 | 47.7 | 36.0 |
seqformer-a-frustratingly-simple-model-for | 71.1 | 55.7 | 46.8 | 56.9 | 49.0 |
track-to-detect-and-segment-an-online-multi | 52.6 | 32.8 | - | - | 32.6 |
stem-seg-spatio-temporal-embeddings-for | 55.8 | 37.9 | 34.4 | 41.6 | 34.6 |
seqformer-a-frustratingly-simple-model-for | 69.8 | 51.8 | 45.5 | 54.8 | 47.4 |
instanceformer-an-online-video-instance | 78.0 | 64.2 | 50.9 | 61.6 | 56.3 |
video-instance-segmentation | 51.1 | 32.6 | 31 | 35.5 | 30.3 |
novis-a-case-for-end-to-end-near-online-video | 75.7 | 56.9 | 50.3 | 60.6 | 52.8 |
devis-making-deformable-transformers-work-for | 66.7 | 48.6 | 42.4 | 51.6 | 44.4 |
mask2former-for-video-instance-segmentation | 84.4 | 67.0 | - | - | 60.4 |
seqformer-a-frustratingly-simple-model-for | 82.1 | 66.4 | 51.7 | 64.4 | 59.3 |
stem-seg-spatio-temporal-embeddings-for | 50.7 | 37.9 | 34.4 | 41.6 | 30.6 |
prototypical-cross-attention-networks-for | 54.9 | 39.4 | 36.3 | 41.6 | 36.1 |
occluded-video-instance-segmentation | 55.6 | 38.1 | - | - | 35.1 |
minvis-a-minimal-video-instance-segmentation | 83.3 | 68.6 | 54.8 | 66.6 | 61.6 |
mask2former-for-video-instance-segmentation | 68.0 | 50.0 | - | - | 46.4 |
dvis-decoupled-video-instance-segmentation | 88.0 | 72.7 | 56.5 | 70.3 | 64.9 |
mask2former-for-video-instance-segmentation | 72.8 | 54.2 | - | - | 49.2 |
tube-link-a-flexible-cross-tube-baseline-for | 86.6 | 71.3 | 55.9 | 69.1 | 64.6 |
video-k-net-a-simple-strong-and-unified | 79.0 | 59.6 | 49.7 | 59.9 | 54.1 |
occluded-video-instance-segmentation | 52.8 | 34.9 | - | - | 32.1 |
instanceformer-an-online-video-instance | 68.6 | 49.6 | 42.1 | 53.5 | 45.6 |
devis-making-deformable-transformers-work-for | 80.8 | 66.3 | 50.8 | 61.0 | 57.1 |
seqformer-a-frustratingly-simple-model-for | 66.9 | 50.5 | 45.6 | 54.6 | 45.1 |
stc-spatio-temporal-contrastive-learning-for | 57.2 | 38.6 | 36.9 | 44.5 | 36.7 |
do-different-tracking-tasks-require-different | - | - | - | - | 30.1 |
univs-unified-and-universal-video | 82.1 | 65.3 | 54.7 | 66.8 | 60.0 |
sipmask-spatial-information-preservation-for | 54.1 | 35.8 | 35.4 | 40.1 | 33.7 |
compfeat-comprehensive-feature-aggregation | 56.0 | 38.6 | 33.1 | 40.3 | 35.3 |
dvis-improved-decoupled-framework-for | 88.8 | 75.3 | 57.9 | 73.7 | 67.7 |
1st-place-solution-for-youtubevos-challenge | 76.6 | 65.6 | 47 | 57.9 | 54.3 |
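To compare entries programmatically, the rows of the table can be parsed and sorted by any metric. The following is a minimal sketch, assuming the pipe-delimited layout shown above and treating "-" as a missing value; `parse_row` and `rank` are hypothetical helper names, and only three rows from the table are included to keep the example short.

```python
# Column order as it appears in the table header.
COLUMNS = ["AP50", "AP75", "AR1", "AR10", "mask AP"]


def parse_row(line):
    """Split one table row into (name, {metric: float or None})."""
    cells = [c.strip() for c in line.strip().rstrip("|").split("|")]
    name, values = cells[0], cells[1:]
    scores = {}
    for col, raw in zip(COLUMNS, values):
        # A dash means no value is listed for this metric.
        scores[col] = None if raw == "-" else float(raw)
    return name, scores


def rank(rows, metric="mask AP"):
    """Return parsed rows sorted by metric, best first, skipping missing values."""
    parsed = [parse_row(r) for r in rows]
    scored = [(n, s) for n, s in parsed if s[metric] is not None]
    return sorted(scored, key=lambda item: item[1][metric], reverse=True)


rows = [
    "dvis-decoupled-video-instance-segmentation | 88.0 | 72.7 | 56.5 | 70.3 | 64.9 |",
    "video-sparse-transformer-with-attention | - | - | - | - | 39.0 |",
    "dvis-improved-decoupled-framework-for | 88.8 | 75.3 | 57.9 | 73.7 | 67.7 |",
]

for name, scores in rank(rows):
    print(name, scores["mask AP"])
```

Entries without a value for the chosen metric are dropped from the ranking rather than treated as zero, so sorting by AP50 would omit the second row here.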