MTTR (w=10) | 0.392 | 0.698 | 0.701 | 0.939 | 0.852 | 0.616 | 0.166 | 0.001 | End-to-End Referring Video Object Segmentation with Multimodal Transformers | |
Hui et al. | 0.335 | 0.604 | 0.598 | 0.783 | 0.639 | 0.378 | 0.076 | 0.000 | Collaborative Spatial-Temporal Modeling for Language-Queried Video Actor Segmentation | - |
ClawCraneNet | - | 0.655 | 0.644 | 0.880 | 0.796 | 0.566 | 0.147 | 0.002 | ClawCraneNet: Leveraging Object-level Relation for Text-based Video Segmentation | - |
Gavrilyuk et al. | 0.233 | 0.542 | 0.541 | 0.699 | 0.460 | 0.173 | 0.014 | 0.000 | Actor and Action Video Segmentation from a Sentence | |
CMPC-V | 0.342 | 0.617 | 0.616 | 0.813 | 0.657 | 0.371 | 0.070 | 0.000 | Cross-Modal Progressive Comprehension for Referring Segmentation | |
AAMN | 0.321 | 0.576 | 0.583 | 0.773 | 0.627 | 0.360 | 0.044 | 0.000 | Actor and Action Modular Network for Text-based Video Segmentation | - |
SgMg (Video-Swin-B) | 0.450 | 0.725 | 0.737 | 0.972 | 0.917 | 0.714 | 0.225 | 0.003 | Spectrum-guided Multi-granularity Referring Video Object Segmentation | |
VLIDE | 0.441 | 0.666 | 0.680 | 0.874 | 0.791 | 0.586 | 0.182 | 0.003 | Deeply Interleaved Two-Stream Encoder for Referring Video Segmentation | - |
VT-Capsule | 0.261 | 0.550 | 0.535 | 0.677 | 0.513 | 0.283 | 0.051 | 0.000 | Visual-Textual Capsule Routing for Text-Based Video Segmentation | - |
MTTR (w=8) | 0.366 | 0.679 | 0.674 | 0.910 | 0.815 | 0.570 | 0.144 | 0.001 | End-to-End Referring Video Object Segmentation with Multimodal Transformers | |
Hu et al. | 0.178 | 0.528 | 0.546 | 0.633 | 0.350 | 0.085 | 0.002 | 0.000 | Segmentation from Natural Language Expressions | |
SOC (Video-Swin-B) | 0.446 | 0.723 | 0.736 | 0.969 | 0.914 | 0.711 | 0.213 | 0.001 | SOC: Semantic-Assisted Object Cluster for Referring Video Object Segmentation | |
SOC (Video-Swin-T) | 0.397 | 0.701 | 0.707 | 0.947 | 0.864 | 0.627 | 0.179 | 0.001 | SOC: Semantic-Assisted Object Cluster for Referring Video Object Segmentation | |
Li et al. | 0.173 | 0.491 | 0.529 | 0.578 | 0.335 | 0.103 | 0.060 | 0.000 | Tracking by Natural Language Specification | - |
Gavrilyuk et al. (Optical flow) | 0.267 | 0.570 | 0.555 | 0.712 | 0.518 | 0.264 | 0.030 | 0.000 | Actor and Action Video Segmentation from a Sentence | |