Gavrilyuk et al. (Optical flow) | 0.215 | 0.426 | 0.551 | 0.500 | 0.376 | 0.231 | 0.094 | 0.004 | Actor and Action Video Segmentation from a Sentence | |
VLIDE | 0.469 | 0.598 | 0.714 | 0.702 | 0.663 | 0.585 | 0.428 | 0.151 | Deeply Interleaved Two-Stream Encoder for Referring Video Segmentation | - |
ReferFormer (Video-Swin-B) | 0.550 | 0.703 | 0.786 | 0.831 | 0.804 | 0.741 | 0.579 | 0.212 | Language as Queries for Referring Video Object Segmentation | |
Hui et al. | 0.399 | 0.561 | 0.662 | 0.654 | 0.589 | 0.497 | 0.333 | 0.091 | Collaborative Spatial-Temporal Modeling for Language-Queried Video Actor Segmentation | - |
MANET | 0.471 | 0.632 | 0.726 | 0.734 | 0.682 | 0.579 | 0.389 | 0.132 | Multi-Attention Network for Compressed Video Referring Object Segmentation | |
ClawCraneNet | - | 0.655 | 0.644 | 0.704 | 0.677 | 0.617 | 0.489 | 0.171 | ClawCraneNet: Leveraging Object-level Relation for Text-based Video Segmentation | - |
CMPC-V (R2D) | 0.351 | 0.515 | 0.649 | 0.590 | 0.527 | 0.434 | 0.284 | 0.068 | Cross-Modal Progressive Comprehension for Referring Segmentation | |
SOC (Video-Swin-B) | 0.573 | 0.725 | 0.807 | 0.851 | 0.827 | 0.765 | 0.607 | 0.252 | SOC: Semantic-Assisted Object Cluster for Referring Video Object Segmentation | |
MTTR (w=8) | 0.447 | 0.618 | 0.702 | 0.721 | 0.684 | 0.607 | 0.456 | 0.164 | End-to-End Referring Video Object Segmentation with Multimodal Transformers | |
AAMN | 0.396 | 0.552 | 0.617 | 0.681 | 0.629 | 0.523 | 0.296 | 0.029 | Actor and Action Modular Network for Text-based Video Segmentation | - |
CMPC-V (I3D) | 0.404 | 0.573 | 0.653 | 0.655 | 0.592 | 0.506 | 0.342 | 0.098 | Cross-Modal Progressive Comprehension for Referring Segmentation | |
Locater | 0.465 | 0.597 | 0.690 | 0.709 | 0.640 | 0.525 | 0.351 | 0.101 | Local-Global Context Aware Transformer for Language-Guided Video Segmentation | |
VT-Capsule | 0.303 | 0.460 | 0.568 | 0.526 | 0.450 | 0.345 | 0.207 | 0.036 | Visual-Textual Capsule Routing for Text-Based Video Segmentation | - |
Hu et al. | 0.132 | 0.350 | 0.474 | 0.348 | 0.236 | 0.133 | 0.033 | 0.000 | Segmentation from Natural Language Expressions | |
Gavrilyuk et al. | 0.198 | 0.421 | 0.536 | 0.475 | 0.347 | 0.211 | 0.080 | 0.002 | Actor and Action Video Segmentation from a Sentence | |
MTTR (w=10) | 0.461 | 0.640 | 0.720 | 0.754 | 0.712 | 0.638 | 0.485 | 0.169 | End-to-End Referring Video Object Segmentation with Multimodal Transformers | |