Referring Expression Segmentation On A2D

Metrics

IoU mean

IoU overall

Precision@0.5

Precision@0.6

Precision@0.7

Precision@0.8

Precision@0.9
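
The page lists these metrics without defining them. The sketch below shows one common way they are computed for referring segmentation, and is not necessarily the exact protocol used by any paper in the table. It assumes that "IoU mean" averages per-sample IoU, that "IoU overall" divides total intersection by total union across all samples, and that Precision@K is the fraction of samples whose IoU is strictly greater than K. The function names, the 2D-list mask format, and the handling of empty masks are illustrative assumptions.

```python
def mask_counts(pred, gt):
    """Intersection and union pixel counts for two same-shaped binary masks.

    Masks are 2D lists of 0/1 values (a hypothetical format for this sketch).
    """
    inter = union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if p and g:
                inter += 1
            if p or g:
                union += 1
    return inter, union


def summarize(preds, gts, thresholds=(0.5, 0.6, 0.7, 0.8, 0.9)):
    """Compute mean IoU, overall IoU, and Precision@K over a set of samples."""
    ious = []
    total_inter = total_union = 0
    for pred, gt in zip(preds, gts):
        inter, union = mask_counts(pred, gt)
        # Assumption: an empty prediction for an empty target scores IoU 1.
        ious.append(inter / union if union > 0 else 1.0)
        total_inter += inter
        total_union += union

    metrics = {
        # Average of per-sample IoU values.
        "IoU mean": sum(ious) / len(ious),
        # Pooled IoU: total intersection over total union.
        "IoU overall": total_inter / total_union if total_union > 0 else 1.0,
    }
    for t in thresholds:
        # Assumption: strict inequality, as in many public evaluation scripts.
        metrics[f"Precision@{t}"] = sum(iou > t for iou in ious) / len(ious)
    return metrics
```

Because the overall IoU pools pixels across samples, it weights large objects more heavily than the mean IoU does, which is one reason the two columns can rank models differently.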

Results

Performance results of various models on this benchmark

Model Name	AP	IoU mean	IoU overall	Precision@0.5	Precision@0.6	Precision@0.7	Precision@0.8	Precision@0.9	Paper Title	Repository
CMDy	0.333	0.531	0.623	0.607	0.525	0.405	0.235	0.045	Context Modulated Dynamic Networks for Actor and Action Video Segmentation with Language Queries	-
Gavrilyuk et al. (Optical flow)	0.215	0.426	0.551	0.5	0.376	0.231	0.094	0.004	Actor and Action Video Segmentation from a Sentence
VLIDE	0.469	0.598	0.714	0.702	0.663	0.585	0.428	0.151	Deeply Interleaved Two-Stream Encoder for Referring Video Segmentation	-
ReferFormer (Video-Swin-B)	0.550	0.703	0.786	0.831	0.804	0.741	0.579	0.212	Language as Queries for Referring Video Object Segmentation
Hui et al.	0.399	0.561	0.662	0.654	0.589	0.497	0.333	0.091	Collaborative Spatial-Temporal Modeling for Language-Queried Video Actor Segmentation	-
MANET	0.471	0.632	0.726	0.734	0.682	0.579	0.389	0.132	Multi-Attention Network for Compressed Video Referring Object Segmentation
ACGA	0.274	0.490	0.601	0.557	0.459	0.319	0.16	0.02	Asymmetric Cross-Guided Attention Network for Actor and Action Video Segmentation From Natural Language Query	-
ClawCraneNet	-	0.655	0.644	0.704	0.677	0.617	0.489	0.171	ClawCraneNet: Leveraging Object-level Relation for Text-based Video Segmentation	-
CMPC-V (R2D)	0.351	0.515	0.649	0.590	0.527	0.434	0.284	0.068	Cross-Modal Progressive Comprehension for Referring Segmentation
SOC (Video-Swin-B)	0.573	0.725	0.807	0.851	0.827	0.765	0.607	0.252	SOC: Semantic-Assisted Object Cluster for Referring Video Object Segmentation
MTTR (w=8)	0.447	0.618	0.702	0.721	0.684	0.607	0.456	0.164	End-to-End Referring Video Object Segmentation with Multimodal Transformers
RefVOS	-	0.599	0.599	0.495	-	-	-	0.064	RefVOS: A Closer Look at Referring Expressions for Video Object Segmentation
AAMN	0.396	0.552	0.617	0.681	0.629	0.523	0.296	0.029	Actor and Action Modular Network for Text-based Video Segmentation	-
CMPC-V (I3D)	0.404	0.573	0.653	0.655	0.592	0.506	0.342	0.098	Cross-Modal Progressive Comprehension for Referring Segmentation
Locater	0.465	0.597	0.69	0.709	0.64	0.525	0.351	0.101	Local-Global Context Aware Transformer for Language-Guided Video Segmentation
VT-Capsule	0.303	0.460	0.568	0.526	0.450	0.345	0.207	0.036	Visual-Textual Capsule Routing for Text-Based Video Segmentation	-
Hu et al.	0.132	0.350	0.474	0.348	0.236	0.133	0.033	0.000	Segmentation from Natural Language Expressions
Gavrilyuk et al.	0.198	0.421	0.536	0.475	0.347	0.211	0.08	0.002	Actor and Action Video Segmentation from a Sentence
MTTR (w=10)	0.461	0.64	0.72	0.754	0.712	0.638	0.485	0.169	End-to-End Referring Video Object Segmentation with Multimodal Transformers
CMSA+CFSA	-	0.432	0.618	0.487	0.431	0.358	0.231	0.052	Referring Segmentation in Images and Videos with Cross-Modal Self-Attention Network	-