Referring Expression Segmentation On Refer 1

Evaluation Metrics

J&F
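
In video object segmentation benchmarks, J&F is conventionally the mean of J (region similarity, the Jaccard index between predicted and ground-truth masks) and F (contour accuracy, a boundary F-measure). A minimal sketch of that averaging, assuming this convention applies to the scores listed on this page; the function name `j_and_f` is hypothetical and used only for illustration:

```python
def j_and_f(j: float, f: float) -> float:
    """Return the mean of region similarity (J) and contour accuracy (F).

    Assumption: J&F is the plain arithmetic mean of the two scores,
    following the usual video object segmentation convention.
    """
    return (j + f) / 2.0


# Values taken from the MTTR (w=12) row: F = 56.64, J = 54.00
score = j_and_f(j=54.00, f=56.64)
print(round(score, 2))  # prints 55.32, matching that row's J&F column
```

Recomputing J&F from the rounded J and F columns may differ from a listed value by about 0.1 for some rows (for example, the ReferFormer row gives 55.7 against a listed 55.6), presumably because the published figures were averaged before rounding.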

Results

Performance of each model on this benchmark

Model	F	J	J&F	Paper Title	Repository
MTTR (w=12)	56.64	54.00	55.32	End-to-End Referring Video Object Segmentation with Multimodal Transformers
ReferFormer (ResNet-50)	56.6	54.8	55.6	Language as Queries for Referring Video Object Segmentation
ReferDINO (Swin-B)	71.5	67.0	69.3	ReferDINO: Referring Video Object Segmentation with Visual Grounding Foundations	-
MLRLSA	48.43	50.96	49.70	Multi-Level Representation Learning With Semantic Alignment for Referring Video Object Segmentation	-
GLEE-Pro	72.9	68.2	70.6	General Object Foundation Model for Images and Videos at Scale
URVOS	50.8	47.0	48.9	URVOS: Unified Referring Video Object Segmentation Network with a Large-Scale Benchmark	-
MPG-SAM 2	76.1	71.7	73.9	MPG-SAM 2: Adapting SAM 2 with Mask Priors and Global Context for Referring Video Object Segmentation
HTR (Pre-training)	68.9	65.3	67.1	Temporally Consistent Referring Video Object Segmentation with Hybrid Memory
ViLLa	68.6	64.6	66.5	ViLLa: Video Reasoning Segmentation with Large Language Model
UniRef++-L	69.0	64.8	66.9	UniRef++: Segment Every Reference Object in Spatial and Temporal Spaces
SOC (Video-Swin-T)	60.5	57.8	59.2	SOC: Semantic-Assisted Object Cluster for Referring Video Object Segmentation
UNINEXT-H	72.7	67.6	70.1	Universal Instance Perception as Object Discovery and Retrieval
UniLSeg-100	67.0	62.8	64.9	Universal Segmentation at Arbitrary Granularity with Language Instruction
GroPrompt	66.9	64.1	65.5	GroPrompt: Efficient Grounded Prompting and Adaptation for Referring Video Object Segmentation	-
VRS-HQ (Chat-UniVi-13B)	73.1	69	71	The Devil is in Temporal Token: High Quality Video Reasoning Segmentation
R2VOS (Video-Swin-T)	63.1	59.6	61.3	Towards Robust Referring Video Object Segmentation with Cyclic Relational Consensus
MUTR	70.4	66.4	68.4	Referred by Multi-Modality: A Unified Temporal Transformer for Video Object Segmentation
SOC (Joint training, Video-Swin-B)	69.3	65.3	67.3±0.5	SOC: Semantic-Assisted Object Cluster for Referring Video Object Segmentation
VLT	65.6	61.9	63.8	VLT: Vision-Language Transformer and Query Generation for Referring Segmentation
LoSh-R	66.0	62.5	64.2	LoSh: Long-Short Text Joint Prediction Network for Referring Video Object Segmentation
