HyperAI
Referring Expression Segmentation On Refer 1
Metrics
F
J
J&F
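As a rough illustration of how the three metrics relate: in the usual video object segmentation convention, J is region similarity (mask IoU), F is contour accuracy, and J&F is their arithmetic mean. The sketch below assumes this benchmark follows that convention. The function name `j_and_f` is hypothetical, and the input values are taken from the MTTR (w=12) row of this leaderboard.

```python
def j_and_f(j: float, f: float) -> float:
    """Return the mean of region similarity (J) and contour accuracy (F).

    Assumption: J&F is the plain arithmetic mean of the two scores,
    as in the common video object segmentation convention.
    """
    return (j + f) / 2.0


# MTTR (w=12) as listed on this page: F = 56.64, J = 54.00
score = j_and_f(j=54.00, f=56.64)
print(round(score, 2))
```

The result can be compared against the J&F column for the same row. Small differences on other rows are expected where the listed J and F values were already rounded.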
Results
Performance results of various models on this benchmark
| Model Name | F | J | J&F | Paper Title | Repository |
| --- | --- | --- | --- | --- | --- |
| MTTR (w=12) | 56.64 | 54.00 | 55.32 | End-to-End Referring Video Object Segmentation with Multimodal Transformers | |
| ReferFormer (ResNet-50) | 56.6 | 54.8 | 55.6 | Language as Queries for Referring Video Object Segmentation | |
| ReferDINO (Swin-B) | 71.5 | 67.0 | 69.3 | ReferDINO: Referring Video Object Segmentation with Visual Grounding Foundations | - |
| MLRLSA | 48.43 | 50.96 | 49.70 | Multi-Level Representation Learning With Semantic Alignment for Referring Video Object Segmentation | - |
| GLEE-Pro | 72.9 | 68.2 | 70.6 | General Object Foundation Model for Images and Videos at Scale | |
| URVOS | 50.8 | 47.0 | 48.9 | URVOS: Unified Referring Video Object Segmentation Network with a Large-Scale Benchmark | |
| MPG-SAM 2 | 76.1 | 71.7 | 73.9 | MPG-SAM 2: Adapting SAM 2 with Mask Priors and Global Context for Referring Video Object Segmentation | - |
| HTR (Pre-training) | 68.9 | 65.3 | 67.1 | Temporally Consistent Referring Video Object Segmentation with Hybrid Memory | |
| ViLLa | 68.6 | 64.6 | 66.5 | ViLLa: Video Reasoning Segmentation with Large Language Model | |
| UniRef++-L | 69.0 | 64.8 | 66.9 | UniRef++: Segment Every Reference Object in Spatial and Temporal Spaces | |
| SOC (Video-Swin-T) | 60.5 | 57.8 | 59.2 | SOC: Semantic-Assisted Object Cluster for Referring Video Object Segmentation | |
| UNINEXT-H | 72.7 | 67.6 | 70.1 | Universal Instance Perception as Object Discovery and Retrieval | |
| UniLSeg-100 | 67.0 | 62.8 | 64.9 | Universal Segmentation at Arbitrary Granularity with Language Instruction | |
| GroPrompt | 66.9 | 64.1 | 65.5 | GroPrompt: Efficient Grounded Prompting and Adaptation for Referring Video Object Segmentation | - |
| VRS-HQ (Chat-UniVi-13B) | 73.1 | 69 | 71 | The Devil is in Temporal Token: High Quality Video Reasoning Segmentation | |
| R2VOS (Video-Swin-T) | 63.1 | 59.6 | 61.3 | Towards Robust Referring Video Object Segmentation with Cyclic Relational Consensus | |
| MUTR | 70.4 | 66.4 | 68.4 | Referred by Multi-Modality: A Unified Temporal Transformer for Video Object Segmentation | |
| SOC (Joint training, Video-Swin-B) | 69.3 | 65.3 | 67.3±0.5 | SOC: Semantic-Assisted Object Cluster for Referring Video Object Segmentation | |
| VLT | 65.6 | 61.9 | 63.8 | VLT: Vision-Language Transformer and Query Generation for Referring Segmentation | |
| LoSh-R | 66.0 | 62.5 | 64.2 | LoSh: Long-Short Text Joint Prediction Network for Referring Video Object Segmentation | |