HyperAI超神经
Video Retrieval on MSR-VTT-1kA
Evaluation Metrics

- text-to-video Median Rank
- text-to-video R@1
- text-to-video R@5
- text-to-video R@10
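R@K is the percentage of text queries whose ground-truth video appears in the top-K retrieved results (higher is better), and Median Rank is the median position of the ground-truth video in the ranked list (lower is better). A minimal sketch of how these metrics are typically computed, assuming the standard MSR-VTT 1k-A convention that query i's ground-truth is video i (function name and setup are illustrative, not from this page):

```python
import numpy as np

def retrieval_metrics(sim):
    """Compute text-to-video retrieval metrics from a similarity matrix.

    sim[i, j] is the similarity between text query i and video j; the
    ground-truth video for query i is assumed to be video i (the usual
    MSR-VTT 1k-A evaluation convention).
    """
    n = len(sim)
    # Sort candidate videos by descending similarity for each query.
    order = np.argsort(-sim, axis=1)
    # Position of the ground-truth video in each query's ranking (1 = best).
    ranks = np.argmax(order == np.arange(n)[:, None], axis=1) + 1
    return {
        "R@1": float(np.mean(ranks <= 1) * 100),
        "R@5": float(np.mean(ranks <= 5) * 100),
        "R@10": float(np.mean(ranks <= 10) * 100),
        "MedR": float(np.median(ranks)),
    }
```

For a perfect retriever the similarity matrix is strongest on the diagonal, so every R@K is 100 and the median rank is 1.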
Benchmark Results

Performance of each model on this benchmark:
| Model | text-to-video Median Rank | text-to-video R@1 | text-to-video R@5 | text-to-video R@10 | Paper Title |
|---|---|---|---|---|---|
| UniVL + MELTR | 4 | 31.1 | 55.7 | 68.3 | MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models |
| Side4Video | 1 | 52.3 | 75.5 | 84.2 | Side4Video: Spatial-Temporal Side Network for Memory-Efficient Image-to-Video Transfer Learning |
| CLIP2TV | 1 | 52.9 | 78.5 | 86.5 | CLIP2TV: Align, Match and Distill for Video-Text Retrieval |
| OmniVec | - | - | - | 89.4 | OmniVec: Learning robust representations with cross modal sharing |
| HiTeA | - | 46.8 | 71.2 | 81.9 | HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training |
| LAFF | - | 45.8 | 71.5 | 82.0 | Lightweight Attentional Feature Fusion: A New Baseline for Text-to-Video Retrieval |
| TeachCLIP (ViT-B/16) | - | 48.0 | 75.9 | 83.5 | Holistic Features are almost Sufficient for Text-to-Video Retrieval |
| COTS | 2 | 36.8 | 63.8 | 73.2 | COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval |
| Florence | - | 37.6 | 63.8 | 72.6 | Florence: A New Foundation Model for Computer Vision |
| Singularity | - | 41.5 | 68.7 | 77.0 | Revealing Single Frame Bias for Video-and-Language Learning |
| X-CLIP | 2 | 49.3 | 75.8 | 84.8 | X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval |
| CLIP4Clip | 2 | - | - | 81.6 | CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval |
| TACo | 4 | 28.4 | 57.8 | 71.2 | TACo: Token-aware Cascade Contrastive Learning for Video-Text Alignment |
| MDMMT | 2 | 38.9 | 69.0 | 79.7 | MDMMT: Multidomain Multimodal Transformer for Video Retrieval |
| RTQ | - | 53.4 | 76.1 | 84.4 | RTQ: Rethinking Video-language Understanding Based on Image-text Model |
| UCoFiA | - | 49.4 | 72.1 | 83.5 | Unified Coarse-to-Fine Alignment for Video-Text Retrieval |
| All-in-one + MELTR | - | 41.3 | 73.5 | 82.5 | MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models |
| PIDRo | 1 | 55.9 | 79.8 | 87.6 | PIDRo: Parallel Isomeric Attention with Dynamic Routing for Text-Video Retrieval |
| STAN | 1 | 54.1 | 79.5 | 87.8 | Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring |
| FROZEN | 3 | 31.0 | 59.5 | 70.5 | Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval |
Showing 20 of 63 leaderboard entries.