HyperAI超神経
Video Retrieval on MSVD
Evaluation Metrics
text-to-video R@1
video-to-text R@1
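For readers unfamiliar with the metric, R@1 (Recall@1) is commonly computed as the percentage of queries whose top-ranked candidate is a correct match. The following is a minimal sketch under that common definition, not the evaluation code of any listed paper. The function name `recall_at_k`, the toy similarity scores, and the ground-truth sets are all hypothetical. Ground truth is a set per query because on datasets with several captions per video, video-to-text retrieval is typically counted as a hit if any matching caption ranks first.

```python
def recall_at_k(sim, gt, k=1):
    """Percentage of queries whose top-k ranked candidates include a correct match.

    sim[i][j] is the similarity between query i and candidate j.
    gt[i] is the set of candidate indices that are correct for query i.
    """
    hits = 0
    for scores, correct in zip(sim, gt):
        # Rank candidate indices by descending similarity.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if any(j in correct for j in ranked[:k]):
            hits += 1
    return 100.0 * hits / len(sim)


# Toy example: 3 text queries against 3 videos (scores are made up).
text_to_video = [
    [0.9, 0.1, 0.3],  # query 0 ranks video 0 first (correct)
    [0.2, 0.4, 0.8],  # query 1 ranks video 2 first (correct one is video 1)
    [0.1, 0.2, 0.7],  # query 2 ranks video 2 first (correct)
]
ground_truth = [{0}, {1}, {2}]

t2v_r1 = recall_at_k(text_to_video, ground_truth, k=1)
t2v_r2 = recall_at_k(text_to_video, ground_truth, k=2)
```

For video-to-text R@1, the same function would be applied with videos as queries and captions as candidates, assuming the similarity matrix is transposed accordingly.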
Results
Performance results of each model on this benchmark
| Model | text-to-video R@1 | video-to-text R@1 | Paper Title | Repository |
| --- | --- | --- | --- | --- |
| InternVideo2-6B | 61.4 | 85.2 | InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | |
| InternVideo | 58.4 | 76.3 | InternVideo: General Video Foundation Models via Generative and Discriminative Learning | |
| VLAB | 57.5 | - | VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending | - |
| CAMoE | 51.8 | 69.3 | Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax Loss | |
| DiffusionRet+QB-Norm | 47.9 | 60.3 | DiffusionRet: Generative Text-Video Retrieval with Diffusion Model | |
| X-CLIP | 50.4 | 66.8 | X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval | |
| MDMMT-2 | 56.8 | - | MDMMT-2: Multidomain Multimodal Transformer for Video Retrieval, One More Step Towards Generalization | - |
| PAU | 47.3 | 68.9 | Prototype-based Aleatoric Uncertainty Quantification for Cross-modal Retrieval | |
| CLIP4Clip | 46.2 | 62.0 | CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval | |
| DiffusionRet | 46.6 | 61.9 | DiffusionRet: Generative Text-Video Retrieval with Diffusion Model | |
| CLIP | 37 | 59.9 | A Straightforward Framework For Video Retrieval Using CLIP | |
| Cap4Video | 51.8 | 70.0 | Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval? | |
| HunYuan_tvr (huge) | 59.0 | 73.0 | Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations | - |
| Collaborative Experts | 19.8 | - | Use What You Have: Video Retrieval Using Representations From Collaborative Experts | |
| vid-TLDR (UMT-L) | 57.9 | 82.7 | vid-TLDR: Training Free Token merging for Light-weight Video Transformer | |
| LAFF | 45.4 | - | Lightweight Attentional Feature Fusion: A New Baseline for Text-to-Video Retrieval | |
| QB-Norm+CLIP2Video | 48.0 | - | Cross Modal Retrieval with Querybank Normalisation | |
| X-Pool | 47.2 | 66.4 | X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval | |
| CenterCLIP (ViT-B/16) | 50.6 | 68.4 | CenterCLIP: Token Clustering for Efficient Text-Video Retrieval | |
| DMAE (ViT-B/32) | 48.7 | - | Dual-Modal Attention-Enhanced Text-Video Retrieval with Triplet Partial Margin Contrastive Learning | - |