Video Retrieval on ActivityNet
Evaluation Metrics
text-to-video Median Rank
text-to-video R@1
text-to-video R@5
text-to-video R@50
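The page lists these metrics without defining them. As a minimal sketch, assuming the usual retrieval definitions (R@K is the percentage of text queries whose correct video appears in the top K results, and Median Rank is the median position of the correct video), they could be computed as follows. The similarity matrix layout, the function names, and the toy scores are all hypothetical and do not come from any listed paper.

```python
from statistics import median


def retrieval_ranks(similarity):
    """Return the 1-based rank of the correct video for each text query.

    Assumption: similarity[i][j] scores text query i against video j,
    and the correct video for query i is video i.
    """
    ranks = []
    for i, row in enumerate(similarity):
        target = row[i]
        # Rank = 1 + number of videos scored strictly higher than the target.
        ranks.append(1 + sum(1 for score in row if score > target))
    return ranks


def recall_at_k(ranks, k):
    """Percentage of queries whose correct video is ranked within the top k."""
    return 100.0 * sum(1 for r in ranks if r <= k) / len(ranks)


def median_rank(ranks):
    """Median position of the correct video (lower is better)."""
    return median(ranks)


# Toy 4-query x 4-video score matrix (made-up values for illustration).
similarity = [
    [0.9, 0.1, 0.3, 0.2],
    [0.8, 0.5, 0.1, 0.2],
    [0.2, 0.6, 0.4, 0.7],
    [0.1, 0.2, 0.3, 0.9],
]

ranks = retrieval_ranks(similarity)
print("ranks:", ranks)
print("R@1:", recall_at_k(ranks, 1))
print("R@2:", recall_at_k(ranks, 2))
print("Median Rank:", median_rank(ranks))
```

Under these definitions, higher R@K and lower Median Rank both indicate better retrieval. Individual papers may differ in details such as tie handling or test split, so scores are only comparable when the evaluation protocol matches.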
Evaluation Results
Performance results of each model on this benchmark
| Model | text-to-video Median Rank | text-to-video R@1 | text-to-video R@5 | text-to-video R@50 | Paper Title | Repository |
| --- | --- | --- | --- | --- | --- | --- |
| HD-VILA | 4 | 28.5 | 57.4 | 94 | Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions | |
| Singularity | - | 47.1 | 75.5 | - | Revealing Single Frame Bias for Video-and-Language Learning | |
| RTQ | - | 53.5 | 81.4 | - | RTQ: Rethinking Video-language Understanding Based on Image-text Model | |
| MMT-Pretrained | 3.3 | 28.7 | 61.4 | 94.5 | Multi-modal Transformer for Video Retrieval | |
| Ours | - | 25.4 | 59.1 | - | Video and Text Matching with Conditioned Embeddings | |
| CAMoE | 1 | 51.0 | 77.7 | - | Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax Loss | |
| InternVideo | - | 62.2 | - | - | InternVideo: General Video Foundation Models via Generative and Discriminative Learning | |
| DiffusionRet | 2.0 | 45.8 | 75.6 | - | DiffusionRet: Generative Text-Video Retrieval with Diffusion Model | |
| X-CLIP | - | 46.2 | 75.5 | - | X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval | |
| HiTeA | - | 49.7 | 77.1 | - | HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training | - |
| VALOR | - | 70.1 | 90.8 | - | VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | |
| CLIP4Clip | 2 | 40.5 | 73.4 | 98.2 | CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval | |
| HBI | 2.0 | 42.2 | 73.0 | - | Video-Text as Game Players: Hierarchical Banzhaf Interaction for Cross-Modal Representation Learning | |
| EMCL-Net | - | 41.2 | 72.7 | - | Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations | |
| EMCL-Net++ | - | 50.6 | 78.7 | 98.1 | Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations | |
| COSA | - | 67.3 | - | - | COSA: Concatenated Sample Pretrained Vision-Language Foundation Model | |
| DiffusionRet+QB-Norm | 2.0 | 48.1 | - | - | DiffusionRet: Generative Text-Video Retrieval with Diffusion Model | |
| MMT | 5 | 22.7 | 54.2 | 93.2 | Multi-modal Transformer for Video Retrieval | |
| CLIP-ViP | 1 | 61.4 | 85.7 | - | CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment | |
| DMAE (ViT-B/32) | 1.0 | 53.4 | 80.7 | - | Dual-Modal Attention-Enhanced Text-Video Retrieval with Triplet Partial Margin Contrastive Learning | - |