HyperAI초신경
Video Retrieval on DiDeMo
Evaluation Metrics
text-to-video R@1
text-to-video R@5
text-to-video R@10
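text-to-video R@K measures the fraction of text queries for which the ground-truth video appears among the top-K retrieved results, reported as a percentage. A minimal sketch of the computation, assuming a hypothetical similarity matrix where query i's ground-truth video is at index i:

```python
def recall_at_k(sim, k):
    """text-to-video R@K: fraction of text queries whose ground-truth
    video (assumed to sit at the same index) ranks in the top-K."""
    hits = 0
    for i, row in enumerate(sim):
        # Rank video indices by descending similarity to query i.
        ranked = sorted(range(len(row)), key=lambda j: -row[j])
        if i in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(sim)  # reported as a percentage

# Toy 3-query x 3-video similarity matrix (hypothetical scores).
sim = [
    [0.9, 0.1, 0.2],  # ground truth for query 0 is video 0 (ranked 1st)
    [0.3, 0.2, 0.8],  # ground truth for query 1 is video 1 (ranked 3rd)
    [0.1, 0.7, 0.4],  # ground truth for query 2 is video 2 (ranked 2nd)
]
print(recall_at_k(sim, 1))  # 1 of 3 queries hit at K=1
```

Raising K can only increase the score, which is why R@10 is always at least as high as R@5 and R@1 in the table below.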
Evaluation Results

Performance of each model on this benchmark:

| Model Name | text-to-video R@1 | text-to-video R@5 | text-to-video R@10 | Paper Title |
|---|---|---|---|---|
| RTQ | 57.6 | 84.1 | 89.9 | RTQ: Rethinking Video-language Understanding Based on Image-text Model |
| FROZEN | 31.0 | 59.8 | 72.4 | Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval |
| DiffusionRet+QB-Norm | 48.9 | 75.5 | 83.3 | DiffusionRet: Generative Text-Video Retrieval with Diffusion Model |
| HunYuan_tvr (huge) | 52.7 | 77.8 | 85.2 | Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations |
| HD-VILA | 28.8 | 57.4 | 69.1 | Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions |
| QB-Norm+CLIP4Clip | 43.5 | 71.4 | 80.9 | Cross Modal Retrieval with Querybank Normalisation |
| Cap4Video | 52.0 | 79.4 | 87.5 | Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval? |
| STAN | 54.6 | 78.4 | 85.1 | Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring |
| CLIP-ViP | 55.3 | 82.0 | 89.3 | CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment |
| ALPRO | 35.9 | 67.5 | 78.8 | Align and Prompt: Video-and-Language Pre-training with Entity Prompts |
| DRL | 49.0 | 76.5 | 84.5 | Disentangled Representation Learning for Text-Video Retrieval |
| VAST | 72.0 | 89.0 | 91.4 | VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset |
| DMAE (ViT-B/32) | 52.7 | 79.3 | 86.6 | Dual-Modal Attention-Enhanced Text-Video Retrieval with Triplet Partial Margin Contrastive Learning |
| mPLUG-2 | 56.4 | 79.1 | 85.2 | mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video |
| Clover | 50.1 | 76.7 | 85.6 | Clover: Towards A Unified Video-Language Alignment and Fusion Model |
| VindLU | 61.2 | 85.8 | 91.0 | VindLU: A Recipe for Effective Video-and-Language Pretraining |
| Collaborative Experts | 16.1 | 41.1 | 54.4 | Use What You Have: Video Retrieval Using Representations From Collaborative Experts |
| MuLTI | 56.5 | 80.2 | 87.0 | MuLTI: Efficient Video-and-Language Understanding with Text-Guided MultiWay-Sampler and Multiple Choice Modeling |
| UMT-L (ViT-L/16) | 70.4 | 90.1 | 93.5 | Unmasked Teacher: Towards Training-Efficient Video Foundation Models |
| HiTeA | 56.5 | 81.7 | 89.7 | HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training |