Video Retrieval on MSR-VTT 1k-A
Evaluation Metrics

All metrics are text-to-video retrieval scores:

- R@1 / R@5 / R@10: the percentage of text queries for which the ground-truth video is ranked within the top 1 / 5 / 10 retrieved results (higher is better).
- Median Rank (MdR): the median rank of the ground-truth video over all text queries (lower is better).
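As a reference for how these scores are typically computed, here is a minimal sketch (not code from HyperAI or any of the listed papers; `retrieval_metrics` is a hypothetical helper). It derives R@K and Median Rank from a text-to-video similarity matrix, assuming the standard MSR-VTT 1k-A pairing in which query `i`'s ground-truth video sits at index `i`:

```python
import numpy as np

def retrieval_metrics(sim: np.ndarray) -> dict:
    """Text-to-video retrieval metrics from a (num_texts, num_videos)
    similarity matrix. Assumes query i's ground-truth video is at
    column i (the usual MSR-VTT 1k-A convention)."""
    order = np.argsort(-sim, axis=1)            # videos sorted by descending score
    gt = np.arange(sim.shape[0])[:, None]       # ground-truth video index per query
    ranks = np.argmax(order == gt, axis=1) + 1  # 1-indexed rank of the true video
    return {
        "R@1":  round(float((ranks <= 1).mean() * 100), 1),
        "R@5":  round(float((ranks <= 5).mean() * 100), 1),
        "R@10": round(float((ranks <= 10).mean() * 100), 1),
        "MdR":  float(np.median(ranks)),
    }

# Usage: a perfect retriever on 1,000 clips scores R@1 = 100.0, MdR = 1.
sim = np.eye(1000)
print(retrieval_metrics(sim))
```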
Evaluation Results

Performance of each model on this benchmark (MSR-VTT 1k-A test split). R@K values are percentages; MdR is the median rank.

| Model | R@1 | R@5 | R@10 | MdR | Paper |
| --- | --- | --- | --- | --- | --- |
| UniVL + MELTR | 31.1 | 55.7 | 68.3 | 4 | MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models |
| Side4Video | 52.3 | 75.5 | 84.2 | 1 | Side4Video: Spatial-Temporal Side Network for Memory-Efficient Image-to-Video Transfer Learning |
| CLIP2TV | 52.9 | 78.5 | 86.5 | 1 | CLIP2TV: Align, Match and Distill for Video-Text Retrieval |
| OmniVec | - | - | 89.4 | - | OmniVec: Learning robust representations with cross modal sharing |
| HiTeA | 46.8 | 71.2 | 81.9 | - | HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training |
| LAFF | 45.8 | 71.5 | 82.0 | - | Lightweight Attentional Feature Fusion: A New Baseline for Text-to-Video Retrieval |
| TeachCLIP (ViT-B/16) | 48.0 | 75.9 | 83.5 | - | Holistic Features are almost Sufficient for Text-to-Video Retrieval |
| COTS | 36.8 | 63.8 | 73.2 | 2 | COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval |
| Florence | 37.6 | 63.8 | 72.6 | - | Florence: A New Foundation Model for Computer Vision |
| Singularity | 41.5 | 68.7 | 77.0 | - | Revealing Single Frame Bias for Video-and-Language Learning |
| X-CLIP | 49.3 | 75.8 | 84.8 | 2 | X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval |
| CLIP4Clip | - | - | 81.6 | 2 | CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval |
| TACo | 28.4 | 57.8 | 71.2 | 4 | TACo: Token-aware Cascade Contrastive Learning for Video-Text Alignment |
| MDMMT | 38.9 | 69.0 | 79.7 | 2 | MDMMT: Multidomain Multimodal Transformer for Video Retrieval |
| RTQ | 53.4 | 76.1 | 84.4 | - | RTQ: Rethinking Video-language Understanding Based on Image-text Model |
| UCoFiA | 49.4 | 72.1 | 83.5 | - | Unified Coarse-to-Fine Alignment for Video-Text Retrieval |
| All-in-one + MELTR | 41.3 | 73.5 | 82.5 | - | MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models |
| PIDRo | 55.9 | 79.8 | 87.6 | 1 | PIDRo: Parallel Isomeric Attention with Dynamic Routing for Text-Video Retrieval |
| STAN | 54.1 | 79.5 | 87.8 | 1 | Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring |
| FROZEN | 31.0 | 59.5 | 70.5 | 3 | Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval |