HyperAI
Video Retrieval on MSR-VTT
Metrics
text-to-video R@1
text-to-video R@10
text-to-video R@5
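The R@K metrics listed above are conventionally read as Recall@K: the percentage of text queries for which the correct video appears among the top K retrieved results. The page does not spell out its evaluation protocol, so the following is only a minimal sketch under stated assumptions: the function name `recall_at_k`, the square score matrix, and the convention that query i matches video i are all hypothetical, and the toy scores are invented for illustration.

```python
def recall_at_k(similarity, k):
    """Percentage of text queries whose matching video ranks in the top k.

    Assumption: similarity[i][j] scores text query i against video j,
    and the correct video for query i is video i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Zero-based rank of the correct video: how many videos score strictly higher.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy 3x3 score matrix, invented for illustration only.
scores = [
    [0.9, 0.1, 0.3],  # query 0: correct video ranked first
    [0.8, 0.5, 0.2],  # query 1: correct video ranked second
    [0.7, 0.6, 0.1],  # query 2: correct video ranked third
]
r1 = recall_at_k(scores, 1)
r2 = recall_at_k(scores, 2)
r3 = recall_at_k(scores, 3)
```

Because a query that counts as a hit at a small K also counts at every larger K, R@1 ≤ R@5 ≤ R@10 should hold for any single model, which is a quick sanity check on the rows reported in the results.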
Results
Performance results of different models on this benchmark
| Model | text-to-video R@1 | text-to-video R@10 | text-to-video R@5 | Paper Title | Repository |
| --- | --- | --- | --- | --- | --- |
| TEFAL | 52 | 86.1 | 76.6 | Audio-Enhanced Text-to-Video Retrieval using Text-Conditioned Feature Alignment | - |
| VideoCoCa (zero-shot) | 34.3 | 67.0 | 57.8 | VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners | - |
| TACo | 24.8 | 64.0 | 52.1 | TACo: Token-aware Cascade Contrastive Learning for Video-Text Alignment | - |
| VIOLETv2 | 37.2 | 75.8 | 64.8 | An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling | |
| COSA | 57.9 | - | - | COSA: Concatenated Sample Pretrained Vision-Language Foundation Model | |
| CoCa (zero-shot) | 30.0 | 61.6 | 52.4 | CoCa: Contrastive Captioners are Image-Text Foundation Models | |
| CLIP | 21.4 | 50.4 | 41.1 | A Straightforward Framework For Video Retrieval Using CLIP | |
| RoME | 10.7 | 41.2 | 29.6 | RoME: Role-aware Mixture-of-Expert Transformer for Text-to-Video Retrieval | |
| InternVideo2-6B | 62.8 | - | - | InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | |
| C+LSTM+SA+FC7 | 4.2 | 19.9 | - | Learning Language-Visual Embedding for Movie Understanding with Natural-Language | - |
| VALOR | 59.9 | 89.6 | 83.5 | VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | |
| GRAM | 64 | 89.3 | - | Gramian Multimodal Representation Learning and Alignment | |
| Aurora (ours, r=64) | 52.4 | 82 | 73.9 | - | - |
| Kaufman | 4.7 | 24.1 | - | Temporal Tessellation: A Unified Approach for Video Analysis | |
| Ours | 26 | - | 56.7 | Video and Text Matching with Conditioned Embeddings | |
| Text-Video Embedding | 14.9 | 52.8 | - | HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips | |
| FROZEN | 32.5 | 71.2 | 61.5 | Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval | |
| LAFF | 29.1 | 65.8 | 54.9 | Lightweight Attentional Feature Fusion: A New Baseline for Text-to-Video Retrieval | |
| All-in-one + MELTR | 38.6 | 84.7 | 74.4 | MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models | |
| JSFusion | 10.2 | 43.2 | - | A Joint Sequence Fusion Model for Video Question Answering and Retrieval | |