HyperAI
Video Retrieval On Didemo
Metrics

- text-to-video R@1
- text-to-video R@5
- text-to-video R@10

Results

Performance of the different models on this benchmark.
| Model | text-to-video R@1 | text-to-video R@5 | text-to-video R@10 | Paper |
|---|---|---|---|---|
| InternVideo2-6B | 74.2 | - | - | InternVideo2: Scaling Foundation Models for Multimodal Video Understanding |
| vid-TLDR (UMT-L) | 72.3 | 91.2 | 94.2 | vid-TLDR: Training Free Token merging for Light-weight Video Transformer |
| VAST | 72.0 | 89.0 | 91.4 | VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset |
| COSA | 70.5 | - | - | COSA: Concatenated Sample Pretrained Vision-Language Foundation Model |
| UMT-L (ViT-L/16) | 70.4 | 90.1 | 93.5 | Unmasked Teacher: Towards Training-Efficient Video Foundation Models |
| GRAM | 67.3 | - | 90.1 | Gramian Multimodal Representation Learning and Alignment |
| VALOR | 61.5 | 85.3 | 90.4 | VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset |
| VindLU | 61.2 | 85.8 | 91.0 | VindLU: A Recipe for Effective Video-and-Language Pretraining |
| TESTA (ViT-B/16) | 61.2 | 87.2 | 91.5 | TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding |
| InternVideo | 57.9 | - | - | InternVideo: General Video Foundation Models via Generative and Discriminative Learning |
| RTQ | 57.6 | 84.1 | 89.9 | RTQ: Rethinking Video-language Understanding Based on Image-text Model |
| VLAB | 56.8 | 81.6 | 88.7 | VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending |
| MuLTI | 56.5 | 80.2 | 87.0 | MuLTI: Efficient Video-and-Language Understanding with Text-Guided MultiWay-Sampler and Multiple Choice Modeling |
| HiTeA | 56.5 | 81.7 | 89.7 | HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training |
| mPLUG-2 | 56.4 | 79.1 | 85.2 | mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video |
| CLIP-ViP | 55.3 | 82.0 | 89.3 | CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment |
| STAN | 54.6 | 78.4 | 85.1 | Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring |
| Singularity | 53.9 | 79.4 | 86.9 | Revealing Single Frame Bias for Video-and-Language Learning |
| HunYuan_tvr (huge) | 52.7 | 77.8 | 85.2 | Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations |
| DMAE (ViT-B/32) | 52.7 | 79.3 | 86.6 | Dual-Modal Attention-Enhanced Text-Video Retrieval with Triplet Partial Margin Contrastive Learning |
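The R@K scores above are recall-at-K: the percentage of text queries whose ground-truth video appears among the top-K retrieved candidates. A minimal sketch of how such a score is computed, assuming a precomputed text-video similarity matrix with one matching video per query (the `sim` and `gt` values below are illustrative toy data, not from the DiDeMo benchmark):

```python
import numpy as np

def recall_at_k(sim: np.ndarray, gt: list[int], k: int) -> float:
    """Percentage of queries whose ground-truth video is in the top-k results.

    sim: (num_texts, num_videos) similarity matrix (higher = more similar).
    gt:  gt[i] is the index of the correct video for text query i.
    """
    # Indices of the k highest-scoring videos for each text query.
    topk = np.argsort(-sim, axis=1)[:, :k]
    hits = sum(gt[i] in topk[i] for i in range(len(gt)))
    return 100.0 * hits / len(gt)

# Toy 3-query example: query 2's correct video is only ranked second.
sim = np.array([[0.9, 0.1, 0.2],
                [0.3, 0.8, 0.1],
                [0.6, 0.4, 0.5]])
gt = [0, 1, 2]
print(recall_at_k(sim, gt, 1))  # ≈ 66.7 (2 of 3 queries hit at rank 1)
print(recall_at_k(sim, gt, 2))  # 100.0
```

By construction R@1 ≤ R@5 ≤ R@10 for any model, which matches the monotone pattern across the table's columns.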