HyperAI
Action Classification On Kinetics 600
Metrics
Top-1 Accuracy
Results
Performance results of different models on this benchmark
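Top-1 accuracy counts a video clip as correct only when the model's single highest-scoring class matches the ground-truth label. A minimal sketch of the computation in NumPy (the logits and labels below are toy values for illustration, not benchmark data):

```python
import numpy as np

def top1_accuracy(logits: np.ndarray, labels: np.ndarray) -> float:
    """Fraction of samples whose argmax prediction equals the label.

    logits: (num_samples, num_classes) class scores.
    labels: (num_samples,) integer ground-truth class ids.
    """
    preds = logits.argmax(axis=1)          # highest-scoring class per sample
    return float((preds == labels).mean()) # share of exact matches

# Toy example: 4 clips, 3 classes (hypothetical values).
logits = np.array([[0.1, 0.7, 0.2],
                   [0.9, 0.05, 0.05],
                   [0.2, 0.3, 0.5],
                   [0.6, 0.3, 0.1]])
labels = np.array([1, 0, 2, 2])
print(top1_accuracy(logits, labels))  # 3 of 4 correct -> 0.75
```

The leaderboard reports this quantity as a percentage over the Kinetics-600 validation set.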
| Model Name | Top-1 Accuracy (%) | Paper Title |
| --- | --- | --- |
| InternVideo2-6B | 91.9 | InternVideo2: Scaling Foundation Models for Multimodal Video Understanding |
| TubeVit-H | 91.8 | Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning |
| InternVideo2-1B | 91.6 | InternVideo2: Scaling Foundation Models for Multimodal Video Understanding |
| TubeVit-L | 91.5 | Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning |
| InternVideo-T | 91.3 | InternVideo: General Video Foundation Models via Generative and Discriminative Learning |
| Model 45 | 91.1 | MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound |
| TubeVit-B | 90.9 | Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning |
| UMT-L (ViT-L/16) | 90.5 | Unmasked Teacher: Towards Training-Efficient Video Foundation Models |
| MTV-H (WTS 60M) | 90.3 | Multiview Transformers for Video Recognition |
| UniFormerV2-L | 90.1 | UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer |
| VideoMAE V2-g (64x266x266) | 89.9 | VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking |
| mPLUG-2 | 89.8 | mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video |
| EVA | 89.8 | EVA: Exploring the Limits of Masked Visual Representation Learning at Scale |
| Model 11 | 89.7 | MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound |
| CoCa (finetuned) | 89.4 | CoCa: Contrastive Captioners are Image-Text Foundation Models |
| Model 55 | 89.4 | MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound |
| VideoMAE V2-g | 88.8 | VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking |
| Hiera-H (no extra data) | 88.8 | Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles |
| CoCa (frozen) | 88.5 | CoCa: Contrastive Captioners are Image-Text Foundation Models |
| X-CLIP (ViT-L/14, CLIP) | 88.3 | Expanding Language-Image Pretrained Models for General Video Recognition |