HyperAI

Action Classification On Kinetics 600

Metrics

Top-1 Accuracy
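
Top-1 accuracy is the share of test videos for which the model's highest-scoring class matches the ground-truth label, reported here as a percentage. The following is a minimal sketch of that computation, assuming one score vector per video. The function name `top1_accuracy` and the toy scores and labels are illustrative assumptions and do not come from the benchmark or from any of the listed papers.

```python
def top1_accuracy(scores, labels):
    """Return the percentage of samples whose highest-scoring class equals the label.

    scores: list of per-class score lists, one per video (hypothetical input format).
    labels: list of integer ground-truth class indices, one per video.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("at least one sample is required")

    correct = 0
    for row, label in zip(scores, labels):
        # Index of the largest score is the top-1 prediction.
        predicted = max(range(len(row)), key=row.__getitem__)
        if predicted == label:
            correct += 1
    return 100.0 * correct / len(labels)


# Toy data: 4 videos, 3 classes (illustrative only).
scores = [
    [0.1, 0.7, 0.2],
    [0.6, 0.3, 0.1],
    [0.2, 0.2, 0.6],
    [0.5, 0.4, 0.1],
]
labels = [1, 0, 2, 1]

accuracy = top1_accuracy(scores, labels)
print(f"Top-1 accuracy: {accuracy:.1f}")
```

In this toy case three of the four predictions match their labels. How individual papers aggregate scores per video (for example over several clips or crops) varies and is not specified by this page.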

Results

Performance results of various models on this benchmark

| Model name | Top-1 Accuracy (%) | Paper title |
|---|---|---|
| D3D+S3D-G | 79.1 | D3D: Distilled 3D Networks for Video Action Recognition |
| XViT (x16) | 84.5 | Space-time Mixing Attention for Video Transformer |
| MoViNet-A5 | 82.7 | MoViNets: Mobile Video Networks for Efficient Video Recognition |
| VideoMAE V2-g | 88.8 | VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking |
| MoViNet-A2 | 77.5 | MoViNets: Mobile Video Networks for Efficient Video Recognition |
| PERF-Net (distilled ResNet50-G) | 82.0 | PERF-Net: Pose Empowered RGB-Flow Net |
| mPLUG-2 | 89.8 | mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video |
| Florence (curated FLD-900M pretrain) | 87.8 | Florence: A New Foundation Model for Computer Vision |
| MoViNet-A6 | 83.5 | MoViNets: Mobile Video Networks for Efficient Video Recognition |
| S3D-G (RGB) | 76.6 | Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification |
| Model 11 | 89.7 | MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound |
| UniFormer-B (ImageNet-1K) | 84.8 | UniFormer: Unified Transformer for Efficient Spatial-Temporal Representation Learning |
| LGD-3D Flow | 75 | Learning Spatio-Temporal Representation with Local and Global Diffusion |
| TokenLearner 16at18 w. Fuser (L/10) | 86.3 | TokenLearner: What Can 8 Learned Tokens Do for Images and Videos? |
| SlowFast 8x8 (ResNet-50) | 79.9 | SlowFast Networks for Video Recognition |
| UMT-L (ViT-L/16) | 90.5 | Unmasked Teacher: Towards Training-Efficient Video Foundation Models |
| SlowFast 16x8 (ResNet-101 + NL) | 81.8 | SlowFast Networks for Video Recognition |
| EVA | 89.8 | EVA: Exploring the Limits of Masked Visual Representation Learning at Scale |
| I3D (RGB) | 73.6 | A Short Note about Kinetics-600 |
| TubeVit-L | 91.5 | Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning |