HyperAI

Action Classification On Kinetics 400

Metrics

Acc@1
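
Acc@1 is conventionally top-1 accuracy: the percentage of test clips whose highest-scoring predicted class matches the ground-truth label. The sketch below shows that computation under stated assumptions. The function name `top1_accuracy` and the toy scores are illustrative and do not come from this page. The listed papers differ in how many clips and crops they average per video before taking the top class, and this sketch does not model that.

```python
from typing import Sequence


def top1_accuracy(scores: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Return the percentage of samples whose top-scoring class equals the label.

    `scores` holds one row of per-class scores per sample (hypothetical input
    layout); `labels` holds the ground-truth class index for each sample.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("at least one sample is required")

    correct = 0
    for row, label in zip(scores, labels):
        # Index of the highest score is the predicted class.
        predicted = max(range(len(row)), key=row.__getitem__)
        if predicted == label:
            correct += 1
    return 100.0 * correct / len(labels)


# Toy example: four samples, two classes (made-up numbers).
toy_scores = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4]]
toy_labels = [1, 0, 0, 0]
print(top1_accuracy(toy_scores, toy_labels))
```

With Kinetics-400 the score rows would have 400 entries, one per action class, and the result is reported on the same 0 to 100 scale as the table values.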

Results

Performance results of various models on this benchmark

| Model Name | Acc@1 | Paper Title | Repository |
| --- | --- | --- | --- |
| OmniVec | 91.1 | OmniVec: Learning robust representations with cross modal sharing | - |
| X3D-L | 77.5 | X3D: Expanding Architectures for Efficient Video Recognition | |
| ViT-B-VTN+ ImageNet-21K (84.0 [10]) | 79.8 | Video Transformer Network | |
| MViT-B, 32x3 | 80.2 | Multiscale Vision Transformers | |
| MTV-H (WTS 60M) | 89.9 | Multiview Transformers for Video Recognition | |
| AdaMAE | 81.7 | AdaMAE: Adaptive Masking for Efficient Spatiotemporal Learning with Masked Autoencoders | |
| ViC-MAE (ViT-L) | 85.1 | ViC-MAE: Self-Supervised Representation Learning from Images and Video with Contrastive Masked Autoencoders | |
| MoViNet-A4 | 80.5 | MoViNets: Mobile Video Networks for Efficient Video Recognition | |
| SlowFast 16x8 (ResNet-101) | 78.9 | SlowFast Networks for Video Recognition | |
| R[2+1]D-RGB (Sports-1M pretrain) | 74.3 | A Closer Look at Spatiotemporal Convolutions for Action Recognition | |
| X-CLIP(ViT-L/14, CLIP) | 87.7 | Expanding Language-Image Pretrained Models for General Video Recognition | |
| ip-CSN-152 (IG-65M pretraining) | 82.5 | Video Classification with Channel-Separated Convolutional Networks | |
| MARS+RGB+Flow (64 frames) | 74.9 | MARS: Motion-Augmented RGB Stream for Action Recognition | - |
| TokenLearner 16at18 (L/10) | 85.4 | TokenLearner: What Can 8 Learned Tokens Do for Images and Videos? | |
| VideoMamba-M800 | 85.0 | VideoMamba: State Space Model for Efficient Video Understanding | |
| TAdaConvNeXt-T | 79.1 | TAda! Temporally-Adaptive Convolutions for Video Understanding | |
| MAR (50% mask, ViT-B, 16x4) | 81.0 | MAR: Masked Autoencoders for Efficient Action Recognition | |
| Swin-S (ImageNet-1k pretrain) | 80.6 | Video Swin Transformer | |
| OMNIVORE (Swin-B) | 84.0 | Omnivore: A Single Model for Many Visual Modalities | |
| S3D-G (Flow, ImageNet pretrained) | 68 | Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification | |