ViT-B-VTN+ ImageNet-21K (84.0 [10]) | 79.8 | Video Transformer Network | |
SlowFast 16x8 (ResNet-101) | 78.9 | SlowFast Networks for Video Recognition | |
R(2+1)D-RGB (Sports-1M pretrain) | 74.3 | A Closer Look at Spatiotemporal Convolutions for Action Recognition | |
ip-CSN-152 (IG-65M pretraining) | 82.5 | Video Classification with Channel-Separated Convolutional Networks | |
MARS+RGB+Flow (64 frames) | 74.9 | MARS: Motion-Augmented RGB Stream for Action Recognition | |
MAR (50% mask, ViT-B, 16x4) | 81.0 | MAR: Masked Autoencoders for Efficient Action Recognition | |
Swin-S (ImageNet-1K pretrain) | 80.6 | Video Swin Transformer | |