MVFNet-ResNet50 (center crop, 8+16 ensemble, ImageNet pretrained, RGB only) | 66.3 | - | MVFNet: Multi-View Fusion Network for Efficient Video Recognition | - |
VideoMAE (no extra data, ViT-B, 16 frames) | 70.8 | 92.4 | VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training | - |
MSNet-R50En (8+16 ensemble, ImageNet pretrained) | 66.6 | 90.6 | MotionSqueeze: Neural Motion Feature Learning for Video Understanding | - |
TAda2D (ResNet-50, 8 frames) | 64.0 | 88.0 | TAda! Temporally-Adaptive Convolutions for Video Understanding | - |
VideoMAE (no extra data, ViT-L, 32x2) | 75.4 | 95.2 | VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training | - |
PAN ResNet101 (RGB only, no Flow) | 66.5 | 90.6 | PAN: Towards Fast Action Recognition via Learning Persistence of Appearance | - |
ORViT Mformer (ORViT blocks) | 67.9 | 90.5 | Object-Region Video Transformers | - |
UniFormer-B (IN-1K + Kinetics400 pretrain) | 71.2 | 92.8 | UniFormer: Unified Transformer for Efficient Spatial-Temporal Representation Learning | - |