VoV3D-L (16 frames, from scratch, single) | 9.3x6 | 5.8M | 49.5 | 78.0 | Diverse Temporal Aggregation and Depthwise Spatiotemporal Factorization for Efficient Video Classification | |
ip-CSN-152 (IG-65M pretraining) | - | - | 53.3 | - | Video Classification with Channel-Separated Convolutional Networks | |
HF-TSN (ImageNet pretraining) | - | - | 41.97 | - | Hierarchical Feature Aggregation Networks for Video Action Recognition | - |
ECO-Net (ImageNet pretrained) | - | - | 46.4 | - | ECO: Efficient Convolutional Network for Online Video Understanding | |
VoV3D-M (32 frames, from scratch, single) | 11.5x6 | 3.3M | 49.8 | 78.0 | Diverse Temporal Aggregation and Depthwise Spatiotemporal Factorization for Efficient Video Classification | |
VoV3D-L (32 frames, Kinetics pretrained, single) | 20.9x6 | 5.8M | 54.59 | 82.30 | Diverse Temporal Aggregation and Depthwise Spatiotemporal Factorization for Efficient Video Classification | |
MARS+RGB+Flow (16 frames, Kinetics pretrained) | - | - | 40.4 | - | MARS: Motion-Augmented RGB Stream for Action Recognition | - |
EAN ResNet50 (single clip, center crop, 8+16 ensemble, with sparse Transformer) | - | - | 57.2 | 83.9 | EAN: Event Adaptive Network for Enhanced Action Recognition | |
ResNet50 I3D (Kinetics pretrained) | - | - | 48.6 | - | Moments in Time Dataset: one million videos for event understanding | |
SELFYNet-TSM-R50 (16 frames, ImageNet pretrained) | - | - | 54.3 | 82.9 | Learning Self-Similarity in Space and Time as Generalized Motion for Video Action Recognition | |
CT-Net Ensemble (R50, 8+12+16+24) | - | - | 56.6 | - | CT-Net: Channel Tensorization Network for Video Classification | |