VideoMAE (K700 pretrain, ViT-L, 16x4) | 36.1 | VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training | - |
VideoMAE (K400 pretrain, ViT-B, 16x4) | 26.7 | VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training | - |
MaskFeat (Kinetics-600 pretrain, MViT-L) | 39.8 | Masked Feature Prediction for Self-Supervised Visual Pre-Training | |
MViTv2-L (IN21k, K700) | 34.4 | MViTv2: Improved Multiscale Vision Transformers for Classification and Detection | |
VideoMAE (K400 pretrain+finetune, ViT-H, 16x4) | 39.5 | VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training | - |
MVD (Kinetics-400 pretrain, ViT-B, 16x4) | 31.1 | Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning | |
MViT-B, 64x3 (Kinetics-400 pretraining) | 27.3 | Multiscale Vision Transformers | |
InternVideo | 41.01 | InternVideo: General Video Foundation Models via Generative and Discriminative Learning | |
MVD (Kinetics-400 pretrain+finetune, ViT-L, 16x4) | 38.7 | Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning | |
VideoMAE (K400 pretrain, ViT-L, 16x4) | 34.3 | VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training | - |
VideoMAE (K700 pretrain+finetune, ViT-L, 16x4) | 39.3 | VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training | - |
MVD (Kinetics-400 pretrain+finetune, ViT-B, 16x4) | 34.2 | Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning | |
SlowFast, 4x16, R50 (Kinetics-400 pretraining) | 21.9 | SlowFast Networks for Video Recognition | |
Hiera-H (K700 PT+FT) | 43.3 | Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles | |
AMD (ViT-B/16) | 33.5 | Asymmetric Masked Distillation for Pre-Training Small Foundation Models | - |
HIT | 32.6 | Holistic Interaction Transformer Network for Action Detection | |
SlowFast, 8x8, R101 (Kinetics-400 pretraining) | 23.8 | SlowFast Networks for Video Recognition | |
STAR/L | 41.7 | End-to-End Spatio-Temporal Action Localisation with Video Transformers | - |
MVD (Kinetics-400 pretrain, ViT-L, 16x4) | 37.7 | Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning | |
SlowFast, 16x8, R101+NL (Kinetics-600 pretraining) | 27.5 | SlowFast Networks for Video Recognition | |