HyperAI

Action Recognition on AVA v2.2

Metrics

mAP

Results

Performance results of various models on this benchmark

| Model name | mAP | Paper Title | Repository |
| --- | --- | --- | --- |
| VideoMAE (K700 pretrain, ViT-L, 16x4) | 36.1 | VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training | - |
| VideoMAE (K400 pretrain, ViT-B, 16x4) | 26.7 | VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training | - |
| MaskFeat (Kinetics-600 pretrain, MViT-L) | 39.8 | Masked Feature Prediction for Self-Supervised Visual Pre-Training | |
| MViTv2-L (IN21k, K700) | 34.4 | MViTv2: Improved Multiscale Vision Transformers for Classification and Detection | |
| VideoMAE (K400 pretrain+finetune, ViT-H, 16x4) | 39.5 | VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training | - |
| MVD (Kinetics400 pretrain, ViT-B, 16x4) | 31.1 | Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning | |
| MViT-B, 64x3 (Kinetics-400 pretraining) | 27.3 | Multiscale Vision Transformers | |
| InternVideo | 41.01 | InternVideo: General Video Foundation Models via Generative and Discriminative Learning | |
| MVD (Kinetics400 pretrain+finetune, ViT-L, 16x4) | 38.7 | Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning | |
| VideoMAE (K400 pretrain, ViT-L, 16x4) | 34.3 | VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training | - |
| VideoMAE (K700 pretrain+finetune, ViT-L, 16x4) | 39.3 | VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training | - |
| MVD (Kinetics400 pretrain+finetune, ViT-B, 16x4) | 34.2 | Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning | |
| SlowFast, 4x16, R50 (Kinetics-400 pretraining) | 21.9 | SlowFast Networks for Video Recognition | |
| Hiera-H (K700 PT+FT) | 43.3 | Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles | |
| AMD(ViT-B/16) | 33.5 | Asymmetric Masked Distillation for Pre-Training Small Foundation Models | - |
| HIT | 32.6 | Holistic Interaction Transformer Network for Action Detection | |
| SlowFast, 8x8, R101 (Kinetics-400 pretraining) | 23.8 | SlowFast Networks for Video Recognition | |
| STAR/L | 41.7 | End-to-End Spatio-Temporal Action Localisation with Video Transformers | - |
| MVD (Kinetics400 pretrain, ViT-L, 16x4) | 37.7 | Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning | |
| SlowFast, 16x8 R101+NL (Kinetics-600 pretraining) | 27.5 | SlowFast Networks for Video Recognition | |