Action Recognition In Videos On Something | SOTA | HyperAI

Task: Action Recognition
Metrics: Top-1 Accuracy, Top-5 Accuracy

Performance results of various models on this benchmark:

| Model Name | Top-1 Accuracy | Top-5 Accuracy | Paper Title |
| --- | --- | --- | --- |
| MVD (Kinetics400 pretrain, ViT-H, 16 frames) | 77.3 | 95.7 | Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning |
| InternVideo | 77.2 | - | InternVideo: General Video Foundation Models via Generative and Discriminative Learning |
| InternVideo2-1B | 77.1 | - | InternVideo2: Scaling Foundation Models for Multimodal Video Understanding |
| VideoMAE V2-g | 77.0 | 95.9 | VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking |
| MVD (Kinetics400 pretrain, ViT-L, 16 frames) | 76.7 | 95.5 | Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning |
| Hiera-L (no extra data) | 76.5 | - | Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles |
| TubeViT-L | 76.1 | 95.2 | Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning |
| VideoMAE (no extra data, ViT-L, 32x2) | 75.4 | 95.2 | VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training |
| Side4Video (EVA ViT-E/14) | 75.2 | 94.0 | Side4Video: Spatial-Temporal Side Network for Memory-Efficient Image-to-Video Transfer Learning |
| MaskFeat (Kinetics600 pretrain, MViT-L) | 75.0 | 95.0 | Masked Feature Prediction for Self-Supervised Visual Pre-Training |
| MAR (50% mask, ViT-L, 16x4) | 74.7 | 94.9 | MAR: Masked Autoencoders for Efficient Action Recognition |
| ATM | 74.6 | 94.4 | What Can Simple Arithmetic Operations Do for Temporal Modeling? |
| MAWS (ViT-L) | 74.4 | - | The effectiveness of MAE pre-pretraining for billion-scale pretraining |
| VideoMAE (no extra data, ViT-L, 16 frames) | 74.3 | 94.6 | VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training |
| MAR (75% mask, ViT-L, 16x4) | 73.8 | 94.4 | MAR: Masked Autoencoders for Efficient Action Recognition |
| ViC-MAE (ViT-L) | 73.7 | - | ViC-MAE: Self-Supervised Representation Learning from Images and Video with Contrastive Masked Autoencoders |
| MVD (Kinetics400 pretrain, ViT-B, 16 frames) | 73.7 | 94.0 | Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning |
| TAdaFormer-L/14 | 73.6 | - | Temporally-Adaptive Models for Efficient Video Understanding |
| TDS-CLIP-ViT-L/14 (8 frames) | 73.4 | 93.8 | TDS-CLIP: Temporal Difference Side Network for Image-to-Video Transfer Learning |
| AMD (ViT-B/16) | 73.3 | 94.0 | Asymmetric Masked Distillation for Pre-Training Small Foundation Models |
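The two metrics reported above, Top-1 and Top-5 accuracy, are the standard classification measures: a prediction counts as correct if the ground-truth action class is the model's single highest-scoring class (Top-1) or among its five highest-scoring classes (Top-5). A minimal sketch of how these are typically computed from raw model scores (the `topk_accuracy` helper and the toy data are illustrative, not part of the HyperAI benchmark code):

```python
import numpy as np

def topk_accuracy(logits, labels, k=1):
    """Percentage of samples whose true label is among the k highest-scoring classes."""
    # logits: (N, num_classes) model scores; labels: (N,) ground-truth class ids
    topk = np.argsort(logits, axis=1)[:, -k:]      # indices of the k highest scores
    hits = (topk == labels[:, None]).any(axis=1)   # True where the label is in the top-k
    return hits.mean() * 100                       # percentage, as in the table above

# Toy example: 3 video clips, 5 action classes (values are made up)
logits = np.array([[0.1, 0.6, 0.1, 0.1, 0.1],
                   [0.3, 0.2, 0.4, 0.05, 0.05],
                   [0.2, 0.2, 0.2, 0.3, 0.1]])
labels = np.array([1, 0, 3])

print(topk_accuracy(logits, labels, k=1))  # clips 1 and 3 are correct -> 66.67
print(topk_accuracy(logits, labels, k=5))  # with only 5 classes, top-5 is trivially 100.0
```

Leaderboard numbers are usually averaged over multiple temporal clips and spatial crops per video before this accuracy is taken, so scores are not directly comparable across different evaluation protocols.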
The full leaderboard contains 122 results; the 20 highest-ranked entries are shown above.