Action Recognition In Videos On Something

Evaluation Metrics

Top-1 Accuracy

Top-5 Accuracy
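As commonly defined, Top-1 accuracy is the percentage of test clips whose highest-scoring predicted class matches the ground-truth label. Top-5 accuracy counts a clip as correct if the ground-truth label appears anywhere among the five highest-scoring classes. The sketch below is a minimal illustration, not the official evaluation code of this benchmark. It assumes each clip already has one score per class. The function name `top_k_accuracy` and the toy scores are hypothetical.

```python
from typing import Sequence


def top_k_accuracy(scores: Sequence[Sequence[float]], labels: Sequence[int], k: int) -> float:
    """Percentage of samples whose true label is among the k highest-scoring classes.

    Hypothetical helper for illustration; assumes one score per class for each clip.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not scores:
        raise ValueError("need at least one sample")
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted by descending score; ties go to the lower index.
        top_k = sorted(range(len(row)), key=lambda c: (-row[c], c))[:k]
        if label in top_k:
            hits += 1
    return 100.0 * hits / len(labels)


# Toy example (hypothetical data): 4 clips, 6 classes.
scores = [
    [0.10, 0.60, 0.05, 0.05, 0.10, 0.10],  # true class 1 is ranked 1st
    [0.30, 0.20, 0.25, 0.05, 0.10, 0.10],  # true class 2 is ranked 2nd
    [0.01, 0.05, 0.10, 0.40, 0.30, 0.14],  # true class 0 is ranked last (6th)
    [0.20, 0.15, 0.10, 0.05, 0.25, 0.25],  # true class 5 is ranked 2nd (tie with class 4)
]
labels = [1, 2, 0, 5]

print(top_k_accuracy(scores, labels, k=1))  # 25.0
print(top_k_accuracy(scores, labels, k=5))  # 75.0
```

Only the first clip is correct at k=1. At k=5, every clip except the third counts as correct, because its true class is ranked last of six. This is why a model's Top-5 score in the table below is always at least its Top-1 score.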

Evaluation Results

Performance of each model on this benchmark

Model	Top-1 Accuracy (%)	Top-5 Accuracy (%)	Paper Title	Repository
TRG (ResNet-50)	62.2	90.3	Temporal Reasoning Graph for Activity Recognition	-
MVFNet-ResNet50 (center crop, 8+16 ensemble, ImageNet pretrained, RGB only)	66.3	-	MVFNet: Multi-View Fusion Network for Efficient Video Recognition
Mformer-L	68.1	91.2	Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers
ViC-MAE (ViT-L)	73.7	-	ViC-MAE: Self-Supervised Representation Learning from Images and Video with Contrastive Masked Autoencoders
bLVNet	65.2	-	More Is Less: Learning Efficient Video Representations by Big-Little Network and Depthwise Temporal Aggregation
TPS	69.8	-	Spatiotemporal Self-attention Modeling with Temporal Patch Shift for Action Recognition
VideoMAE (no extra data, ViT-B, 16frame)	70.8	92.4	VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training
MSNet-R50En (8+16 ensemble, ImageNet pretrained)	66.6	90.6	MotionSqueeze: Neural Motion Feature Learning for Video Understanding
TAda2D (ResNet-50, 8 frames)	64.0	88.0	TAda! Temporally-Adaptive Convolutions for Video Understanding
VideoMAE (no extra data, ViT-L, 32x2)	75.4	95.2	VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training
PAN ResNet101 (RGB only, no Flow)	66.5	90.6	PAN: Towards Fast Action Recognition via Learning Persistence of Appearance
MoViNet-A0	61.3	88.2	MoViNets: Mobile Video Networks for Efficient Video Recognition
Mformer-HR	67.1	90.6	Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers
ORViT Mformer (ORViT blocks)	67.9	90.5	Object-Region Video Transformers
SVT	59.2	-	Self-supervised Video Transformer
STM + TRNMultiscale	47.73	-	Comparative Analysis of CNN-based Spatiotemporal Reasoning in Videos
TimeSformer-L	62.3	-	Is Space-Time Attention All You Need for Video Understanding?
MML (single)	66.83	91.30	Mutual Modality Learning for Video Action Classification
AMD(ViT-S/16)	70.2	92.5	Asymmetric Masked Distillation for Pre-Training Small Foundation Models	-
UniFormer-B (IN-1K + Kinetics400 pretrain)	71.2	92.8	UniFormer: Unified Transformer for Efficient Spatial-Temporal Representation Learning	-
