HyperAI
Action Recognition In Videos
Action Recognition on Diving-48
Metric: Accuracy
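The accuracy reported on this benchmark is top-1 classification accuracy: the percentage of test videos whose predicted class matches the ground-truth dive class. As a minimal sketch (the predictions and labels below are hypothetical, not taken from any model on the leaderboard):

```python
def top1_accuracy(predictions, labels):
    """Percentage of videos whose predicted class matches the ground truth."""
    assert len(predictions) == len(labels) and len(labels) > 0
    correct = sum(p == y for p, y in zip(predictions, labels))
    return 100.0 * correct / len(labels)

# Hypothetical per-video predicted vs. ground-truth class indices
preds  = [7, 12, 3, 44, 7]
labels = [7, 12, 5, 44, 7]
print(top1_accuracy(preds, labels))  # 80.0
```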
Results
Performance results of various models on this benchmark
| Model Name | Accuracy | Paper Title | Repository |
|---|---|---|---|
| ORViT TimeSformer | 88.0 | Object-Region Video Transformers | |
| VIMPAC | 85.5 | VIMPAC: Video Pre-Training via Masked Token Prediction and Contrastive Learning | |
| SlowFast | 77.6 | SlowFast Networks for Video Recognition | |
| LVMAE | 94.9 | Extending Video Masked Autoencoders to 128 frames | - |
| StructVit-B-4-1 | 88.3 | Learning Correlation Structures for Vision Transformers | - |
| TimeSformer | 75.0 | Is Space-Time Attention All You Need for Video Understanding? | |
| DUALPATH | 88.7 | Dual-path Adaptation from Image to Video Transformers | |
| TimeSformer-HR | 78.0 | Is Space-Time Attention All You Need for Video Understanding? | |
| TFCNet | 88.3 | TFCNet: Temporal Fully Connected Networks for Static Unbiased Temporal Reasoning | - |
| Video-FocalNet-B | 90.8 | Video-FocalNets: Spatio-Temporal Focal Modulation for Video Action Recognition | |
| AIM (CLIP ViT-L/14, 32x224) | 90.6 | AIM: Adapting Image Models for Efficient Video Action Recognition | |
| RSANet-R50 (16 frames, ImageNet pretrained, a single clip) | 84.2 | Relational Self-Attention: What's Missing in Attention for Video Understanding | |
| GC-TDN | 87.6 | Group Contextualization for Video Recognition | |
| BEVT | 86.7 | BEVT: BERT Pretraining of Video Transformers | |
| PMI Sampler | 81.3 | PMI Sampler: Patch Similarity Guided Frame Selection for Aerial Action Recognition | |
| TQN | 81.8 | Temporal Query Networks for Fine-grained Video Understanding | - |
| TimeSformer-L | 81.0 | Is Space-Time Attention All You Need for Video Understanding? | |
| PSB | 86.0 | Spatiotemporal Self-attention Modeling with Temporal Patch Shift for Action Recognition | |