# Action Classification on Kinetics-700
Metrics: Top-1 Accuracy, Top-5 Accuracy

Performance results of various models on this benchmark ("-" indicates the metric was not reported):

| Model | Top-1 Accuracy | Top-5 Accuracy | Paper |
| --- | --- | --- | --- |
| SRTG r(2+1)d-34 | 49.43 | 73.23 | Learn to cycle: Time-consistent feature discovery for action recognition |
| MViTv2-B | 76.6 | 93.2 | MViTv2: Improved Multiscale Vision Transformers for Classification and Detection |
| SRTG r3d-50 | 53.52 | 74.17 | Learn to cycle: Time-consistent feature discovery for action recognition |
| MoViNet-A1 | 63.5 | - | MoViNets: Mobile Video Networks for Efficient Video Recognition |
| MoViNet-A2 | 66.7 | - | MoViNets: Mobile Video Networks for Efficient Video Recognition |
| MoViNet-A3 | 68.0 | - | MoViNets: Mobile Video Networks for Efficient Video Recognition |
| VidTr-M | 69.5 | 88.3 | VidTr: Video Transformer Without Convolutions |
| InternVideo-T | 84.0 | - | InternVideo: General Video Foundation Models via Generative and Discriminative Learning |
| EVA | 82.9 | - | EVA: Exploring the Limits of Masked Visual Representation Learning at Scale |
| MoViNet-A4 | 70.7 | - | MoViNets: Mobile Video Networks for Efficient Video Recognition |
| UniFormerV2-L | 82.7 | 96.2 | UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer |
| MaskFeat (no extra data, MViT-L) | 80.4 | 95.7 | Masked Feature Prediction for Self-Supervised Visual Pre-Training |
| InternVideo2-1B | 85.4 | - | InternVideo2: Scaling Foundation Models for Multimodal Video Understanding |
| SRTG r3d-101 | 56.46 | 76.82 | Learn to cycle: Time-consistent feature discovery for action recognition |
| VidTr-L | 70.2 | 89.0 | VidTr: Video Transformer Without Convolutions |
| SRTG r3d-34 | 49.15 | 72.68 | Learn to cycle: Time-consistent feature discovery for action recognition |
| mPLUG-2 | 80.4 | 94.9 | mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video |
| UMT-L (ViT-L/16) | 83.6 | 96.7 | Unmasked Teacher: Towards Training-Efficient Video Foundation Models |
| CoVeR (JFT-3B) | 79.8 | 94.9 | Co-training Transformer with Videos and Images Improves Action Recognition |
| MViTv2-L (ImageNet-21k pretrain) | 79.4 | 94.9 | MViTv2: Improved Multiscale Vision Transformers for Classification and Detection |
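For reference, the two metrics in the table follow the standard definition: Top-k accuracy is the fraction of clips whose ground-truth class appears among the model's k highest-scoring classes. A minimal sketch in plain Python (the `topk_accuracy` helper and the toy scores below are illustrative, not part of the benchmark):

```python
def topk_accuracy(scores, labels, k=1):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    correct = 0
    for row, label in zip(scores, labels):
        # Indices of the k largest scores for this sample.
        topk = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        correct += label in topk
    return correct / len(labels)

# Toy example: 3 clips, 4 classes (Kinetics-700 has 700 classes, so k=5 there).
scores = [
    [0.1, 0.6, 0.2, 0.1],  # highest score at class 1
    [0.5, 0.1, 0.3, 0.1],  # highest score at class 0
    [0.2, 0.2, 0.5, 0.1],  # highest score at class 2
]
labels = [1, 2, 2]

top1 = topk_accuracy(scores, labels, k=1)  # 2 of 3 correct
top2 = topk_accuracy(scores, labels, k=2)  # all 3 correct
```

Top-1 is thus always a lower bound on Top-5, which is why the two columns differ most for weaker models.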