Action Recognition In Videos On Something | SOTA | HyperAI

Task: Action Recognition
Metrics: Top-1 Accuracy, Top-5 Accuracy

Performance results of various models on this benchmark:

| Model Name | Top-1 Accuracy | Top-5 Accuracy | Paper Title |
| --- | --- | --- | --- |
| MVD (Kinetics400 pretrain, ViT-H, 16 frames) | 77.3 | 95.7 | Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning |
| InternVideo | 77.2 | - | InternVideo: General Video Foundation Models via Generative and Discriminative Learning |
| InternVideo2-1B | 77.1 | - | InternVideo2: Scaling Foundation Models for Multimodal Video Understanding |
| VideoMAE V2-g | 77.0 | 95.9 | VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking |
| MVD (Kinetics400 pretrain, ViT-L, 16 frames) | 76.7 | 95.5 | Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning |
| Hiera-L (no extra data) | 76.5 | - | Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles |
| TubeViT-L | 76.1 | 95.2 | Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning |
| VideoMAE (no extra data, ViT-L, 32x2) | 75.4 | 95.2 | VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training |
| Side4Video (EVA ViT-E/14) | 75.2 | 94.0 | Side4Video: Spatial-Temporal Side Network for Memory-Efficient Image-to-Video Transfer Learning |
| MaskFeat (Kinetics600 pretrain, MViT-L) | 75.0 | 95.0 | Masked Feature Prediction for Self-Supervised Visual Pre-Training |
| MAR (50% mask, ViT-L, 16x4) | 74.7 | 94.9 | MAR: Masked Autoencoders for Efficient Action Recognition |
| ATM | 74.6 | 94.4 | What Can Simple Arithmetic Operations Do for Temporal Modeling? |
| MAWS (ViT-L) | 74.4 | - | The effectiveness of MAE pre-pretraining for billion-scale pretraining |
| VideoMAE (no extra data, ViT-L, 16 frames) | 74.3 | 94.6 | VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training |
| MAR (75% mask, ViT-L, 16x4) | 73.8 | 94.4 | MAR: Masked Autoencoders for Efficient Action Recognition |
| ViC-MAE (ViT-L) | 73.7 | - | ViC-MAE: Self-Supervised Representation Learning from Images and Video with Contrastive Masked Autoencoders |
| MVD (Kinetics400 pretrain, ViT-B, 16 frames) | 73.7 | 94.0 | Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning |
| TAdaFormer-L/14 | 73.6 | - | Temporally-Adaptive Models for Efficient Video Understanding |
| TDS-CLIP-ViT-L/14 (8 frames) | 73.4 | 93.8 | TDS-CLIP: Temporal Difference Side Network for Image-to-Video Transfer Learning |
| AMD (ViT-B/16) | 73.3 | 94.0 | Asymmetric Masked Distillation for Pre-Training Small Foundation Models |
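The two metrics reported above, Top-1 and Top-5 accuracy, are the standard classification measures: a prediction counts as correct if the ground-truth action class is the model's single highest-scoring class (Top-1) or among its five highest-scoring classes (Top-5). A minimal sketch of how these are typically computed from raw model scores (the `topk_accuracy` helper and the toy data are illustrative, not part of the HyperAI benchmark code):

```python
import numpy as np

def topk_accuracy(logits, labels, k=1):
    """Percentage of samples whose true label is among the k highest-scoring classes."""
    # logits: (N, num_classes) model scores; labels: (N,) ground-truth class ids
    topk = np.argsort(logits, axis=1)[:, -k:]      # indices of the k highest scores
    hits = (topk == labels[:, None]).any(axis=1)   # True where the label is in the top-k
    return hits.mean() * 100                       # percentage, as in the table above

# Toy example: 3 video clips, 5 action classes (values are made up)
logits = np.array([[0.1, 0.6, 0.1, 0.1, 0.1],
                   [0.3, 0.2, 0.4, 0.05, 0.05],
                   [0.2, 0.2, 0.2, 0.3, 0.1]])
labels = np.array([1, 0, 3])

print(topk_accuracy(logits, labels, k=1))  # clips 1 and 3 are correct -> 66.67
print(topk_accuracy(logits, labels, k=5))  # with only 5 classes, top-5 is trivially 100.0
```

Leaderboard numbers are usually averaged over multiple temporal clips and spatial crops per video before this accuracy is taken, so scores are not directly comparable across different evaluation protocols.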
The full leaderboard contains 122 results; the 20 highest-ranked entries are shown above.