HyperAI
Action Recognition On Epic Kitchens 100
Metrics
Action@1
GFLOPs
Noun@1
Verb@1
Results
Performance results of various models on this benchmark
| Model Name | Action@1 | GFLOPs | Noun@1 | Verb@1 | Paper Title |
|---|---|---|---|---|---|
| Avion (ViT-L) | 54.4 | - | 65.4 | 73.0 | Training a Large Video Model on a Single Machine in a Day |
| M&M (WTS 60M) | 53.6 | - | 66.3 | 72.0 | M&M Mix: A Multimodal Multiview Transformer Ensemble |
| LVMAE | 52.1 | - | 61.8 | 75.0 | Extending Video Masked Autoencoders to 128 frames |
| TAdaFormer-L/14 | 51.8 | - | 64.1 | 71.7 | Temporally-Adaptive Models for Efficient Video Understanding |
| LaViLa (TimeSformer-L) | 51 | - | 62.9 | 72 | Learning Video Representations from Large Language Models |
| MTV-B (WTS 60M) | 50.5 | - | 63.9 | 69.9 | Multiview Transformers for Video Recognition |
| OMNIVORE (Swin-B, finetuned) | 49.9 | - | 61.7 | 69.5 | Omnivore: A Single Model for Many Visual Modalities |
| CAST-B/16 | 49.3 | - | 60.9 | 72.5 | CAST: Cross-Attention in Space and Time for Video Action Recognition |
| TAdaConvNeXtV2-S | 48.9 | - | 60.2 | 71.0 | Temporally-Adaptive Models for Efficient Video Understanding |
| MeMViT-24 | 48.4 | - | 60.3 | 71.4 | MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition |
| MMT | 47.8 | - | 61.0 | 70.1 | Multiscale Multimodal Transformer for Multimodal Action Recognition |
| MoViNet-A6 | 47.7 | 117x1 | 57.3 | 72.2 | MoViNets: Mobile Video Networks for Efficient Video Recognition |
| AVT | 47.2 | - | 59.3 | 70.4 | AVT: Audio-Video Transformer for Multimodal Action Recognition |
| ORViT Mformer-L (ORViT blocks) | 45.7 | - | 58.7 | 68.4 | Object-Region Video Transformers |
| TempAgg | 45.26 | - | 53.35 | 66 | Technical Report: Temporal Aggregate Representations |
| MoViNet-A5 | 44.5 | 74.9x1 | 55.1 | 69.1 | MoViNets: Mobile Video Networks for Efficient Video Recognition |
| Mformer-HR | 44.5 | - | 58.5 | 67.0 | Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers |
| GSF | 44.48 | - | 53.18 | 69.06 | Gate-Shift-Fuse for Video Action Recognition |
| MoViNet-A4 | 44.4 | 42.2x1 | 56.2 | 68.8 | MoViNets: Mobile Video Networks for Efficient Video Recognition |
| Mformer-L | 44.1 | - | 57.6 | 67.1 | Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers |