Action Recognition On Epic Kitchens 100
Metrics
Action@1
GFLOPs
Noun@1
Verb@1
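The three accuracy metrics listed above can be illustrated with a minimal sketch. It assumes each clip has a single predicted verb class and a single predicted noun class, and that an action counts as correct only when both are correct. The function names and the toy labels are hypothetical and are not taken from any benchmark toolkit.

```python
from typing import Dict, Sequence


def top1_accuracy(pred: Sequence, true: Sequence) -> float:
    """Percentage of samples whose top-1 prediction equals the label."""
    if len(pred) != len(true) or len(true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(p == t for p, t in zip(pred, true))
    return 100.0 * hits / len(true)


def epic_top1(
    verb_pred: Sequence[int],
    verb_true: Sequence[int],
    noun_pred: Sequence[int],
    noun_true: Sequence[int],
) -> Dict[str, float]:
    """Return Verb@1, Noun@1 and Action@1 as percentages.

    Assumption: an action is the (verb, noun) pair, so it is counted
    as correct only when both parts match the ground truth.
    """
    action_pred = list(zip(verb_pred, noun_pred))
    action_true = list(zip(verb_true, noun_true))
    return {
        "Verb@1": top1_accuracy(verb_pred, verb_true),
        "Noun@1": top1_accuracy(noun_pred, noun_true),
        "Action@1": top1_accuracy(action_pred, action_true),
    }


# Toy class indices, invented for illustration only.
scores = epic_top1(
    verb_pred=[0, 1, 2, 2],
    verb_true=[0, 1, 2, 3],
    noun_pred=[5, 6, 7, 8],
    noun_true=[5, 9, 7, 8],
)
print(scores)
```

Under this pairing assumption Action@1 can never exceed Verb@1 or Noun@1, which agrees with every row in the results where all three values are reported.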
Results
Performance results of various models on this benchmark
| Model | Action@1 | GFLOPs | Noun@1 | Verb@1 | Paper Title | Repository |
| --- | --- | --- | --- | --- | --- | --- |
| MoViNet-A5 | 44.5 | 74.9x1 | 55.1 | 69.1 | MoViNets: Mobile Video Networks for Efficient Video Recognition | |
| Avion (ViT-L) | 54.4 | - | 65.4 | 73.0 | Training a Large Video Model on a Single Machine in a Day | |
| MeMViT-24 | 48.4 | - | 60.3 | 71.4 | MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition | |
| SlowFast | 36.81 | - | - | - | Rescaling Egocentric Vision | |
| MoViNet-A2 | 41.2 | 7.59x1 | 52.3 | 67.1 | MoViNets: Mobile Video Networks for Efficient Video Recognition | |
| TSN | 33.57 | - | - | - | Rescaling Egocentric Vision | |
| GSF | 44.48 | - | 53.18 | 69.06 | Gate-Shift-Fuse for Video Action Recognition | |
| ViViT-L/16x2 Fact. encoder | 44.0 | - | 56.8 | 66.4 | ViViT: A Video Vision Transformer | |
| TAdaConvNeXtV2-S | 48.9 | - | 60.2 | 71.0 | Temporally-Adaptive Models for Efficient Video Understanding | |
| ORViT Mformer-L (ORViT blocks) | 45.7 | - | 58.7 | 68.4 | Object-Region Video Transformers | |
| CAST-B/16 | 49.3 | - | 60.9 | 72.5 | CAST: Cross-Attention in Space and Time for Video Action Recognition | |
| Mformer-HR | 44.5 | - | 58.5 | 67.0 | Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers | |
| TempAgg | 45.26 | - | 53.35 | 66 | Technical Report: Temporal Aggregate Representations | |
| LaViLa (TimeSformer-L) | 51 | - | 62.9 | 72 | Learning Video Representations from Large Language Models | |
| MMT | 47.8 | - | 61.0 | 70.1 | Multiscale Multimodal Transformer for Multimodal Action Recognition | - |
| MBT | 43.4 | - | 58 | 64.8 | Attention Bottlenecks for Multimodal Fusion | |
| Mformer-L | 44.1 | - | 57.6 | 67.1 | Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers | |
| Mformer | 43.1 | - | 56.5 | 66.7 | Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers | |
| M&M (WTS 60M) | 53.6 | - | 66.3 | 72.0 | M&M Mix: A Multimodal Multiview Transformer Ensemble | - |
| OMNIVORE (Swin-B, finetuned) | 49.9 | - | 61.7 | 69.5 | Omnivore: A Single Model for Many Visual Modalities | |