HyperAI

Action Recognition On Epic Kitchens 100

Metrics

Action@1
GFLOPs
Noun@1
Verb@1
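
The metric names above are not defined on this page. As a rough sketch, assuming the common convention that `@1` denotes top-1 accuracy in percent and that an action counts as correct only when both its verb and its noun are predicted correctly, the three accuracy metrics could be computed as follows. The function name `top1_accuracies` and the toy labels are hypothetical and used for illustration only; the benchmark's official evaluation code may differ.

```python
from typing import Dict, Sequence, Tuple


def top1_accuracies(
    pred: Sequence[Tuple[int, int]],
    true: Sequence[Tuple[int, int]],
) -> Dict[str, float]:
    """Return Verb@1, Noun@1 and Action@1 as percentages.

    Each item is a (verb_id, noun_id) pair. `pred` holds one top-ranked
    prediction per clip and `true` the ground-truth labels.
    Assumption: an action is correct only if verb AND noun both match.
    """
    if len(pred) != len(true) or len(true) == 0:
        raise ValueError("pred and true must be non-empty and equal in length")

    n = len(true)
    verb_hits = sum(p[0] == t[0] for p, t in zip(pred, true))
    noun_hits = sum(p[1] == t[1] for p, t in zip(pred, true))
    action_hits = sum(p[0] == t[0] and p[1] == t[1] for p, t in zip(pred, true))

    return {
        "Verb@1": 100.0 * verb_hits / n,
        "Noun@1": 100.0 * noun_hits / n,
        "Action@1": 100.0 * action_hits / n,
    }


# Toy example with four clips (hypothetical label ids).
toy_pred = [(0, 1), (2, 3), (0, 5), (4, 4)]
toy_true = [(0, 1), (2, 9), (7, 5), (4, 4)]
scores = top1_accuracies(toy_pred, toy_true)
print(scores)
```

Under this assumption Action@1 can never exceed Verb@1 or Noun@1, which matches every row in the results on this page that reports all three values.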

Results

Performance results of various models on this benchmark

| Model name | Action@1 | GFLOPs | Noun@1 | Verb@1 | Paper Title | Repository |
|---|---|---|---|---|---|---|
| MoViNet-A5 | 44.5 | 74.9x1 | 55.1 | 69.1 | MoViNets: Mobile Video Networks for Efficient Video Recognition | |
| Avion (ViT-L) | 54.4 | - | 65.4 | 73.0 | Training a Large Video Model on a Single Machine in a Day | |
| MeMViT-24 | 48.4 | - | 60.3 | 71.4 | MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition | |
| SlowFast | 36.81 | - | - | - | Rescaling Egocentric Vision | |
| MoViNet-A2 | 41.2 | 7.59x1 | 52.3 | 67.1 | MoViNets: Mobile Video Networks for Efficient Video Recognition | |
| TSN | 33.57 | - | - | - | Rescaling Egocentric Vision | |
| GSF | 44.48 | - | 53.18 | 69.06 | Gate-Shift-Fuse for Video Action Recognition | |
| ViViT-L/16x2 Fact. encoder | 44.0 | - | 56.8 | 66.4 | ViViT: A Video Vision Transformer | |
| TAdaConvNeXtV2-S | 48.9 | - | 60.2 | 71.0 | Temporally-Adaptive Models for Efficient Video Understanding | |
| ORViT Mformer-L (ORViT blocks) | 45.7 | - | 58.7 | 68.4 | Object-Region Video Transformers | |
| CAST-B/16 | 49.3 | - | 60.9 | 72.5 | CAST: Cross-Attention in Space and Time for Video Action Recognition | |
| Mformer-HR | 44.5 | - | 58.5 | 67.0 | Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers | |
| TempAgg | 45.26 | - | 53.35 | 66 | Technical Report: Temporal Aggregate Representations | |
| LaViLa (TimeSformer-L) | 51 | - | 62.9 | 72 | Learning Video Representations from Large Language Models | |
| MMT | 47.8 | - | 61.0 | 70.1 | Multiscale Multimodal Transformer for Multimodal Action Recognition | - |
| MBT | 43.4 | - | 58 | 64.8 | Attention Bottlenecks for Multimodal Fusion | |
| Mformer-L | 44.1 | - | 57.6 | 67.1 | Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers | |
| Mformer | 43.1 | - | 56.5 | 66.7 | Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers | |
| M&M (WTS 60M) | 53.6 | - | 66.3 | 72.0 | M&M Mix: A Multimodal Multiview Transformer Ensemble | - |
| OMNIVORE (Swin-B, finetuned) | 49.9 | - | 61.7 | 69.5 | Omnivore: A Single Model for Many Visual Modalities | |