SlowFast | 36.81 | - | - | - | Rescaling Egocentric Vision | |
ViViT-L/16x2 Fact. encoder | 44.0 | - | 56.8 | 66.4 | ViViT: A Video Vision Transformer | |
ORViT Mformer-L (ORViT blocks) | 45.7 | - | 58.7 | 68.4 | Object-Region Video Transformers | |
OMNIVORE (Swin-B, finetuned) | 49.9 | - | 61.7 | 69.5 | Omnivore: A Single Model for Many Visual Modalities | |