Action Recognition On Epic Kitchens 100

Evaluation Metrics

Action@1

GFLOPs

Noun@1

Verb@1

Evaluation Results

Performance of each model on this benchmark

Model	Action@1	GFLOPs	Noun@1	Verb@1	Paper Title
Avion (ViT-L)	54.4	-	65.4	73.0	Training a Large Video Model on a Single Machine in a Day
M&M (WTS 60M)	53.6	-	66.3	72.0	M&M Mix: A Multimodal Multiview Transformer Ensemble
LVMAE	52.1	-	61.8	75.0	Extending Video Masked Autoencoders to 128 frames
TAdaFormer-L/14	51.8	-	64.1	71.7	Temporally-Adaptive Models for Efficient Video Understanding
LaViLa (TimeSformer-L)	51	-	62.9	72	Learning Video Representations from Large Language Models
MTV-B (WTS 60M)	50.5	-	63.9	69.9	Multiview Transformers for Video Recognition
OMNIVORE (Swin-B, finetuned)	49.9	-	61.7	69.5	Omnivore: A Single Model for Many Visual Modalities
CAST-B/16	49.3	-	60.9	72.5	CAST: Cross-Attention in Space and Time for Video Action Recognition
TAdaConvNeXtV2-S	48.9	-	60.2	71.0	Temporally-Adaptive Models for Efficient Video Understanding
MeMViT-24	48.4	-	60.3	71.4	MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition
MMT	47.8	-	61.0	70.1	Multiscale Multimodal Transformer for Multimodal Action Recognition
MoViNet-A6	47.7	117x1	57.3	72.2	MoViNets: Mobile Video Networks for Efficient Video Recognition
AVT	47.2	-	59.3	70.4	AVT: Audio-Video Transformer for Multimodal Action Recognition
ORViT Mformer-L (ORViT blocks)	45.7	-	58.7	68.4	Object-Region Video Transformers
TempAgg	45.26	-	53.35	66	Technical Report: Temporal Aggregate Representations
MoViNet-A5	44.5	74.9x1	55.1	69.1	MoViNets: Mobile Video Networks for Efficient Video Recognition
Mformer-HR	44.5	-	58.5	67.0	Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers
GSF	44.48	-	53.18	69.06	Gate-Shift-Fuse for Video Action Recognition
MoViNet-A4	44.4	42.2x1	56.2	68.8	MoViNets: Mobile Video Networks for Efficient Video Recognition
Mformer-L	44.1	-	57.6	67.1	Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers
