Action Classification On Kinetics 600

Evaluation Metrics

Top-1 Accuracy
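
Top-1 accuracy is the percentage of test clips whose highest-scoring predicted class matches the ground-truth label. The sketch below illustrates the computation, assuming per-clip class scores are already available. The function name `top1_accuracy` and the toy data are hypothetical and are not part of any benchmark tooling.

```python
from typing import Sequence


def top1_accuracy(scores: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Return the percentage of clips whose top-scoring class equals the true label."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("at least one example is required")

    correct = 0
    for row, label in zip(scores, labels):
        # Index of the highest score is the predicted class for this clip.
        predicted = max(range(len(row)), key=row.__getitem__)
        if predicted == label:
            correct += 1
    return 100.0 * correct / len(labels)


# Toy data (hypothetical): 4 clips, 3 classes.
scores = [
    [0.1, 0.7, 0.2],  # predicts class 1
    [0.8, 0.1, 0.1],  # predicts class 0
    [0.3, 0.3, 0.4],  # predicts class 2
    [0.2, 0.5, 0.3],  # predicts class 1
]
labels = [1, 0, 1, 1]

print(top1_accuracy(scores, labels))  # 75.0
```

The values in the results table are reported on this same 0 to 100 scale.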

Evaluation Results

Performance results of each model on this benchmark

Model	Top-1 Accuracy	Paper Title
InternVideo2-6B	91.9	InternVideo2: Scaling Foundation Models for Multimodal Video Understanding
TubeVit-H	91.8	Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning
InternVideo2-1B	91.6	InternVideo2: Scaling Foundation Models for Multimodal Video Understanding
TubeVit-L	91.5	Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning
InternVideo-T	91.3	InternVideo: General Video Foundation Models via Generative and Discriminative Learning
Model 45	91.1	MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound
TubeVit-B	90.9	Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning
UMT-L (ViT-L/16)	90.5	Unmasked Teacher: Towards Training-Efficient Video Foundation Models
MTV-H (WTS 60M)	90.3	Multiview Transformers for Video Recognition
UniFormerV2-L	90.1	UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer
VideoMAE V2-g (64x266x266)	89.9	VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking
mPLUG-2	89.8	mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video
EVA	89.8	EVA: Exploring the Limits of Masked Visual Representation Learning at Scale
Model 11	89.7	MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound
CoCa (finetuned)	89.4	CoCa: Contrastive Captioners are Image-Text Foundation Models
Model 55	89.4	MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound
VideoMAE V2-g	88.8	VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking
Hiera-H (no extra data)	88.8	Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles
CoCa (frozen)	88.5	CoCa: Contrastive Captioners are Image-Text Foundation Models
X-CLIP(ViT-L/14, CLIP)	88.3	Expanding Language-Image Pretrained Models for General Video Recognition
