HyperAI
Action Classification On Kinetics 600
Evaluation Metric
Top-1 Accuracy
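For readers unfamiliar with the metric, the following is a minimal sketch of how Top-1 accuracy is usually computed: the share of examples whose highest-scoring class matches the ground-truth label, expressed as a percentage. The function name `top1_accuracy` and the toy scores are illustrative assumptions; this is not the evaluation code used by any of the listed papers, which may also average predictions over multiple clips or crops per video.

```python
from typing import Sequence


def top1_accuracy(scores: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Return the percentage of examples whose highest-scoring class equals the label."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("at least one example is required")
    correct = 0
    for row, label in zip(scores, labels):
        # Index of the largest score in this row (the Top-1 prediction).
        predicted = max(range(len(row)), key=row.__getitem__)
        if predicted == label:
            correct += 1
    return 100.0 * correct / len(labels)


# Toy example: 4 clips, 3 classes (hypothetical scores, not benchmark data).
toy_scores = [
    [0.1, 0.7, 0.2],  # predicted class 1
    [0.8, 0.1, 0.1],  # predicted class 0
    [0.2, 0.3, 0.5],  # predicted class 2
    [0.6, 0.3, 0.1],  # predicted class 0
]
toy_labels = [1, 0, 1, 0]
print(top1_accuracy(toy_scores, toy_labels))  # 3 of 4 correct -> 75.0
```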
Evaluation Results
Performance of each model on this benchmark
| Model Name | Top-1 Accuracy (%) | Paper Title |
| --- | --- | --- |
| InternVideo2-6B | 91.9 | InternVideo2: Scaling Foundation Models for Multimodal Video Understanding |
| TubeVit-H | 91.8 | Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning |
| InternVideo2-1B | 91.6 | InternVideo2: Scaling Foundation Models for Multimodal Video Understanding |
| TubeVit-L | 91.5 | Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning |
| InternVideo-T | 91.3 | InternVideo: General Video Foundation Models via Generative and Discriminative Learning |
| Model 45 | 91.1 | MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound |
| TubeVit-B | 90.9 | Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning |
| UMT-L (ViT-L/16) | 90.5 | Unmasked Teacher: Towards Training-Efficient Video Foundation Models |
| MTV-H (WTS 60M) | 90.3 | Multiview Transformers for Video Recognition |
| UniFormerV2-L | 90.1 | UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer |
| VideoMAE V2-g (64x266x266) | 89.9 | VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking |
| mPLUG-2 | 89.8 | mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video |
| EVA | 89.8 | EVA: Exploring the Limits of Masked Visual Representation Learning at Scale |
| Model 11 | 89.7 | MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound |
| CoCa (finetuned) | 89.4 | CoCa: Contrastive Captioners are Image-Text Foundation Models |
| Model 55 | 89.4 | MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound |
| VideoMAE V2-g | 88.8 | VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking |
| Hiera-H (no extra data) | 88.8 | Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles |
| CoCa (frozen) | 88.5 | CoCa: Contrastive Captioners are Image-Text Foundation Models |
| X-CLIP (ViT-L/14, CLIP) | 88.3 | Expanding Language-Image Pretrained Models for General Video Recognition |
Showing 20 of 65 entries.