HyperAI超神経
Action Classification On Kinetics 700
Evaluation Metrics
Top-1 Accuracy
Top-5 Accuracy
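The page does not define the two metrics. As a minimal sketch, assuming the usual definition (a prediction counts as correct when the true class is among the k highest-scoring classes), top-k accuracy can be computed as below. The function name `top_k_accuracy` and the toy scores are illustrative assumptions and are not taken from any model's evaluation code.

```python
from typing import Sequence


def top_k_accuracy(
    scores: Sequence[Sequence[float]], labels: Sequence[int], k: int = 1
) -> float:
    """Percentage of samples whose true label is among the k highest-scoring classes."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not scores:
        raise ValueError("at least one sample is required")
    hits = 0
    for row, label in zip(scores, labels):
        # Indices of the k classes with the highest scores for this clip.
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        hits += label in top_k
    return 100.0 * hits / len(labels)


# Toy data (hypothetical): 4 clips, 6 classes. Kinetics-700 has 700 classes.
scores = [
    [0.10, 0.50, 0.20, 0.10, 0.05, 0.05],  # true class 1: ranked 1st
    [0.30, 0.25, 0.20, 0.15, 0.06, 0.04],  # true class 2: ranked 3rd
    [0.02, 0.08, 0.10, 0.20, 0.25, 0.35],  # true class 0: ranked 6th
    [0.04, 0.06, 0.10, 0.60, 0.15, 0.05],  # true class 3: ranked 1st
]
labels = [1, 2, 0, 3]

print(top_k_accuracy(scores, labels, k=1))  # 50.0
print(top_k_accuracy(scores, labels, k=5))  # 75.0
```

Top-5 accuracy is never lower than Top-1 on the same predictions, which matches the pattern in the results that follow.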
Evaluation Results
Performance results of each model on this benchmark
| Model | Top-1 Accuracy | Top-5 Accuracy | Paper Title |
| --- | --- | --- | --- |
| InternVideo2-6B | 85.9 | - | InternVideo2: Scaling Foundation Models for Multimodal Video Understanding |
| InternVideo2-1B | 85.4 | - | InternVideo2: Scaling Foundation Models for Multimodal Video Understanding |
| InternVideo-T | 84.0 | - | InternVideo: General Video Foundation Models via Generative and Discriminative Learning |
| TubeViT-L | 83.8 | 96.6 | Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning |
| UMT-L (ViT-L/16) | 83.6 | 96.7 | Unmasked Teacher: Towards Training-Efficient Video Foundation Models |
| MTV-H (WTS 60M) | 83.4 | 96.2 | Multiview Transformers for Video Recognition |
| EVA | 82.9 | - | EVA: Exploring the Limits of Masked Visual Representation Learning at Scale |
| UniFormerV2-L | 82.7 | 96.2 | UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer |
| CoCa (finetuned) | 82.7 | - | CoCa: Contrastive Captioners are Image-Text Foundation Models |
| CoCa (frozen) | 81.1 | - | CoCa: Contrastive Captioners are Image-Text Foundation Models |
| Hiera-H (no extra data) | 81.1 | - | Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles |
| MaskFeat (no extra data, MViT-L) | 80.4 | 95.7 | Masked Feature Prediction for Self-Supervised Visual Pre-Training |
| mPLUG-2 | 80.4 | 94.9 | mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video |
| AIM (CLIP ViT-L/14, 32x224) | 80.4 | - | AIM: Adapting Image Models for Efficient Video Action Recognition |
| CoVeR (JFT-3B) | 79.8 | 94.9 | Co-training Transformer with Videos and Images Improves Action Recognition |
| MViTv2-L (ImageNet-21k pretrain) | 79.4 | 94.9 | MViTv2: Improved Multiscale Vision Transformers for Classification and Detection |
| MoViNet-A6 | 79.4 | - | MViTv2: Improved Multiscale Vision Transformers for Classification and Detection |
| CoVeR (JFT-300M) | 78.5 | 94.2 | Co-training Transformer with Videos and Images Improves Action Recognition |
| MViTv2-B | 76.6 | 93.2 | MViTv2: Improved Multiscale Vision Transformers for Classification and Detection |
| MoViNet-A6 | 72.3 | - | MoViNets: Mobile Video Networks for Efficient Video Recognition |