HyperAI超神经

Action Classification On Kinetics 600

Evaluation Metric

Top-1 Accuracy
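
The page does not define the metric, so the following is a minimal sketch that assumes the standard definition: Top-1 accuracy is the percentage of test clips whose highest-scoring predicted class matches the ground-truth label. The function name `top1_accuracy` and the toy scores are hypothetical illustrations, not benchmark data or any model's actual evaluation code.

```python
from typing import Sequence


def top1_accuracy(scores: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Return the percentage of clips whose highest-scoring class equals the label."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("at least one sample is required")
    correct = 0
    for clip_scores, label in zip(scores, labels):
        # Argmax over classes; ties resolve to the lowest class index.
        predicted = max(range(len(clip_scores)), key=clip_scores.__getitem__)
        correct += int(predicted == label)
    return 100.0 * correct / len(labels)


# Toy example: 4 clips, 3 classes (hypothetical scores, not benchmark data).
example_scores = [
    [0.1, 0.7, 0.2],
    [0.8, 0.1, 0.1],
    [0.3, 0.3, 0.4],
    [0.2, 0.5, 0.3],
]
example_labels = [1, 0, 1, 1]
print(top1_accuracy(example_scores, example_labels))  # 75.0
```

Published results often average predictions over several temporal clips and spatial crops per video before taking the argmax, so reported numbers may not be directly comparable across evaluation protocols.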

Evaluation Results

Performance of each model on this benchmark

Comparison Table
Model                                           Top-1 Accuracy (%)
d3d-distilled-3d-networks-for-video-action      79.1
space-time-mixing-attention-for-video           84.5
movinets-mobile-video-networks-for-efficient    82.7
videomae-v2-scaling-video-masked-autoencoders   88.8
movinets-mobile-video-networks-for-efficient    77.5
perf-net-pose-empowered-rgb-flow-net            82.0
mplug-2-a-modularized-multi-modal-foundation    89.8
florence-a-new-foundation-model-for-computer    87.8
movinets-mobile-video-networks-for-efficient    83.5
rethinking-spatiotemporal-feature-learning      76.6
merlot-reserve-neural-script-knowledge          89.7
uniformer-unified-transformer-for-efficient     84.8
learning-spatio-temporal-representation-with-3  75
tokenlearner-what-can-8-learned-tokens-do-for   86.3
slowfast-networks-for-video-recognition         79.9
unmasked-teacher-towards-training-efficient     90.5
slowfast-networks-for-video-recognition         81.8
eva-exploring-the-limits-of-masked-visual       89.8
a-short-note-about-kinetics-600                 73.6
rethinking-video-vits-sparse-video-tubes-for    91.5
2103-15691                                      83.0
internvideo-general-video-foundation-models     91.3
rethinking-video-vits-sparse-video-tubes-for    90.9
coca-contrastive-captioners-are-image-text      89.4
movinets-mobile-video-networks-for-efficient    76.0
learning-spatio-temporal-representation-with-3  81.5
revisiting-3d-resnets-for-video-recognition     83.1
co-training-transformer-with-videos-and         86.8
improved-multiscale-vision-transformers-for     87.9
learning-spatio-temporal-representation-with-3  83.1
movinets-mobile-video-networks-for-efficient    81.2
improved-multiscale-vision-transformers-for     -
improved-multiscale-vision-transformers-for     85.5
slowfast-networks-for-video-recognition         81.1
movinets-mobile-video-networks-for-efficient    84.3
expanding-language-image-pretrained-models      88.3
d3d-distilled-3d-networks-for-video-action      77.9
slowfast-networks-for-video-recognition         80.4
movinets-mobile-video-networks-for-efficient    71.5
multiscale-vision-transformers                  82.1
slowfast-networks-for-video-recognition         78.8
internvideo2-scaling-video-foundation-models    91.6
videomae-v2-scaling-video-masked-autoencoders   89.9
coca-contrastive-captioners-are-image-text      88.5
merlot-reserve-neural-script-knowledge          91.1
2103-15691                                      85.8
hiera-a-hierarchical-vision-transformer         88.8
uniformerv2-spatiotemporal-learning-by-arming   90.1
2103-15691                                      84.3
multiview-transformers-for-video-recognition    90.3
vatt-transformers-for-multimodal-self           83.6
co-training-transformer-with-videos-and         87.9
multiscale-vision-transformers                  83.8
movinets-mobile-video-networks-for-efficient    80.8
merlot-reserve-neural-script-knowledge          89.4
video-swin-transformer                          86.1
multiscale-vision-transformers                  83.4
rethinking-spatiotemporal-feature-learning      78.6
internvideo2-scaling-video-foundation-models    91.9
rethinking-spatiotemporal-feature-learning      69.7
video-swin-transformer                          84.0
rethinking-video-vits-sparse-video-tubes-for    91.8
masked-feature-prediction-for-self-supervised   88.3
merlot-reserve-neural-script-knowledge          88.1
improved-multiscale-vision-transformers-for     -