HyperAI

Action Classification On Kinetics 400

Metrics

Acc@1
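Acc@1 is top-1 accuracy: a video counts as correct only when the class receiving the highest predicted score matches the ground-truth label, and the table below reports this as a percentage over the evaluation set. A minimal sketch of the computation (function and variable names here are illustrative, not from any particular codebase):

```python
# Top-1 accuracy: a clip is correct only when the argmax of its
# per-class scores equals the ground-truth label index.
def top1_accuracy(scores, labels):
    """scores: list of per-class score lists; labels: ground-truth class indices."""
    correct = 0
    for clip_scores, label in zip(scores, labels):
        # Index of the highest-scoring class for this clip.
        predicted = max(range(len(clip_scores)), key=clip_scores.__getitem__)
        if predicted == label:
            correct += 1
    return 100.0 * correct / len(labels)

# Three clips over four classes; two predictions match their labels.
scores = [[0.1, 0.7, 0.1, 0.1],
          [0.3, 0.2, 0.4, 0.1],
          [0.6, 0.1, 0.2, 0.1]]
labels = [1, 2, 1]
print(round(top1_accuracy(scores, labels), 2))  # two of three correct
```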

Results

Performance results of various models on this benchmark.

Comparison Table
Model Name | Acc@1
omnivec-learning-robust-representations-with | 91.1
x3d-expanding-architectures-for-efficient | 77.5
video-transformer-network | 79.8
multiscale-vision-transformers | 80.2
multiview-transformers-for-video-recognition | 89.9
adamae-adaptive-masking-for-efficient | 81.7
visual-representation-learning-from-unlabeled | 85.1
movinets-mobile-video-networks-for-efficient | 80.5
slowfast-networks-for-video-recognition | 78.9
a-closer-look-at-spatiotemporal-convolutions | 74.3
expanding-language-image-pretrained-models | 87.7
video-classification-with-channel-separated | 82.5
mars-motion-augmented-rgb-stream-for-action | 74.9
tokenlearner-what-can-8-learned-tokens-do-for | 85.4
videomamba-state-space-model-for-efficient | 85.0
tada-temporally-adaptive-convolutions-for-1 | 79.1
mar-masked-autoencoders-for-efficient-action | 81.0
video-swin-transformer | 80.6
omnivore-a-single-model-for-many-visual | 84.0
rethinking-spatiotemporal-feature-learning | 68
co-training-transformer-with-videos-and | 87.2
d3d-distilled-3d-networks-for-video-action | 75.9
what-makes-training-multi-modal-networks-hard | 78.9
continual-3d-convolutional-neural-networks | 59.58
implicit-temporal-modeling-with-learnable | 82.4
attention-bottlenecks-for-multimodal-fusion | 80.8
mlp-3d-a-mlp-like-3d-architecture-with-1 | 81.4
video-classification-with-channel-separated | 81.3
tada-temporally-adaptive-convolutions-for-1 | 76.7
self-supervised-video-transformer | 78.1
direcformer-a-directed-attention-in | 82.75
mars-motion-augmented-rgb-stream-for-action | 68.9
am-flow-adapters-for-temporal-processing-in | 89.6
continual-3d-convolutional-neural-networks | 67.06
victr-video-conditioned-text-representations | 87.0
transferring-textual-knowledge-for-visual | 87.8
rethinking-spatiotemporal-feature-learning | 77.2
videomae-masked-autoencoders-are-data-1 | 87.4
video-swin-transformer | 78.8
holistic-large-scale-video-understanding | 77.6
more-is-less-learning-efficient-video-1 | 73.5
spatiotemporal-self-attention-modeling-with | 82.5
scaling-vision-transformers-to-22-billion | 88.0
vidtr-video-transformer-without-convolutions | 79.7
enhancing-video-transformers-for-action | 93.4
continual-3d-convolutional-neural-networks | 64.71
coca-contrastive-captioners-are-image-text | 88.0
an-image-is-worth-16x16-words-what-is-a-video | 80.5
continual-3d-convolutional-neural-networks | 71.61
d3d-distilled-3d-networks-for-video-action | 76.5
omni-sourced-webly-supervised-learning-for | 80.4
asymmetric-masked-distillation-for-pre | 82.2
slowfast-networks-for-video-recognition | -
collaborative-spatiotemporal-feature-learning | 77.5
continual-3d-convolutional-neural-networks | 67.42
faster-recurrent-networks-for-video | 71.7
video-swin-transformer | 82.7
motionsqueeze-neural-motion-feature-learning | 76.4
temporal-shift-module-for-efficient-video | 74.7
is-space-time-attention-all-you-need-for | 80.7
continual-3d-convolutional-neural-networks | 67.33
mar-masked-autoencoders-for-efficient-action | 83.9
unmasked-teacher-towards-training-efficient | 90.6
multiscale-vision-transformers | 78.4
tdn-temporal-difference-networks-for | 79.4
is-space-time-attention-all-you-need-for | 79.7
video-swin-transformer | 83.1
x3d-expanding-architectures-for-efficient | 80.4
masked-feature-prediction-for-self-supervised | 87.0
tada-temporally-adaptive-convolutions-for-1 | 77.4
a-closer-look-at-spatiotemporal-convolutions | 72
implicit-temporal-modeling-with-learnable | 85.7
mar-masked-autoencoders-for-efficient-action | 79.4
vidtr-video-transformer-without-convolutions | 79.4
movinets-mobile-video-networks-for-efficient | 81.5
stand-alone-inter-frame-attention-in-video-1 | 83.1
cast-cross-attention-in-space-and-time-for-1 | 85.3
convnet-architecture-search-for | 73.9
vatt-transformers-for-multimodal-self | 82.1
video-classification-with-finecoarse-networks | 77.3
multiscale-vision-transformers | 81.2
learning-spatio-temporal-representation-with-3 | 72.3
video-swin-transformer | 80.6
vimpac-video-pre-training-via-masked-token | 77.4
one-peace-exploring-one-general | 88.1
learning-correlation-structures-for-vision | 83.4
a-closer-look-at-spatiotemporal-convolutions | 73.9
coca-contrastive-captioners-are-image-text | 88.9
an-image-is-worth-16x16-words-what-is-a-video | 79.3
videomae-masked-autoencoders-are-data-1 | 86.6
revisiting-3d-resnets-for-video-recognition | 80.4
video-classification-with-channel-separated | 77.8
continual-3d-convolutional-neural-networks | 65.90
internvideo2-scaling-video-foundation-models | 91.6
rethinking-video-vits-sparse-video-tubes-for | 90.2
actionclip-a-new-paradigm-for-video-action | 83.8
mar-masked-autoencoders-for-efficient-action | 85.3
uniformerv2-spatiotemporal-learning-by-arming | 90.0
tada-temporally-adaptive-convolutions-for-1 | 78.2
asymmetric-masked-distillation-for-pre | 80.1
temporally-adaptive-models-for-efficient | 86.4
masked-video-distillation-rethinking-masked | 83.4
continual-3d-convolutional-neural-networks | 69.29
region-based-non-local-operation-for-video | 77.4
omni-sourced-webly-supervised-learning-for | 80.5
internvideo2-scaling-video-foundation-models | 92.1
2103-15691 | -
unmasked-teacher-towards-training-efficient | 90.6
masked-video-distillation-rethinking-masked | 87.2
continual-3d-convolutional-neural-networks | 53.40
omni-sourced-webly-supervised-learning-for | 83.6
masked-video-distillation-rethinking-masked | 81.0
learning-spatio-temporal-representation-with-3 | 79.4
implicit-temporal-modeling-with-learnable | 88.7
masked-feature-prediction-for-self-supervised | 86.7
slowfast-networks-for-video-recognition | 75.6
internvideo-general-video-foundation-models | 91.1
uniformer-unified-transformer-for-efficient | 82.9
video-modeling-with-correlation-networks | 79.2
continual-3d-convolutional-neural-networks | 60.18
movinets-mobile-video-networks-for-efficient | 72.7
mvfnet-multi-view-fusion-network-for | 79.1
a-closer-look-at-spatiotemporal-convolutions | 72
video-transformer-network | -
slowfast-networks-for-video-recognition | 77.9
video-classification-with-channel-separated | 82.6
omnivore-a-single-model-for-many-visual | 84.1
x3d-expanding-architectures-for-efficient | 76
190600550 | 76.1
2103-15691 | 84.9
keeping-your-eye-on-the-ball-trajectory | 81.1
videomae-masked-autoencoders-are-data-1 | 81.5
movinets-mobile-video-networks-for-efficient | 75.0
aim-adapting-image-models-for-efficient-video | 87.5
a2-nets-double-attention-networks | 74.6
swin-transformer-v2-scaling-up-capacity-and | 86.8
slowfast-networks-for-video-recognition | 79.8
non-local-neural-networks | 77.7
hiera-a-hierarchical-vision-transformer | 87.8
co-training-transformer-with-videos-and | 86.3
learning-spatio-temporal-representation-with-3 | 81.2
side4video-spatial-temporal-side-network-for | 88.6
temporal-segment-networks-towards-good | 73.9
videomae-masked-autoencoders-are-data-1 | 86.1
video-transformer-network | -
what-can-simple-arithmetic-operations-do-for | 89.4
continual-3d-convolutional-neural-networks | 63.98
revisiting-the-effectiveness-of-off-the-shelf | 73.0
what-makes-training-multi-modal-networks-hard | 77.7
continual-3d-convolutional-neural-networks | 63.03
continual-3d-convolutional-neural-networks | 56.86
videomae-masked-autoencoders-are-data-1 | 85.2
continual-3d-convolutional-neural-networks | 73.05
rethinking-video-vits-sparse-video-tubes-for | 90.9
movinets-mobile-video-networks-for-efficient | 80.9
rethinking-spatiotemporal-feature-learning | 74.7
improved-multiscale-vision-transformers-for | 86.1
movinets-mobile-video-networks-for-efficient | 78.2
mplug-2-a-modularized-multi-modal-foundation | 87.1
continual-3d-convolutional-neural-networks | 59.37
stm-spatiotemporal-and-motion-encoding-for | 73.7
dual-path-adaptation-from-image-to-video | 87.7
parameter-efficient-image-to-video-transfer | 87.2
masked-video-distillation-rethinking-masked | 86.4
190807625 | 78.8
omnivl-one-foundation-model-for-image | 79.1
a-closer-look-at-spatiotemporal-convolutions | 67.5
omnivec2-a-novel-transformer-based-network | 93.6
quo-vadis-action-recognition-a-new-model-and | 71.1
multiscale-vision-transformers | 76
two-stream-video-classification-with-cross | 75.98
faster-recurrent-networks-for-video | 75.1
is-space-time-attention-all-you-need-for | 78
rethinking-video-vits-sparse-video-tubes-for | 88.6
video-classification-with-channel-separated | 79.2
representation-flow-for-action-recognition | 77.9
temporally-adaptive-models-for-efficient | 89.9
continual-3d-convolutional-neural-networks | 71.03
eva-exploring-the-limits-of-masked-visual | 89.7
evolving-space-time-neural-architectures-for | 77.4
large-scale-weakly-supervised-pre-training | 82.8
slowfast-networks-for-video-recognition | 77
x3d-expanding-architectures-for-efficient | 79.1
continual-3d-convolutional-neural-networks | 59.52
appearance-and-relation-networks-for-video | 72.4
continual-3d-convolutional-neural-networks | 53.52
a-closer-look-at-spatiotemporal-convolutions | 75.4
vidtr-video-transformer-without-convolutions | 80.5
continual-3d-convolutional-neural-networks | 62.80
videomae-v2-scaling-video-masked-autoencoders | 88.5
bidirectional-cross-modal-knowledge | 88.7
videomae-v2-scaling-video-masked-autoencoders | 90.0
video-swin-transformer | 84.9
video-transformer-network | 78.6
multi-fiber-networks-for-video-recognition | 72.8
movinets-mobile-video-networks-for-efficient | 65.8
ct-net-channel-tensorization-network-for-1 | 79.8
continual-3d-convolutional-neural-networks | 67.24
continual-3d-convolutional-neural-networks | 68.45
zeroi2v-zero-cost-adaptation-of-pre-trained | 87.2
frozen-clip-models-are-efficient-video | 87.7
drop-an-octave-reducing-spatial-redundancy-in | 75.7
dual-path-adaptation-from-image-to-video | 85.4
improved-multiscale-vision-transformers-for | -