Action Classification on Kinetics-400
Evaluation Metric
Acc@1
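Acc@1 (top-1 accuracy) counts a clip as correctly classified only when the model's single highest-scoring class matches the ground-truth label. A minimal sketch of the computation (function name and toy data are illustrative, not from any listed paper):

```python
import numpy as np

def top1_accuracy(logits: np.ndarray, labels: np.ndarray) -> float:
    """Fraction of clips whose highest-scoring class equals the label."""
    preds = logits.argmax(axis=1)  # index of the top-scoring class per clip
    return float((preds == labels).mean())

# Toy example: 4 clips, 3 classes.
logits = np.array([
    [0.1, 0.7, 0.2],    # predicted class 1
    [0.9, 0.05, 0.05],  # predicted class 0
    [0.2, 0.3, 0.5],    # predicted class 2
    [0.6, 0.3, 0.1],    # predicted class 0 (label is 1 -> miss)
])
labels = np.array([1, 0, 2, 1])
print(top1_accuracy(logits, labels))  # 0.75
```

In practice, Kinetics-400 numbers are usually reported after averaging logits over multiple temporal clips and spatial crops per video, so two papers with the same backbone can differ slightly in Acc@1.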
Evaluation Results
Performance of each model on this benchmark
Comparison Table
Model Name | Acc@1 |
---|---|
omnivec-learning-robust-representations-with | 91.1 |
x3d-expanding-architectures-for-efficient | 77.5 |
video-transformer-network | 79.8 |
multiscale-vision-transformers | 80.2 |
multiview-transformers-for-video-recognition | 89.9 |
adamae-adaptive-masking-for-efficient | 81.7 |
visual-representation-learning-from-unlabeled | 85.1 |
movinets-mobile-video-networks-for-efficient | 80.5 |
slowfast-networks-for-video-recognition | 78.9 |
a-closer-look-at-spatiotemporal-convolutions | 74.3 |
expanding-language-image-pretrained-models | 87.7 |
video-classification-with-channel-separated | 82.5 |
mars-motion-augmented-rgb-stream-for-action | 74.9 |
tokenlearner-what-can-8-learned-tokens-do-for | 85.4 |
videomamba-state-space-model-for-efficient | 85.0 |
tada-temporally-adaptive-convolutions-for-1 | 79.1 |
mar-masked-autoencoders-for-efficient-action | 81.0 |
video-swin-transformer | 80.6 |
omnivore-a-single-model-for-many-visual | 84.0 |
rethinking-spatiotemporal-feature-learning | 68 |
co-training-transformer-with-videos-and | 87.2 |
d3d-distilled-3d-networks-for-video-action | 75.9 |
what-makes-training-multi-modal-networks-hard | 78.9 |
continual-3d-convolutional-neural-networks | 59.58 |
implicit-temporal-modeling-with-learnable | 82.4 |
attention-bottlenecks-for-multimodal-fusion | 80.8 |
mlp-3d-a-mlp-like-3d-architecture-with-1 | 81.4 |
video-classification-with-channel-separated | 81.3 |
tada-temporally-adaptive-convolutions-for-1 | 76.7 |
self-supervised-video-transformer | 78.1 |
direcformer-a-directed-attention-in | 82.75 |
mars-motion-augmented-rgb-stream-for-action | 68.9 |
am-flow-adapters-for-temporal-processing-in | 89.6 |
continual-3d-convolutional-neural-networks | 67.06 |
victr-video-conditioned-text-representations | 87.0 |
transferring-textual-knowledge-for-visual | 87.8 |
rethinking-spatiotemporal-feature-learning | 77.2 |
videomae-masked-autoencoders-are-data-1 | 87.4 |
video-swin-transformer | 78.8 |
holistic-large-scale-video-understanding | 77.6 |
more-is-less-learning-efficient-video-1 | 73.5 |
spatiotemporal-self-attention-modeling-with | 82.5 |
scaling-vision-transformers-to-22-billion | 88.0 |
vidtr-video-transformer-without-convolutions | 79.7 |
enhancing-video-transformers-for-action | 93.4 |
continual-3d-convolutional-neural-networks | 64.71 |
coca-contrastive-captioners-are-image-text | 88.0 |
an-image-is-worth-16x16-words-what-is-a-video | 80.5 |
continual-3d-convolutional-neural-networks | 71.61 |
d3d-distilled-3d-networks-for-video-action | 76.5 |
omni-sourced-webly-supervised-learning-for | 80.4 |
asymmetric-masked-distillation-for-pre | 82.2 |
slowfast-networks-for-video-recognition | - |
collaborative-spatiotemporal-feature-learning | 77.5 |
continual-3d-convolutional-neural-networks | 67.42 |
faster-recurrent-networks-for-video | 71.7 |
video-swin-transformer | 82.7 |
motionsqueeze-neural-motion-feature-learning | 76.4 |
temporal-shift-module-for-efficient-video | 74.7 |
is-space-time-attention-all-you-need-for | 80.7 |
continual-3d-convolutional-neural-networks | 67.33 |
mar-masked-autoencoders-for-efficient-action | 83.9 |
unmasked-teacher-towards-training-efficient | 90.6 |
multiscale-vision-transformers | 78.4 |
tdn-temporal-difference-networks-for | 79.4 |
is-space-time-attention-all-you-need-for | 79.7 |
video-swin-transformer | 83.1 |
x3d-expanding-architectures-for-efficient | 80.4 |
masked-feature-prediction-for-self-supervised | 87.0 |
tada-temporally-adaptive-convolutions-for-1 | 77.4 |
a-closer-look-at-spatiotemporal-convolutions | 72 |
implicit-temporal-modeling-with-learnable | 85.7 |
mar-masked-autoencoders-for-efficient-action | 79.4 |
vidtr-video-transformer-without-convolutions | 79.4 |
movinets-mobile-video-networks-for-efficient | 81.5 |
stand-alone-inter-frame-attention-in-video-1 | 83.1 |
cast-cross-attention-in-space-and-time-for-1 | 85.3 |
convnet-architecture-search-for | 73.9 |
vatt-transformers-for-multimodal-self | 82.1 |
video-classification-with-finecoarse-networks | 77.3 |
multiscale-vision-transformers | 81.2 |
learning-spatio-temporal-representation-with-3 | 72.3 |
video-swin-transformer | 80.6 |
vimpac-video-pre-training-via-masked-token | 77.4 |
one-peace-exploring-one-general | 88.1 |
learning-correlation-structures-for-vision | 83.4 |
a-closer-look-at-spatiotemporal-convolutions | 73.9 |
coca-contrastive-captioners-are-image-text | 88.9 |
an-image-is-worth-16x16-words-what-is-a-video | 79.3 |
videomae-masked-autoencoders-are-data-1 | 86.6 |
revisiting-3d-resnets-for-video-recognition | 80.4 |
video-classification-with-channel-separated | 77.8 |
continual-3d-convolutional-neural-networks | 65.90 |
internvideo2-scaling-video-foundation-models | 91.6 |
rethinking-video-vits-sparse-video-tubes-for | 90.2 |
actionclip-a-new-paradigm-for-video-action | 83.8 |
mar-masked-autoencoders-for-efficient-action | 85.3 |
uniformerv2-spatiotemporal-learning-by-arming | 90.0 |
tada-temporally-adaptive-convolutions-for-1 | 78.2 |
asymmetric-masked-distillation-for-pre | 80.1 |
temporally-adaptive-models-for-efficient | 86.4 |
masked-video-distillation-rethinking-masked | 83.4 |
continual-3d-convolutional-neural-networks | 69.29 |
region-based-non-local-operation-for-video | 77.4 |
omni-sourced-webly-supervised-learning-for | 80.5 |
internvideo2-scaling-video-foundation-models | 92.1 |
2103-15691 | - |
unmasked-teacher-towards-training-efficient | 90.6 |
masked-video-distillation-rethinking-masked | 87.2 |
continual-3d-convolutional-neural-networks | 53.40 |
omni-sourced-webly-supervised-learning-for | 83.6 |
masked-video-distillation-rethinking-masked | 81.0 |
learning-spatio-temporal-representation-with-3 | 79.4 |
implicit-temporal-modeling-with-learnable | 88.7 |
masked-feature-prediction-for-self-supervised | 86.7 |
slowfast-networks-for-video-recognition | 75.6 |
internvideo-general-video-foundation-models | 91.1 |
uniformer-unified-transformer-for-efficient | 82.9 |
video-modeling-with-correlation-networks | 79.2 |
continual-3d-convolutional-neural-networks | 60.18 |
movinets-mobile-video-networks-for-efficient | 72.7 |
mvfnet-multi-view-fusion-network-for | 79.1 |
a-closer-look-at-spatiotemporal-convolutions | 72 |
video-transformer-network | - |
slowfast-networks-for-video-recognition | 77.9 |
video-classification-with-channel-separated | 82.6 |
omnivore-a-single-model-for-many-visual | 84.1 |
x3d-expanding-architectures-for-efficient | 76 |
190600550 | 76.1 |
2103-15691 | 84.9 |
keeping-your-eye-on-the-ball-trajectory | 81.1 |
videomae-masked-autoencoders-are-data-1 | 81.5 |
movinets-mobile-video-networks-for-efficient | 75.0 |
aim-adapting-image-models-for-efficient-video | 87.5 |
a2-nets-double-attention-networks | 74.6 |
swin-transformer-v2-scaling-up-capacity-and | 86.8 |
slowfast-networks-for-video-recognition | 79.8 |
non-local-neural-networks | 77.7 |
hiera-a-hierarchical-vision-transformer | 87.8 |
co-training-transformer-with-videos-and | 86.3 |
learning-spatio-temporal-representation-with-3 | 81.2 |
side4video-spatial-temporal-side-network-for | 88.6 |
temporal-segment-networks-towards-good | 73.9 |
videomae-masked-autoencoders-are-data-1 | 86.1 |
video-transformer-network | - |
what-can-simple-arithmetic-operations-do-for | 89.4 |
continual-3d-convolutional-neural-networks | 63.98 |
revisiting-the-effectiveness-of-off-the-shelf | 73.0 |
what-makes-training-multi-modal-networks-hard | 77.7 |
continual-3d-convolutional-neural-networks | 63.03 |
continual-3d-convolutional-neural-networks | 56.86 |
videomae-masked-autoencoders-are-data-1 | 85.2 |
continual-3d-convolutional-neural-networks | 73.05 |
rethinking-video-vits-sparse-video-tubes-for | 90.9 |
movinets-mobile-video-networks-for-efficient | 80.9 |
rethinking-spatiotemporal-feature-learning | 74.7 |
improved-multiscale-vision-transformers-for | 86.1 |
movinets-mobile-video-networks-for-efficient | 78.2 |
mplug-2-a-modularized-multi-modal-foundation | 87.1 |
continual-3d-convolutional-neural-networks | 59.37 |
stm-spatiotemporal-and-motion-encoding-for | 73.7 |
dual-path-adaptation-from-image-to-video | 87.7 |
parameter-efficient-image-to-video-transfer | 87.2 |
masked-video-distillation-rethinking-masked | 86.4 |
190807625 | 78.8 |
omnivl-one-foundation-model-for-image | 79.1 |
a-closer-look-at-spatiotemporal-convolutions | 67.5 |
omnivec2-a-novel-transformer-based-network | 93.6 |
quo-vadis-action-recognition-a-new-model-and | 71.1 |
multiscale-vision-transformers | 76 |
two-stream-video-classification-with-cross | 75.98 |
faster-recurrent-networks-for-video | 75.1 |
is-space-time-attention-all-you-need-for | 78 |
rethinking-video-vits-sparse-video-tubes-for | 88.6 |
video-classification-with-channel-separated | 79.2 |
representation-flow-for-action-recognition | 77.9 |
temporally-adaptive-models-for-efficient | 89.9 |
continual-3d-convolutional-neural-networks | 71.03 |
eva-exploring-the-limits-of-masked-visual | 89.7 |
evolving-space-time-neural-architectures-for | 77.4 |
large-scale-weakly-supervised-pre-training | 82.8 |
slowfast-networks-for-video-recognition | 77 |
x3d-expanding-architectures-for-efficient | 79.1 |
continual-3d-convolutional-neural-networks | 59.52 |
appearance-and-relation-networks-for-video | 72.4 |
continual-3d-convolutional-neural-networks | 53.52 |
a-closer-look-at-spatiotemporal-convolutions | 75.4 |
vidtr-video-transformer-without-convolutions | 80.5 |
continual-3d-convolutional-neural-networks | 62.80 |
videomae-v2-scaling-video-masked-autoencoders | 88.5 |
bidirectional-cross-modal-knowledge | 88.7 |
videomae-v2-scaling-video-masked-autoencoders | 90.0 |
video-swin-transformer | 84.9 |
video-transformer-network | 78.6 |
multi-fiber-networks-for-video-recognition | 72.8 |
movinets-mobile-video-networks-for-efficient | 65.8 |
ct-net-channel-tensorization-network-for-1 | 79.8 |
continual-3d-convolutional-neural-networks | 67.24 |
continual-3d-convolutional-neural-networks | 68.45 |
zeroi2v-zero-cost-adaptation-of-pre-trained | 87.2 |
frozen-clip-models-are-efficient-video | 87.7 |
drop-an-octave-reducing-spatial-redundancy-in | 75.7 |
dual-path-adaptation-from-image-to-video | 85.4 |
improved-multiscale-vision-transformers-for | - |