Action Recognition In Videos On Something-Something V2
Evaluation Metrics
Top-1 Accuracy
Top-5 Accuracy
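Both metrics count a prediction as correct when the ground-truth class appears among a model's k highest-scoring classes (k = 1 for Top-1, k = 5 for Top-5). A minimal NumPy sketch of how these are typically computed; the function name, array shapes, and toy values below are illustrative assumptions, not taken from any of the listed papers:

```python
import numpy as np

def top_k_accuracy(logits: np.ndarray, labels: np.ndarray, k: int) -> float:
    """Fraction of samples whose true label is among the k highest-scoring classes.

    logits: (num_samples, num_classes) prediction scores
    labels: (num_samples,) integer ground-truth class ids
    """
    # Indices of the k largest scores per sample (order within the top-k is irrelevant).
    top_k = np.argsort(logits, axis=1)[:, -k:]
    # A sample is a hit if its true label appears anywhere in its top-k set.
    hits = (top_k == labels[:, None]).any(axis=1)
    return float(hits.mean())

# Toy usage: 3 samples, 6 classes (hypothetical values).
logits = np.array([
    [0.1, 0.2, 0.9, 0.0, 0.3, 0.1],  # true class 2 -> Top-1 hit
    [0.5, 0.1, 0.2, 0.4, 0.3, 0.0],  # true class 3 -> Top-5 hit only
    [0.0, 0.1, 0.2, 0.3, 0.4, 0.9],  # true class 0 -> miss even at Top-5
])
labels = np.array([2, 3, 0])
print(top_k_accuracy(logits, labels, k=1))  # ~0.333
print(top_k_accuracy(logits, labels, k=5))  # ~0.667
```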
Evaluation Results
Performance results for each model on this benchmark.
Comparison Table
Model Name | Top-1 Accuracy (%) | Top-5 Accuracy (%) |
---|---|---|
temporal-reasoning-graph-for-activity | 62.2 | 90.3 |
mvfnet-multi-view-fusion-network-for | 66.3 | - |
keeping-your-eye-on-the-ball-trajectory | 68.1 | 91.2 |
visual-representation-learning-from-unlabeled | 73.7 | - |
more-is-less-learning-efficient-video-1 | 65.2 | - |
spatiotemporal-self-attention-modeling-with | 69.8 | - |
videomae-masked-autoencoders-are-data-1 | 70.8 | 92.4 |
motionsqueeze-neural-motion-feature-learning | 66.6 | 90.6 |
tada-temporally-adaptive-convolutions-for-1 | 64.0 | 88.0 |
videomae-masked-autoencoders-are-data-1 | 75.4 | 95.2 |
pan-towards-fast-action-recognition-via | 66.5 | 90.6 |
movinets-mobile-video-networks-for-efficient | 61.3 | 88.2 |
keeping-your-eye-on-the-ball-trajectory | 67.1 | 90.6 |
object-region-video-transformers-1 | 67.9 | 90.5 |
self-supervised-video-transformer | 59.2 | - |
comparative-analysis-of-cnn-based | 47.73 | - |
is-space-time-attention-all-you-need-for | 62.3 | - |
mutual-modality-learning-for-video-action | 66.83 | 91.30 |
asymmetric-masked-distillation-for-pre | 70.2 | 92.5 |
uniformer-unified-transformer-for-efficient | 71.2 | 92.8 |
asymmetric-masked-distillation-for-pre | 73.3 | 94.0 |
ct-net-channel-tensorization-network-for-1 | 67.8 | 91.1 |
improved-multiscale-vision-transformers-for | - | - |
a-multigrid-method-for-efficiently-training | 61.7 | - |
video-swin-transformer | 69.6 | 92.7 |
masked-video-distillation-rethinking-masked | 77.3 | 95.7 |
mar-masked-autoencoders-for-efficient-action | 69.5 | 91.9 |
maximizing-spatio-temporal-entropy-of-deep-3d | 65.7 | 89.8 |
masked-video-distillation-rethinking-masked | 70.9 | 92.8 |
vimpac-video-pre-training-via-masked-token | 68.1 | - |
movinets-mobile-video-networks-for-efficient | 62.7 | 89.0 |
internvideo2-scaling-video-foundation-models | - | - |
mlp-3d-a-mlp-like-3d-architecture-with-1 | 68.5 | - |
the-something-something-video-database-for | 51.33 | 80.46 |
diverse-temporal-aggregation-and-depthwise | 64.1 | 88.6 |
movinets-mobile-video-networks-for-efficient | - | - |
cast-cross-attention-in-space-and-time-for-1 | 71.6 | - |
learning-self-similarity-in-space-and-time-as-1 | 65.7 | 89.8 |
relational-self-attention-what-s-missing-in | - | 91.1 |
space-time-mixing-attention-for-video | 67.2 | 90.8 |
direcformer-a-directed-attention-in | 64.94 | 87.9 |
relational-self-attention-what-s-missing-in | 64.8 | 89.1 |
object-region-video-transformers-1 | 69.5 | 91.5 |
masked-feature-prediction-for-self-supervised | 75.0 | 95.0 |
slow-fast-visual-tempo-learning-for-video | 67.8 | - |
slowfast-networks-for-video-recognition | 61.7 | - |
masked-video-distillation-rethinking-masked | 76.7 | 95.5 |
2103-15691 | 65.4 | 89.8 |
implicit-temporal-modeling-with-learnable | 70.2 | 91.8 |
tada-temporally-adaptive-convolutions-for-1 | 67.2 | 89.8 |
relational-self-attention-what-s-missing-in | 66.0 | 89.8 |
diverse-temporal-aggregation-and-depthwise | 65.8 | 89.5 |
is-space-time-attention-all-you-need-for | 59.5 | - |
learning-correlation-structures-for-vision | 71.5 | - |
mutual-modality-learning-for-video-action | 69.02 | 92.70 |
omnivore-a-single-model-for-many-visual | 71.4 | 93.5 |
vidtr-video-transformer-without-convolutions | 60.2 | - |
spatial-temporal-pyramid-graph-reasoning-for | 67.0 | - |
morphmlp-a-self-attention-free-mlp-like | 70.1 | 92.8 |
diverse-temporal-aggregation-and-depthwise | 63.2 | 88.2 |
mar-masked-autoencoders-for-efficient-action | 73.8 | 94.4 |
learning-self-similarity-in-space-and-time-as-1 | 67.7 | 91.1 |
stand-alone-inter-frame-attention-in-video-1 | 69.8 | - |
group-contextualization-for-video-recognition | 67.8 | 91.2 |
global-temporal-difference-network-for-action | 67.6 | - |
videomae-v2-scaling-video-masked-autoencoders | 77.0 | 95.9 |
relational-self-attention-what-s-missing-in | 67.7 | 91.1 |
hiera-a-hierarchical-vision-transformer | 76.5 | - |
tada-temporally-adaptive-convolutions-for-1 | 67.1 | 90.4 |
omnivl-one-foundation-model-for-image | 62.5 | 86.2 |
tdn-temporal-difference-networks-for | 69.6 | 92.2 |
co-training-transformer-with-videos-and | 70.9 | 92.5 |
multiscale-vision-transformers | 67.8 | 91.3 |
knowing-what-where-and-when-to-look-efficient | 66.5 | 90.4 |
improved-multiscale-vision-transformers-for | 73.3 | 94.1 |
zeroi2v-zero-cost-adaptation-of-pre-trained | 72.2 | 93.0 |
action-recognition-with-motion | 67.1 | - |
paying-more-attention-to-motion-attention | 49.9 | 79.1 |
improved-multiscale-vision-transformers-for | - | 93.4 |
diverse-temporal-aggregation-and-depthwise | 65.24 | 89.48 |
internvideo2-scaling-video-foundation-models | 77.1 | - |
temporal-reasoning-graph-for-activity | 61.3 | 91.4 |
internvideo-general-video-foundation-models | 77.2 | - |
temporal-pyramid-network-for-action | 62.0 | - |
motionsqueeze-neural-motion-feature-learning | 64.7 | 89.4 |
multiscale-vision-transformers | 66.2 | 90.2 |
uniformerv2-spatiotemporal-learning-by-arming | 73.0 | 94.5 |
learning-self-similarity-in-space-and-time-as-1 | 67.4 | 91.0 |
uniformer-unified-transformer-for-efficient | 69.4 | 92.1 |
diverse-temporal-aggregation-and-depthwise | 67.35 | 90.50 |
multi-scale-motion-aware-module-for-video | 68.2 | - |
few-shot-video-classification-via-temporal | 52.3 | - |
what-can-simple-arithmetic-operations-do-for | 74.6 | 94.4 |
relational-self-attention-what-s-missing-in | 67.3 | 90.8 |
multiview-transformers-for-video-recognition | 68.5 | 90.4 |
bevt-bert-pretraining-of-video-transformers | 71.4 | - |
action-keypoint-network-for-efficient-video | 64.3 | - |
tdn-temporal-difference-networks-for | 68.2 | 91.6 |
motionsqueeze-neural-motion-feature-learning | 63.0 | 88.4 |
mar-masked-autoencoders-for-efficient-action | 74.7 | 94.9 |
multiscale-vision-transformers | 68.7 | 91.5 |
diverse-temporal-aggregation-and-depthwise | 64.2 | 88.8 |
side4video-spatial-temporal-side-network-for | 75.2 | 94.0 |
cooperative-cross-stream-network-for | 61.2 | 89.3 |
masked-video-distillation-rethinking-masked | 73.7 | 94.0 |
improved-multiscale-vision-transformers-for | 72.1 | - |
mar-masked-autoencoders-for-efficient-action | 71.0 | 92.8 |
movinets-mobile-video-networks-for-efficient | 63.5 | 89.0 |
co-training-transformer-with-videos-and | 69.8 | 91.9 |
temporal-shift-module-for-efficient-video | 66.6 | 91.3 |
keeping-your-eye-on-the-ball-trajectory | 66.5 | 90.1 |
rethinking-video-vits-sparse-video-tubes-for | 76.1 | 95.2 |
tds-clip-temporal-difference-side-network-for | 73.4 | 93.8 |
is-space-time-attention-all-you-need-for | 62.5 | - |
the-effectiveness-of-mae-pre-pretraining-for | 74.4 | - |
tada-temporally-adaptive-convolutions-for-1 | 65.6 | 89.2 |
videomae-masked-autoencoders-are-data-1 | 74.3 | 94.6 |
implicit-temporal-modeling-with-learnable | 66.8 | 90.3 |
temporally-adaptive-models-for-efficient | 71.1 | - |
temporally-adaptive-models-for-efficient | 73.6 | - |
parameter-efficient-image-to-video-transfer | 72.3 | 93.9 |
prompt-learning-for-action-recognition | 67.3 | 91.0 |