Action Recognition in Videos on Something-Something V1
Metrics
GFLOPs
Param.
Top 1 Accuracy
Top 5 Accuracy
Results
Performance results of various models on this benchmark
Comparison Table
Model Name | GFLOPs | Param. | Top 1 Accuracy | Top 5 Accuracy |
---|---|---|---|---|
diverse-temporal-aggregation-and-depthwise | 9.3x6 | 5.8M | 49.5 | 78.0 |
video-classification-with-channel-separated | - | - | 53.3 | - |
videomae-v2-scaling-video-masked-autoencoders | - | - | 68.7 | 91.9 |
hierarchical-feature-aggregation-networks-for | - | - | 41.97 | - |
eco-efficient-convolutional-network-for | - | - | 46.4 | - |
diverse-temporal-aggregation-and-depthwise | 11.5x6 | 3.3M | 49.8 | 78.0 |
spatiotemporal-self-attention-modeling-with | - | - | 58.3 | - |
diverse-temporal-aggregation-and-depthwise | 20.9x6 | 5.8M | 54.59 | 82.30 |
temporally-adaptive-models-for-efficient | - | - | 60.7 | - |
temporally-adaptive-models-for-efficient | - | - | 63.7 | - |
multi-scale-motion-aware-module-for-video | - | - | 57.9 | - |
temporal-shift-module-for-efficient-video | - | - | 50.7 | - |
slow-fast-visual-tempo-learning-for-video | - | - | 57.2 | - |
mars-motion-augmented-rgb-stream-for-action | - | - | 40.4 | - |
ean-event-adaptive-network-for-enhanced | - | - | 57.2 | 83.9 |
moments-in-time-dataset-one-million-videos | - | - | 48.6 | - |
temporal-relational-reasoning-in-videos | - | - | 42.01 | - |
video-classification-with-channel-separated | - | - | 48.4 | - |
learning-self-similarity-in-space-and-time-as-1 | - | - | 54.3 | 82.9 |
ct-net-channel-tensorization-network-for-1 | - | - | 56.6 | - |
uniformerv2-spatiotemporal-learning-by-arming | - | - | 62.7 | 88.0 |
non-local-neural-networks | - | - | 44.4 | - |
internvideo-general-video-foundation-models | - | - | 70.0 | - |
190807625 | - | - | 53.4 | - |
relational-self-attention-what-s-missing-in | - | - | 56.1 | 82.8 |
temporal-shift-module-for-efficient-video | - | - | 49.7 | 78.5 |
video-classification-with-finecoarse-networks | - | - | 57.1 | 84.2 |
tds-clip-temporal-difference-side-network-for | - | - | 63.0 | 87.8 |
motionsqueeze-neural-motion-feature-learning | - | - | 52.1 | 82.3 |
diverse-temporal-aggregation-and-depthwise | 5.7x6 | 3.3M | 48.1 | 76.9 |
learning-correlation-structures-for-vision | - | - | 61.3 | - |
video-classification-with-channel-separated | - | - | 51.6 | - |
motion-feature-network-fixed-motion-filter | - | - | 43.9 | - |
moments-in-time-dataset-one-million-videos | - | - | 50 | - |
uniformer-unified-transformer-for-efficient | 41.8x3 | 21.4M | 57.6 | 84.9 |
temporal-reasoning-graph-for-activity | - | - | 49.5 | 86.1 |
pan-towards-fast-action-recognition-via | - | - | 55.3 | 82.8 |
relational-self-attention-what-s-missing-in | - | - | 51.9 | 79.6 |
relational-self-attention-what-s-missing-in | - | - | 54.0 | 81.1 |
temporal-shift-module-for-efficient-video | - | - | 47.2 | 77.1 |
spatial-temporal-pyramid-graph-reasoning-for | - | - | 53.5 | - |
rethinking-spatiotemporal-feature-learning | - | - | 48.2 | 78.7 |
region-based-non-local-operation-for-video | - | - | 52.7 | 81.5 |
side4video-spatial-temporal-side-network-for | - | - | 67.3 | 88.8 |
mlp-3d-a-mlp-like-3d-architecture-with-1 | - | - | 56.5 | - |
uniformer-unified-transformer-for-efficient | 259x3 | 50.1M | 60.9 | 87.3 |
video-classification-with-channel-separated | - | - | 52.1 | - |
temporal-reasoning-graph-for-activity | - | - | 49.7 | - |
ae-net-adjoint-enhancement-network-for | - | - | 55.0 | - |
knowing-what-where-and-when-to-look-efficient | - | - | 52.6 | 81.3 |
gate-shift-networks-for-video-action | - | - | 55.16 | - |
stand-alone-inter-frame-attention-in-video-1 | - | - | 57.3 | - |
video-classification-with-channel-separated | - | - | 49.3 | - |
action-recognition-with-motion | - | - | 56.6 | - |
what-can-simple-arithmetic-operations-do-for | - | - | 65.6 | 88.6 |
videos-as-space-time-region-graphs | - | - | 46.1 | - |
learning-self-similarity-in-space-and-time-as-1 | - | - | 56.6 | 84.4 |
recurrent-space-time-graphs-for-video | - | - | 49.2 | - |
temporal-relational-reasoning-in-videos | - | - | 34.4 | - |
motionsqueeze-neural-motion-feature-learning | - | - | 55.1 | - |
motionsqueeze-neural-motion-feature-learning | - | - | 50.9 | 80.3 |
relational-self-attention-what-s-missing-in | - | - | 55.5 | 82.6 |
learning-self-similarity-in-space-and-time-as-1 | - | - | 55.8 | 83.9 |
action-keypoint-network-for-efficient-video | - | - | 52.5 | - |
mvfnet-multi-view-fusion-network-for | - | - | 54.0 | - |
diverse-temporal-aggregation-and-depthwise | 11.5x6 | 3.3M | 52.68 | 80.43 |
diverse-temporal-aggregation-and-depthwise | 20.9x6 | 5.8M | 50.6 | 78.7 |
motionsqueeze-neural-motion-feature-learning | - | - | 54.4 | 83.8 |
region-based-non-local-operation-for-video | - | - | 54.1 | 82.2 |
gate-shift-networks-for-video-action | - | - | 51.68 | - |
tdn-temporal-difference-networks-for | - | - | 56.8 | 84.1 |
mars-motion-augmented-rgb-stream-for-action | - | - | 53.0 | - |
eco-efficient-convolutional-network-for | - | - | 46.4 | - |
rethinking-spatiotemporal-feature-learning | - | - | 47.3 | 78.1 |
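
The rows above can be read programmatically for ranking or for comparing compute cost. The following is a minimal sketch, not an official tool for this leaderboard. It assumes that a GFLOPs cell such as `5.7x6` means GFLOPs per view multiplied by the number of test-time views (an interpretation the table itself does not state), and that `-` marks a value that was not reported. The helper names `parse_gflops` and `parse_row` are hypothetical.

```python
from typing import Optional


def parse_gflops(cell: str) -> Optional[float]:
    """Return total GFLOPs for a cell like '5.7x6', or None for '-'.

    Assumption: 'AxB' means A GFLOPs per view times B views.
    """
    cell = cell.strip()
    if cell in ("", "-"):
        return None
    if "x" in cell:
        per_view, views = cell.split("x", 1)
        return float(per_view) * int(views)
    return float(cell)


def parse_row(line: str) -> dict:
    """Split one pipe-delimited table row into named fields."""
    # Drop the trailing pipe, then split into the five columns.
    cells = [c.strip() for c in line.strip().rstrip("|").split("|")]
    name, gflops, params, top1, top5 = cells
    return {
        "model": name,
        "total_gflops": parse_gflops(gflops),
        "params": None if params == "-" else params,
        "top1": float(top1),
        "top5": None if top5 == "-" else float(top5),
    }


# Three rows copied verbatim from the table above.
rows = [
    "diverse-temporal-aggregation-and-depthwise | 5.7x6 | 3.3M | 48.1 | 76.9 |",
    "diverse-temporal-aggregation-and-depthwise | 20.9x6 | 5.8M | 54.59 | 82.30 |",
    "internvideo-general-video-foundation-models | - | - | 70.0 | - |",
]

# Rank by Top 1 accuracy, highest first.
parsed = sorted((parse_row(r) for r in rows), key=lambda r: r["top1"], reverse=True)
for r in parsed:
    print(r["model"], r["top1"], r["total_gflops"])
```

Unreported values are kept as `None` rather than zero, so entries with missing GFLOPs or Top 5 figures are not mistaken for very cheap or very inaccurate models when sorting or filtering.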