Image Classification on ImageNet
Evaluation Metrics
Hardware Burden
Number of params
Operations per network pass
Top 1 Accuracy
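For reference, below is a minimal sketch (not part of this benchmark's tooling; the torchvision `resnet50` constructor and the `loader` object are illustrative assumptions) of how the "Number of params" and "Top 1 Accuracy" columns are typically computed for an ImageNet classifier. "Operations per network pass" is usually the FLOPs/MACs of a single forward pass, measured with a separate profiler.

```python
# Minimal sketch (assumed, not the benchmark's own tooling) of two of the
# metrics above for an ImageNet classifier in PyTorch.
import torch
import torchvision.models as models

model = models.resnet50(weights=None)  # placeholder classifier
model.eval()

# Number of params: total learnable parameters, reported in millions (M).
num_params = sum(p.numel() for p in model.parameters())
print(f"Number of params: {num_params / 1e6:.2f}M")

# Top 1 Accuracy: share of validation images whose highest-scoring class
# matches the ground-truth label. `loader` is a hypothetical ImageNet
# validation DataLoader yielding (images, labels) batches.
@torch.no_grad()
def top1_accuracy(model, loader):
    correct, total = 0, 0
    for images, labels in loader:
        preds = model(images).argmax(dim=1)
        correct += (preds == labels).sum().item()
        total += labels.numel()
    return 100.0 * correct / total
```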
Evaluation Results
Performance results of each model on this benchmark
Comparison Table
Model name | Hardware Burden | Number of params | Operations per network pass | Top 1 Accuracy |
---|---|---|---|---|
xception-deep-learning-with-depthwise | 87G | 22.855952M | 0.838G | 79% |
deep-residual-learning-for-image-recognition | - | 40M | - | 78.25% |
cvt-introducing-convolutions-to-vision | - | 20M | - | 83% |
densely-connected-convolutional-networks | - | - | - | 77.42% |
convit-improving-vision-transformers-with | - | 10M | - | 76.7% |
when-vision-transformers-outperform-resnets | - | 87M | - | 79.9% |
pvtv2-improved-baselines-with-pyramid-vision | - | 25.4M | - | 82% |
metaformer-baselines-for-vision | - | 40M | - | 85.4% |
mambavision-a-hybrid-mamba-transformer-vision | - | 241.5M | - | 85.3% |
fbnetv5-neural-architecture-search-for | - | - | - | 78.4% |
gated-convolutional-networks-with-hybrid | - | 12.9M | - | 78.5% |
convit-improving-vision-transformers-with | - | 48M | - | 82.2% |
spatial-channel-token-distillation-for-vision | - | 122.6M | - | 82.4% |
rethinking-and-improving-relative-position | - | 22M | - | 80.9% |
bottleneck-transformers-for-visual | - | 53.9M | - | 84% |
augmenting-sub-model-to-improve-main-model | - | 86.6M | - | 84.2% |
maxvit-multi-axis-vision-transformer | - | 120M | - | 84.94% |
biformer-vision-transformer-with-bi-level | - | - | - | 84.3% |
incorporating-convolution-designs-into-visual | - | 6.4M | - | 76.4% |
automix-unveiling-the-power-of-mixup | - | 44.6M | - | 80.98% |
resnest-split-attention-networks | - | 27.5M | - | 81.13% |
dilated-neighborhood-attention-transformer | - | 90M | - | 84.4% |
torchdistill-a-modular-configuration-driven | - | - | - | 71.08% |
next-vit-next-generation-vision-transformer | - | 44.8M | - | 83.2% |
improved-multiscale-vision-transformers-for | - | 218M | - | 86.3% |
cyclemlp-a-mlp-like-architecture-for-dense | - | 76M | - | 83.2% |
efficientnetv2-smaller-models-and-faster | - | 54M | - | 86.2% |
sp-vit-learning-2d-spatial-priors-for-vision | - | - | - | 83.9% |
an-improved-one-millisecond-mobile-backbone | - | 2.1M | - | 72.5% |
high-performance-large-scale-image | - | 193.8M | - | 85.1% |
mixpro-data-augmentation-with-maskmix-and | - | - | - | 76.7% |
co-training-2-l-submodels-for-visual | - | - | - | 87.5% |
hvt-a-comprehensive-vision-framework-for | - | - | - | 87.4% |
basisnet-two-stage-model-synthesis-for-1 | - | - | - | 80% |
vitae-vision-transformer-advanced-by | - | - | - | 75.3% |
semi-supervised-recognition-under-a-noisy-and | - | 25.58M | - | 83.0% |
Model 37 | - | - | - | 80.91% |
alphanet-improved-training-of-supernet-with | - | - | - | 80.0% |
which-transformer-to-favor-a-comparative | - | - | - | 82.29% |
localvit-bringing-locality-to-vision | - | 22.4M | - | 80.8% |
bottleneck-transformers-for-visual | - | - | - | 83.5% |
a-convnet-for-the-2020s | - | 350M | - | 87.8% |
an-improved-one-millisecond-mobile-backbone | - | 4.8M | - | 75.9 |
an-image-is-worth-16x16-words-transformers-1 | - | - | - | 87.76% |
190409925 | - | - | - | 79.1% |
shape-texture-debiased-neural-network-1 | - | - | - | 81.2 |
gswin-gated-mlp-vision-model-with | - | 39.8M | - | 83.01% |
revbifpn-the-fully-reversible-bidirectional | - | 48.7M | - | 83% |
rethinking-spatial-dimensions-of-vision | - | 4.9M | - | 74.6% |
dilated-neighborhood-attention-transformer | - | - | - | 86.5% |
an-evolutionary-approach-to-dynamic | - | - | - | 86.74% |
2103-14899 | - | 27.4M | - | 81.5% |
exploring-target-representations-for-masked | - | - | - | 85.7% |
on-the-adequacy-of-untuned-warmup-for | - | - | - | 72.1% |
alphanet-improved-training-of-supernet-with | - | - | - | 77.8% |
multigrain-a-unified-image-embedding-for | - | - | - | 83.2% |
gswin-gated-mlp-vision-model-with | - | 15.5M | - | 80.32% |
Model 58 | - | - | - | 80.25% |
wavemix-lite-a-resource-efficient-neural-1 | - | 32.4M | - | 67.7% |
randaugment-practical-data-augmentation-with | - | - | - | 85.4% |
maxvit-multi-axis-vision-transformer | - | - | - | 88.32% |
gtp-vit-efficient-vision-transformers-via | - | - | - | 85.4% |
an-improved-one-millisecond-mobile-backbone | - | 14.8M | - | 79.4% |
cvt-introducing-convolutions-to-vision | - | 32M | - | 84.9% |
splitnet-divide-and-co-training | - | 88.6M | - | 82.13% |
tinyvit-fast-pretraining-distillation-for | - | 21M | - | 84.8% |
performance-of-gaussian-mixture-model | - | - | - | 84.1% |
resnest-split-attention-networks | - | 70M | - | 83.9% |
what-do-deep-networks-like-to-see | - | - | - | 77.12% |
filter-response-normalization-layer | - | - | - | 78.95% |
do-you-even-need-attention-a-stack-of-feed | - | - | - | 74.9 |
repmlp-re-parameterizing-convolutions-into | - | 52.77M | - | 78.60% |
scaling-vision-with-sparse-mixture-of-experts | - | 7200M | - | 88.36% |
drop-an-octave-reducing-spatial-redundancy-in | 20771G | 66.8M | 2.22G | 82.9% |
self-training-with-noisy-student-improves | - | 9.2M | - | 82.4% |
visformer-the-vision-friendly-transformer | - | 10.3M | - | 78.6% |
mixnet-mixed-depthwise-convolutional-kernels | - | 4.1M | - | 75.8% |
three-things-everyone-should-know-about | - | - | - | 83.4% |
multigrain-a-unified-image-embedding-for | - | - | - | 83.6% |
dynamic-convolution-attention-over | - | 18.6M | - | 67.7% |
token-labeling-training-a-85-5-top-1-accuracy | - | 26M | - | 83.3% |
metaformer-baselines-for-vision | - | 56M | - | 85.2% |
rethinking-local-perception-in-lightweight | - | 4.2M | - | 77% |
lets-keep-it-simple-using-simple | - | 3M | - | 75.66 |
elsa-enhanced-local-self-attention-for-vision | - | 298M | - | 87.2% |
an-improved-one-millisecond-mobile-backbone | - | 2.1M | - | 71.4% |
bottleneck-transformers-for-visual | - | 66.6M | - | 82.2% |
dilated-neighborhood-attention-transformer | - | 20M | - | 81.8% |
revbifpn-the-fully-reversible-bidirectional | - | 142.3M | - | 84.2% |
exploring-target-representations-for-masked | - | - | - | 88.2% |
fixing-the-train-test-resolution-discrepancy-2 | - | 9.2M | - | 83.6% |
kolmogorov-arnold-transformer | - | 86.6M | - | 81.8 |
Model 93 | - | - | - | 70.54 |
debiformer-vision-transformer-with-deformable | - | - | - | 83.9% |
mega-moving-average-equipped-gated-attention | - | 90M | - | 82.4% |
espnetv2-a-light-weight-power-efficient-and | - | 5.9M | - | 74.9% |
multigrain-a-unified-image-embedding-for | - | - | - | 83.0% |
semi-supervised-recognition-under-a-noisy-and | - | 25.58M | - | 84.0% |
when-shift-operation-meets-vision-transformer | - | 88M | - | 83.3% |
polynomial-networks-in-deep-classifiers | - | 11.51M | - | 71.6% |
peco-perceptual-codebook-for-bert-pre | - | - | - | 87.5% |
puzzle-mix-exploiting-saliency-and-local-1 | - | - | - | 78.76% |
token-labeling-training-a-85-5-top-1-accuracy | - | 56M | - | 84.1% |
bottleneck-transformers-for-visual | - | - | - | 83.8% |
exploring-the-limits-of-weakly-supervised | - | 829M | - | 85.4% |
fastvit-a-fast-hybrid-vision-transformer | - | - | - | 82.6% |
training-data-efficient-image-transformers | - | 86M | - | 84.2% |
an-image-is-worth-16x16-words-transformers-1 | - | - | - | 24% |
coca-contrastive-captioners-are-image-text | - | 2100M | - | 91.0% |
circumventing-outliers-of-autoaugment-with | - | 88M | - | 85.8% |
contrastive-learning-rivals-masked-image | - | 307M | - | 89.0% |
transnext-robust-foveal-visual-perception-for | - | 12.8M | - | 82.5% |
hyenapixel-global-image-context-with | - | - | - | 84.9% |
2103-15358 | - | 39.7M | - | 83.3% |
a-dot-product-attention-free-transformer | - | 22.6M | - | 79.8% |
tinyvit-fast-pretraining-distillation-for | - | 11M | - | 83.2% |
torchdistill-a-modular-configuration-driven | - | - | - | 70.09% |
florence-a-new-foundation-model-for-computer | - | 893M | - | 90.05% |
coatnet-marrying-convolution-and-attention | - | 42M | - | 83.3% |
high-performance-large-scale-image | - | 377.2M | - | 86.0% |
transboost-improving-the-best-imagenet | - | 25.56M | - | 81.15% |
maxvit-multi-axis-vision-transformer | - | 69M | - | 84.45% |
sp-vit-learning-2d-spatial-priors-for-vision | - | - | - | 85.5% |
transboost-improving-the-best-imagenet | - | 60.19M | - | 80.64% |
efficient-self-supervised-learning-with | - | - | - | 87.4% |
revbifpn-the-fully-reversible-bidirectional | - | 5.11M | - | 75.9% |
scarletnas-bridging-the-gap-between | - | 6.5M | - | 76.3% |
splitnet-divide-and-co-training | - | 98M | - | 83.34% |
fastervit-fast-vision-transformers-with | - | 53.4M | - | 83.2% |
fairnas-rethinking-evaluation-fairness-of | - | 4.6M | - | 75.34% |
dynamic-convolution-attention-over | - | 4M | - | 69.4% |
resmlp-feedforward-networks-for-image | - | 17.7M | - | 78.6% |
generalized-parametric-contrastive-learning | - | - | - | 86.01% |
co-training-2-l-submodels-for-visual | - | - | - | 85.8% |
mixpro-data-augmentation-with-maskmix-and | - | - | - | 73.8% |
spinenet-learning-scale-permuted-backbone-for | - | 60.5M | - | 79% |
unicom-universal-and-compact-representation | - | - | - | 88.3 |
billion-scale-semi-supervised-learning-for | - | 193M | - | 84.8% |
efficient-multi-order-gated-aggregation | - | 3M | - | 77.2% |
pattern-attention-transformer-with-doughnut | - | - | - | 83.6% |
fastervit-fast-vision-transformers-with | - | 159.5M | - | 84.9% |
reproducible-scaling-laws-for-contrastive | - | - | - | 88.5% |
which-transformer-to-favor-a-comparative | - | - | - | 82.11% |
190411491 | - | 14.7M | - | 75.7% |
graph-rise-graph-regularized-image-semantic | - | - | - | 68.29% |
muxconv-information-multiplexing-in | - | 2.4M | - | 71.6% |
mobilenets-efficient-convolutional-neural | - | - | - | 70.6% |
fractalnet-ultra-deep-neural-networks-without | - | - | - | 75.88% |
xcit-cross-covariance-image-transformers | - | 189M | - | 86% |
understanding-gaussian-attention-bias-of | - | - | - | 81.484% |
fixing-the-train-test-resolution-discrepancy | - | - | - | 79.1% |
scaling-vision-transformers-to-22-billion | - | 86M | - | 88.6% |
ghostnet-more-features-from-cheap-operations | - | 13M | - | 75% |
volo-vision-outlooker-for-visual-recognition | - | 296M | - | 87.1% |
dat-spatially-dynamic-vision-transformer-with | - | 53M | - | 84.6% |
convit-improving-vision-transformers-with | - | 86M | - | 82.4% |
efficientvit-enhanced-linear-attention-for | - | 64M | - | 85.6% |
mambavision-a-hybrid-mamba-transformer-vision | - | 227.9M | - | 85% |
greedynas-towards-fast-one-shot-nas-with | - | 6.5M | - | 77.1% |
visual-attention-network | - | 200M | - | 86.9% |
a-dot-product-attention-free-transformer | - | 23M | - | 80.8% |
scalable-visual-transformers-with | - | 5.74M | - | 69.64% |
on-the-performance-analysis-of-momentum | - | - | - | 76.91% |
levit-a-vision-transformer-in-convnet-s | - | 39.4M | - | 82.5% |
fixing-the-train-test-resolution-discrepancy | - | 25.6M | - | 82.5% |
scaling-up-your-kernels-to-31x31-revisiting | - | 335M | - | 87.8% |
contextual-transformer-networks-for-visual | - | 55.8M | - | 84.6% |
meta-knowledge-distillation | - | - | - | 83.1% |
involution-inverting-the-inherence-of | - | 34M | - | 79.3% |
tokens-to-token-vit-training-vision | - | - | - | 83.3% |
rethinking-the-design-principles-of-robust | - | 23.3M | - | 81.9% |
which-transformer-to-favor-a-comparative | - | - | - | 83.61% |
a-convnet-for-the-2020s | - | 1827M | - | 88.36% |
clcnet-rethinking-of-ensemble-modeling-with | - | - | - | 86.42% |
going-deeper-with-image-transformers | - | 438M | - | 86.5% |
three-things-everyone-should-know-about | - | - | - | 85.5% |
fbnetv5-neural-architecture-search-for | - | - | - | 84.1% |
sp-vit-learning-2d-spatial-priors-for-vision | - | - | - | 85.1% |
semi-supervised-recognition-under-a-noisy-and | - | 5.47M | - | 79.0% |
augmenting-convolutional-networks-with | - | 99.4M | - | 83.5% |
from-xception-to-nexception-new-design | - | - | - | 82% |
automix-unveiling-the-power-of-mixup | - | 21.8M | - | 76.1% |
balanced-binary-neural-networks-with-gated | - | - | - | 59.4% |
co-training-2-l-submodels-for-visual | - | - | - | 88.0% |
transnext-robust-foveal-visual-perception-for | - | 49.7M | - | 86.0% |
refiner-refining-self-attention-for-vision | - | 81M | - | 86.03 |
co-training-2-l-submodels-for-visual | - | - | - | 86.3% |
repvgg-making-vgg-style-convnets-great-again | - | 80.31M | - | 78.78% |
firecaffe-near-linear-acceleration-of-deep | - | - | - | 58.9% |
convmlp-hierarchical-convolutional-mlps-for | - | 42.7M | - | 80.2% |
Model 191 | - | - | - | 78.15% |
internimage-exploring-large-scale-vision | - | 50M | - | 84.2% |
2103-15358 | - | 24.6M | - | 82% |
scaling-up-visual-and-vision-language | - | 480M | - | 88.64% |
from-xception-to-nexception-new-design | - | - | - | 81.8% |
gtp-vit-efficient-vision-transformers-via | - | - | - | 83.7% |
multigrain-a-unified-image-embedding-for | - | - | - | 79.4% |
zen-nas-a-zero-shot-nas-for-high-performance | - | 183M | - | 83.0% |
kolmogorov-arnold-transformer | - | 86.6M | - | 82.8 |
a-fast-knowledge-distillation-framework-for | - | 5M | - | 78.7% |
hyenapixel-global-image-context-with | - | - | - | 83.5% |
gtp-vit-efficient-vision-transformers-via | - | - | - | 81.9% |
rethinking-and-improving-relative-position | - | - | - | 81.1% |
which-transformer-to-favor-a-comparative | - | - | - | 81.33% |
one-peace-exploring-one-general | - | 1520M | - | - |
moat-alternating-mobile-convolution-and | - | 483.2M | - | 89.1% |
sequencer-deep-lstm-for-image-classification | - | 28M | - | 82.3% |
dilated-neighborhood-attention-transformer | - | 51M | - | 83.8% |
tinyvit-fast-pretraining-distillation-for | - | 11M | - | 81.5% |
Model 210 | - | - | - | 78.75% |
incorporating-convolution-designs-into-visual | - | - | - | 82% |
high-performance-large-scale-image | - | 527M | - | 89.2% |
self-training-with-noisy-student-improves | - | 66M | - | 86.9% |
cswin-transformer-a-general-vision | - | 173M | - | 87.5% |
debiformer-vision-transformer-with-deformable | - | - | - | 84.4% |
cutmix-regularization-strategy-to-train | - | - | - | 78.4% |
mambavision-a-hybrid-mamba-transformer-vision | - | 97.7M | - | 84.2% |
expeditious-saliency-guided-mix-up-through | - | - | - | 77.39% |
nasvit-neural-architecture-search-for | - | - | - | 78.2% |
volo-vision-outlooker-for-visual-recognition | - | 59M | - | 86% |
wave-vit-unifying-wavelet-and-transformers | - | 33.5M | - | 84.8% |
alphanet-improved-training-of-supernet-with | - | - | - | 78.9% |
nasvit-neural-architecture-search-for | - | - | - | 81.0% |
mixpro-data-augmentation-with-maskmix-and | - | - | - | 82.9% |
omnivec2-a-novel-transformer-based-network | - | - | - | 89.3% |
transboost-improving-the-best-imagenet | - | 71.71M | - | 82.16% |
rexnet-diminishing-representational | - | 2.7M | - | 74.6% |
next-vit-next-generation-vision-transformer | - | 57.8M | - | 84.7% |
vision-models-are-more-robust-and-fair-when | - | 10000M | - | 85.8% |
nasvit-neural-architecture-search-for | - | - | - | 81.4% |
masked-autoencoders-are-scalable-vision | - | - | - | 85.9% |
vitaev2-vision-transformer-advanced-by | - | 644M | - | 88.5% |
neighborhood-attention-transformer | - | 20M | - | 81.8% |
an-image-is-worth-16x16-words-transformers-1 | - | - | - | - |
revisiting-resnets-improved-training-and | - | 192M | - | 84.4% |
efficientnet-rethinking-model-scaling-for | - | 66M | - | 84.4% |
2103-15358 | - | 55.7M | - | 83.2% |
lets-keep-it-simple-using-simple | - | 1.5M | - | 61.52 |
vision-gnn-an-image-is-worth-graph-of-nodes | - | 10.7M | - | 78.2% |
adaptive-split-fusion-transformer | - | 56.7M | - | 83.9% |
gtp-vit-efficient-vision-transformers-via | - | - | - | 85.8% |
uninet-unified-architecture-search-with-1 | - | 11.5M | - | 80.8% |
augmenting-convolutional-networks-with | - | 99.4M | - | 86.5% |
mixpro-data-augmentation-with-maskmix-and | - | - | - | 80.6% |
going-deeper-with-image-transformers | - | 68.2M | - | 85.4% |
which-transformer-to-favor-a-comparative | - | - | - | 82.54% |
Model 247 | - | - | - | 76.3 |
fastvit-a-fast-hybrid-vision-transformer | - | - | - | 84.9% |
davit-dual-attention-vision-transformers | - | 87.9M | - | 86.9% |
co-training-2-l-submodels-for-visual | - | - | - | 87.1% |
Model 251 | - | - | - | 81.92% |
firecaffe-near-linear-acceleration-of-deep | - | - | - | 68.3% |
sp-vit-learning-2d-spatial-priors-for-vision | - | - | - | 84.9% |
efficientnet-rethinking-model-scaling-for | - | 9.2M | - | 79.8% |
efficientnet-rethinking-model-scaling-for | - | 12M | - | 81.1% |
exploring-the-limits-of-weakly-supervised | - | 466M | - | 85.1% |
correlated-input-dependent-label-noise-in | - | - | - | 68.6% |
not-all-images-are-worth-16x16-words-dynamic | - | - | - | 78.48% |
rethinking-and-improving-relative-position | - | - | - | 81.4% |
container-context-aggregation-network | - | 20M | - | 82% |
self-training-with-noisy-student-improves | - | 30M | - | 86.1% |
srm-a-style-based-recalibration-module-for | - | - | - | 78.47% |
densely-connected-convolutional-networks | - | - | - | 74.98% |
self-training-with-noisy-student-improves | - | 19M | - | 85.3% |
Model 265 | - | - | - | 78.36% |
exploring-randomly-wired-neural-networks-for | - | 61.5M | - | 80.1% |
mambavision-a-hybrid-mamba-transformer-vision | - | 31.8M | - | 82.3% |
fast-vision-transformers-with-hilo-attention | - | 28M | - | 82% |
your-diffusion-model-is-secretly-a-zero-shot | - | - | - | 79.1% |
xcit-cross-covariance-image-transformers | - | 84M | - | 85.8% |
semi-supervised-learning-of-visual-features | - | - | - | 75.5% |
cutmix-regularization-strategy-to-train | - | - | - | 80.53% |
convit-improving-vision-transformers-with | - | 152M | - | 82.5% |
res2net-a-new-multi-scale-backbone | - | - | - | 78.59% |
exploring-target-representations-for-masked | - | - | - | 87.8% |
going-deeper-with-image-transformers | - | 271M | - | 86.3% |
rest-an-efficient-transformer-for-visual | - | 13.66M | - | 79.6% |
resnest-split-attention-networks | - | 27.5M | - | 80.64% |
mobilevitv3-mobile-friendly-vision | - | 5.8M | - | 79.3% |
high-performance-large-scale-image | - | 254.9M | - | 85.7% |
visual-attention-network | - | 60M | - | 86.6% |
tokenmixup-efficient-attention-guided-token | - | - | - | 82.37% |
wide-residual-networks | - | - | - | 78.1% |
unsupervised-data-augmentation-1 | - | - | - | 79.04% |
mobilenetv4-universal-models-for-the-mobile | - | - | - | 79.9% |
parametric-contrastive-learning | - | - | - | 80.9% |
designing-network-design-spaces | - | 6.3M | - | 76.3% |
adversarial-autoaugment-1 | - | - | - | 79.4% |
mixmim-mixed-and-masked-image-modeling-for | - | 88M | - | 85.1% |
neighborhood-attention-transformer | - | 28M | - | 83.2% |
mixpro-data-augmentation-with-maskmix-and | - | - | - | 82.7% |
hornet-efficient-high-order-spatial | - | - | - | 87.7% |
vitae-vision-transformer-advanced-by | - | 13.2M | - | 81% |
going-deeper-with-image-transformers | - | 46.9M | - | 85.1% |
unireplknet-a-universal-perception-large | - | - | - | 87.4% |
designing-bert-for-convolutional-networks | - | 198M | - | 86.0% |
swin-transformer-hierarchical-vision | - | 88M | - | 86.4% |
Model 298 | - | - | - | 77.71% |
quantnet-learning-to-quantize-by-learning | - | - | - | 71.97% |
adversarial-examples-improve-image | - | 66M | - | 85.2% |
whats-hidden-in-a-randomly-weighted-neural | - | 20.6M | - | 73.3% |
involution-inverting-the-inherence-of | - | 15.5M | - | 78.4% |
elsa-enhanced-local-self-attention-for-vision | - | 28M | - | 82.7% |
next-vit-next-generation-vision-transformer | - | 31.7M | - | 82.5% |
muxconv-information-multiplexing-in | - | 4.0M | - | 76.6% |
alphanet-improved-training-of-supernet-with | - | - | - | 80.3% |
2103-14899 | - | 44.3M | - | 82.8% |
sliced-recursive-transformer-1 | - | 21.3M | - | 84.3% |
resmlp-feedforward-networks-for-image | - | - | - | 79.4% |
mobilevit-light-weight-general-purpose-and | - | 5.6M | - | 78.4% |
masked-image-residual-learning-for-scaling-1 | - | 96M | - | 84.8% |
neighborhood-attention-transformer | - | 90M | - | 84.3% |
bottleneck-transformers-for-visual | - | 49.2M | - | 81.4% |
pvtv2-improved-baselines-with-pyramid-vision | - | 45.2M | - | 83.2% |
scalable-pre-training-of-large-autoregressive | - | - | - | 84.0 |
distilled-gradual-pruning-with-pruned-fine | - | 2.56M | - | 73.66% |
uninet-unified-architecture-search-with-1 | - | 72.9M | - | 87% |
fastvit-a-fast-hybrid-vision-transformer | - | - | - | 80.6% |
gated-convolutional-networks-with-hybrid | - | 42.2M | - | 80.5% |
augmenting-sub-model-to-improve-main-model | - | 632M | - | 85.7% |
fbnetv5-neural-architecture-search-for | - | - | - | 77.2% |
tiny-models-are-the-computational-saver-for | - | - | - | 85.75 |
mish-a-self-regularized-non-monotonic-neural | - | - | - | 79.8% |
differentiable-model-compression-via-pseudo | - | - | - | 82.0 |
graph-convolutions-enrich-the-self-attention | - | - | - | 82.8% |
model-soups-averaging-weights-of-multiple | - | 1843M | - | 90.94% |
torchdistill-a-modular-configuration-driven | - | - | - | 70.52% |
filter-response-normalization-layer | - | - | - | 77.21% |
scarletnas-bridging-the-gap-between | - | 6M | - | 75.6% |
cas-vit-convolutional-additive-self-attention | - | 21.76M | - | 84.1% |
lets-keep-it-simple-using-simple | - | 9.5M | - | 74.17 |
fast-autoaugment | - | - | - | 80.6% |
self-training-with-noisy-student-improves | - | 12M | - | 84.1% |
uninet-unified-architecture-search-with | - | 11.9M | - | 79.1% |
lets-keep-it-simple-using-simple | - | 1.5M | - | 69.11 |
involution-inverting-the-inherence-of | - | 9.2M | - | 75.9% |
not-all-images-are-worth-16x16-words-dynamic | - | - | - | 80.43% |
uninet-unified-architecture-search-with | - | 73.5M | - | 85.2% |
omnivore-a-single-model-for-many-visual | - | - | - | 86.0% |
densely-connected-convolutional-networks | - | - | - | 76.2% |
efficientnet-rethinking-model-scaling-for | - | 5.3M | - | 76.3% |
rethinking-local-perception-in-lightweight | - | 12.3M | - | 81.6% |
compact-global-descriptor-for-neural-networks | - | 4.26M | - | 72.56% |
improved-multiscale-vision-transformers-for | - | 667M | - | 88% |
revisiting-a-knn-based-image-classification | - | - | - | 79.8% |
visual-representation-learning-from-unlabeled | - | - | - | 85% |
efficientvit-enhanced-linear-attention-for | - | 64M | - | 86% |
billion-scale-semi-supervised-learning-for | - | 42M | - | 83.4% |
incepformer-efficient-inception-transformer | - | 14.0M | - | 80.5% |
fast-vision-transformers-with-hilo-attention | - | 87M | - | 84.7% |
rexnet-diminishing-representational | - | 34.8M | - | 84.5% |
identity-mappings-in-deep-residual-networks | - | - | - | 79.9% |
mixpro-data-augmentation-with-maskmix-and | - | - | - | 81.2% |
maxvit-multi-axis-vision-transformer | - | - | - | 86.7% |
asymmnet-towards-ultralight-convolution | - | 3.1M | - | 68.4% |
Model 356 | - | 62M | - | 63.3 |
three-things-everyone-should-know-about | - | - | - | 84.1% |
mobilevitv3-mobile-friendly-vision | - | 1.2M | - | 70.98% |
edgenext-efficiently-amalgamated-cnn | - | 1.3M | - | 71.2% |
distilled-gradual-pruning-with-pruned-fine | - | 1.03M | - | 65.59% |
sparse-mlp-for-image-recognition-is-self | - | 65.9M | - | 83.4% |
tokens-to-token-vit-training-vision | - | 64.4M | - | 82.6% |
bias-loss-for-mobile-neural-networks | - | 5.5M | - | 76.2% |
metaformer-baselines-for-vision | - | 27M | - | 85.0% |
transboost-improving-the-best-imagenet | - | - | - | 79.03% |
graph-convolutions-enrich-the-self-attention | - | - | - | 83% |
an-improved-one-millisecond-mobile-backbone | - | 10.1M | - | 80.0% |
torchdistill-a-modular-configuration-driven | - | - | - | 71.56% |
on-the-performance-analysis-of-momentum | - | - | - | 67.74% |
hrformer-high-resolution-transformer-for | - | 50.3M | - | 82.8% |
collaboration-of-experts-achieving-80-top-1 | - | - | - | 81.5% |
metaformer-baselines-for-vision | - | 40M | - | 86.4% |
incorporating-convolution-designs-into-visual | - | 24.2M | - | 83.3% |
bossnas-exploring-hybrid-cnn-transformers | - | - | - | 82.2% |
efficientnetv2-smaller-models-and-faster | - | 22M | - | 84.9% |
unireplknet-a-universal-perception-large | - | - | - | 86.4% |
torchdistill-a-modular-configuration-driven | - | - | - | 70.93% |
densenets-reloaded-paradigm-shift-beyond | - | 50M | - | 83.7% |
deepmad-mathematical-architecture-design-for | - | 89M | - | 84% |
involution-inverting-the-inherence-of | - | 25.6M | - | 79.1% |
uninet-unified-architecture-search-with | - | 73.5M | - | 84.2% |
automix-unveiling-the-power-of-mixup | - | 25.6M | - | 79.25% |
fixing-the-train-test-resolution-discrepancy | 62G | 829M | - | 86.4% |
wavemix-lite-a-resource-efficient-neural | - | - | - | 74.93% |
fast-autoaugment | - | - | - | 77.6% |
activemlp-an-mlp-like-architecture-with | - | 27.2M | - | 82% |
a-dot-product-attention-free-transformer | - | 20.3M | - | 80.2% |
uninet-unified-architecture-search-with | - | 14M | - | 80.4% |
going-deeper-with-image-transformers | - | 185.9M | - | 85.8% |
maxvit-multi-axis-vision-transformer | - | - | - | 89.53% |
maxup-a-simple-way-to-improve-generalization | - | 87.42M | - | 85.8% |
scaling-local-self-attention-for-parameter | - | 87M | - | 85.5% |
global-context-vision-transformers | - | 20M | - | 82.0% |
meal-v2-boosting-vanilla-resnet-50-to-80-top | - | - | - | 80.67% |
ghostnet-more-features-from-cheap-operations | - | 2.6M | - | 66.2% |
self-training-with-noisy-student-improves | - | 5.3M | - | 78.8% |
convmlp-hierarchical-convolutional-mlps-for | - | 9M | - | 76.8 |
Model 398 | - | - | - | 66.04 |
efficientnet-rethinking-model-scaling-for | - | 43M | - | 84% |
semi-supervised-learning-of-visual-features | - | - | - | 66.5% |
multigrain-a-unified-image-embedding-for | - | - | - | 83.1% |
metaformer-baselines-for-vision | - | 39M | - | 84.5% |
rethinking-and-improving-relative-position | - | 6M | - | 73.7% |
perceiver-general-perception-with-iterative | - | 44.9M | - | 78% |
rexnet-diminishing-representational | - | 34.7M | - | 82.8% |
maxvit-multi-axis-vision-transformer | - | 212M | - | 85.17% |
eca-net-efficient-channel-attention-for-deep | - | 57.40M | - | 78.92% |
vision-transformer-with-deformable-attention | - | 50M | - | 83.7% |
neighborhood-attention-transformer | - | 51M | - | 83.7% |
unconstrained-open-vocabulary-image | - | - | - | 88.21% |
bottleneck-transformers-for-visual | - | 44.4M | - | 80% |
parametric-contrastive-learning | - | - | - | 81.8% |
metaformer-baselines-for-vision | - | 39M | - | 85.8% |
which-transformer-to-favor-a-comparative | - | - | - | 81.96% |
an-algorithm-for-routing-vectors-in-sequences | - | 312.8M | - | 86.7% |
global-context-vision-transformers | - | 28M | - | 83.4% |
torchdistill-a-modular-configuration-driven | - | - | - | 71.71% |
maxvit-multi-axis-vision-transformer | - | - | - | 86.4% |
deformable-kernels-adapting-effective | - | - | - | 78.5% |
internimage-exploring-large-scale-vision | - | 97M | - | 84.9% |
resnest-split-attention-networks | - | 48M | - | 83.0% |
sparse-mlp-for-image-recognition-is-self | - | 24.1M | - | 81.9% |
autodropout-learning-dropout-patterns-to | - | - | - | 80.3% |
coatnet-marrying-convolution-and-attention | - | 75M | - | 84.1% |
adaptively-connected-neural-networks | - | 29.38M | - | 77.5% |
mixpro-data-augmentation-with-maskmix-and | - | - | - | 83.7% |
rexnet-diminishing-representational | - | 4.8M | - | 77.9% |
boosting-discriminative-visual-representation | - | 11.7M | - | 72.33% |
cas-vit-convolutional-additive-self-attention | - | 3.2M | - | 78.7% |
an-improved-one-millisecond-mobile-backbone | - | 14.8M | - | 81.4% |
the-information-pathways-hypothesis | - | - | - | 81.89% |
enhance-the-visual-representation-via | - | - | - | 87.02% |
polyloss-a-polynomial-expansion-perspective-1 | - | - | - | 87.2% |
wave-vit-unifying-wavelet-and-transformers | - | 22.7M | - | 83.9% |
global-context-vision-transformers | - | 90M | - | 84.5% |
mixpro-data-augmentation-with-maskmix-and | - | - | - | 82.8% |
efficient-multi-order-gated-aggregation | - | 83M | - | 84.7% |
multimodal-autoregressive-pre-training-of | - | - | - | 88.5% |
aggregating-nested-transformers | - | 17M | - | 81.5% |
fairnas-rethinking-evaluation-fairness-of | - | 4.4M | - | 74.69% |
scarletnas-bridging-the-gap-between | 12G | 27.8M | 0.42G | 82.3% |
coatnet-marrying-convolution-and-attention | - | - | - | 87.6% |
self-training-with-noisy-student-improves | - | 7.8M | - | 81.5% |
gtp-vit-efficient-vision-transformers-via | - | - | - | 82.8% |
Model 445 | - | - | - | 81.12% |
rethinking-local-perception-in-lightweight | - | 7.2M | - | 79.8% |
metaformer-baselines-for-vision | - | 26M | - | 83.6% |
maxvit-multi-axis-vision-transformer | - | - | - | 88.69% |
metaformer-baselines-for-vision | - | 100M | - | 85.7% |
bias-loss-for-mobile-neural-networks | - | 7.1M | - | 77.1% |
high-performance-large-scale-image | - | 71.5M | - | 83.6% |
metaformer-baselines-for-vision | - | 57M | - | 84.5% |
augmenting-convolutional-networks-with | - | 188.6M | - | 84.1% |
revbifpn-the-fully-reversible-bidirectional | - | 3.42M | - | 72.8% |
metaformer-baselines-for-vision | - | 40M | - | 84.1% |
cvt-introducing-convolutions-to-vision | - | - | - | 81.6% |
reversible-column-networks | - | 2158M | - | 90.0% |
wave-vit-unifying-wavelet-and-transformers | - | 57.5M | - | 85.5% |
tokens-to-token-vit-training-vision | - | 21.5M | - | 81.5% |
metaformer-baselines-for-vision | - | 27M | - | 83.0% |
attentional-feature-fusion | - | 34.7M | - | 80.22% |
three-things-everyone-should-know-about | - | - | - | 84.1% |
dynamic-convolution-attention-over | - | 11.1M | - | 74.4% |
pvtv2-improved-baselines-with-pyramid-vision | - | 3.4M | - | 70.5% |
2103-15358 | - | 39.8M | - | 82.9% |
adaptive-split-fusion-transformer | - | 19.3M | - | 82.7% |
multiscale-vision-transformers | - | 72.9M | - | 84.8% |
boosting-discriminative-visual-representation | - | 44.6M | - | 81.08% |
Model 469 | - | - | - | 78.15% |
generalized-parametric-contrastive-learning | - | - | - | 84.0% |
davit-dual-attention-vision-transformers | - | 28.3M | - | 82.8% |
contextual-classification-using-self | - | - | - | 77.0% |
aggregating-nested-transformers | - | 38M | - | 83.3% |
learned-queries-for-efficient-local-attention | - | 25M | - | 83.2% |
generalized-parametric-contrastive-learning | - | - | - | 79.7% |
gated-attention-coding-for-training-high | - | - | - | 70.42 |
maxvit-multi-axis-vision-transformer | - | - | - | 88.46% |
dilated-neighborhood-attention-transformer | - | - | - | 87.4% |
model-soups-averaging-weights-of-multiple | - | 2440M | - | 90.98% |
uniformer-unifying-convolution-and-self | - | 100M | - | 86.3% |
2103-14899 | - | 43.3M | - | 82.5% |
dat-spatially-dynamic-vision-transformer-with | - | 24M | - | 83.9% |
global-filter-networks-for-image | - | 54M | - | 82.9% |
nasvit-neural-architecture-search-for | - | - | - | 79.7% |
unireplknet-a-universal-perception-large | - | - | - | 83.2% |
sparse-mlp-for-image-recognition-is-self | - | 48.6M | - | 83.1% |
colornet-investigating-the-importance-of | - | - | - | 82.35% |
volo-vision-outlooker-for-visual-recognition | - | 27M | - | 85.2% |
fastervit-fast-vision-transformers-with | - | 75.9M | - | 84.2% |
exploring-the-limits-of-weakly-supervised | - | 194M | - | 84.2% |
nasvit-neural-architecture-search-for | - | - | - | 80.5% |
deit-iii-revenge-of-the-vit | - | - | - | 85.7% |
compress-image-to-patches-for-vision | - | - | - | 77% |
discrete-representations-strengthen-vision-1 | - | - | - | 85.07% |
which-transformer-to-favor-a-comparative | - | - | - | 83.65% |
mamba2d-a-natively-multi-dimensional-state | - | - | - | 82.4% |
lip-local-importance-based-pooling | - | 42.9M | - | 79.33% |
coca-contrastive-captioners-are-image-text | - | 2100M | - | 91.0% |
masked-image-residual-learning-for-scaling-1 | - | 341M | - | 86.2% |
rethinking-the-design-principles-of-robust | - | 10.9M | - | 79.2% |
large-scale-learning-of-general-visual | - | 928M | - | 85.39% |
co-training-2-l-submodels-for-visual | - | - | - | 86.2% |
efficient-multi-order-gated-aggregation | - | 25M | - | 83.4% |
when-vision-transformers-outperform-resnets | - | 236M | - | 81.1% |
unireplknet-a-universal-perception-large | - | - | - | 88% |
rexnet-diminishing-representational | - | 16.5M | - | 83.2% |
dat-spatially-dynamic-vision-transformer-with | - | 93M | - | 84.9% |
spatial-group-wise-enhance-improving-semantic | - | 25.56M | - | 77.584% |
uniformer-unifying-convolution-and-self | - | 22M | - | 83.4% |
augmenting-sub-model-to-improve-main-model | - | 304M | - | 85.3% |
averaging-weights-leads-to-wider-optima-and | - | - | - | 78.44% |
metaformer-baselines-for-vision | - | 100M | - | 87.6% |
deit-iii-revenge-of-the-vit | - | - | - | 81.4% |
pay-attention-to-mlps | - | 73M | - | 81.6% |
the-effectiveness-of-mae-pre-pretraining-for | - | 6500M | - | 90.1% |
maxvit-multi-axis-vision-transformer | - | - | - | 88.7% |
xcit-cross-covariance-image-transformers | - | 48M | - | 85.6% |
visual-attention-network | - | - | - | 87% |
efficientnet-rethinking-model-scaling-for | - | 19M | - | 82.6% |
revisiting-resnets-improved-training-and | - | - | - | 83.8% |
multigrain-a-unified-image-embedding-for | - | - | - | 82.6% |
understanding-the-robustness-in-vision | - | 76.8M | - | 87.1% |
zen-nas-a-zero-shot-nas-for-high-performance | - | 5.7M | - | 78% |
learning-transferable-architectures-for | 1648G | 88.9M | 2.38G | 82.7% |
rexnet-diminishing-representational | - | 19M | - | 81.6% |
vision-gnn-an-image-is-worth-graph-of-nodes | - | 27.3M | - | 82.1% |
dat-spatially-dynamic-vision-transformer-with | - | 94M | - | 85.9% |
deit-iii-revenge-of-the-vit | - | - | - | 83.8% |
Model 529 | - | - | - | 67.63 |
bnn-bn-training-binary-neural-networks | - | - | - | 68.0% |
fixing-the-train-test-resolution-discrepancy-2 | - | - | - | 85.7% |
mambavision-a-hybrid-mamba-transformer-vision | - | 35.1M | - | 82.7% |
attentive-normalization | - | - | - | 81.87% |
alphanet-improved-training-of-supernet-with | - | - | - | 80.8% |
metaformer-baselines-for-vision | - | 27M | - | 83.7% |
semi-supervised-recognition-under-a-noisy-and | - | 76M | - | 85.1% |
hrformer-high-resolution-transformer-for | - | 8.0M | - | 78.5% |
which-transformer-to-favor-a-comparative | - | - | - | 81.09% |
transboost-improving-the-best-imagenet | - | 11.69M | - | 73.36% |
metaformer-baselines-for-vision | - | 26M | - | 85.0% |
a-large-batch-optimizer-reality-check | - | - | - | 75.92% |
lets-keep-it-simple-using-simple | - | 3M | - | 68.15 |
deep-residual-learning-for-image-recognition | - | 25M | - | 75.3% |
centroid-transformers-learning-to-abstract | - | 22.3M | - | 80.9% |
transnext-robust-foveal-visual-perception-for | - | 89.7M | - | 86.2% |
going-deeper-with-image-transformers | - | 89.5M | - | 85.3% |
dilated-neighborhood-attention-transformer | - | 200M | - | 87.5% |
supervised-contrastive-learning | - | - | - | 80.8% |
metaformer-baselines-for-vision | - | 57M | - | 86.1% |
asymmetric-masked-distillation-for-pre | - | 87M | - | 84.6% |
davit-dual-attention-vision-transformers | - | 87.9M | - | 84.6% |
tiny-models-are-the-computational-saver-for | - | - | - | 85.24 |
mobilevitv3-mobile-friendly-vision | - | - | - | 78.64% |
internimage-exploring-large-scale-vision | - | 3000M | - | 90.1% |
deit-iii-revenge-of-the-vit | - | - | - | 83.1% |
when-shift-operation-meets-vision-transformer | - | 28M | - | 81.7% |
grafit-learning-fine-grained-image | - | - | - | 79.6% |
mnasnet-platform-aware-neural-architecture | - | 3.9M | - | 75.2% |
tokenlearner-what-can-8-learned-tokens-do-for | - | - | - | 87.07% |
neural-architecture-transfer | - | 9.1M | - | 80.5% |
distilling-out-of-distribution-robustness-1 | - | - | - | 81.9% |
dynamicvit-efficient-vision-transformers-with | - | 57.1M | - | 83.9 |
rethinking-spatial-dimensions-of-vision | - | 10.6M | - | 79.1% |
bag-of-tricks-for-image-classification-with | - | 25M | - | 77.16% |
deit-iii-revenge-of-the-vit | - | - | - | 84.9% |
mnasnet-platform-aware-neural-architecture | - | 5.2M | 0.0403G | 76.7% |
fixing-the-train-test-resolution-discrepancy-2 | - | 19M | - | 85.9% |
efficientnet-rethinking-model-scaling-for | - | 7.8M | - | 78.8% |
Model 569 | - | - | - | 81.97% |
localvit-bringing-locality-to-vision | - | 4.3M | - | 72.5% |
tinyvit-fast-pretraining-distillation-for | - | 21M | - | 86.2% |
single-path-nas-designing-hardware-efficient | - | - | - | 74.96% |
volo-vision-outlooker-for-visual-recognition | - | 86M | - | 86.3% |
levit-a-vision-transformer-in-convnet-s | - | 17.8M | - | 81.6% |
multigrain-a-unified-image-embedding-for | - | - | - | 75.1% |
fixing-the-train-test-resolution-discrepancy-2 | - | 480M | - | 88.5% |
dynamic-convolution-attention-over | - | 2.8M | - | 64.9% |
deit-iii-revenge-of-the-vit | - | 304.8M | - | 85.8% |
high-performance-large-scale-image | - | 438.4M | - | 86.5% |
deepvit-towards-deeper-vision-transformer | - | 55M | - | 82.2% |
biformer-vision-transformer-with-bi-level | - | - | - | 81.4% |
metaformer-baselines-for-vision | - | 26M | - | 85.4% |
selective-kernel-networks | - | 48.9M | - | 79.81% |
going-deeper-with-image-transformers | - | 17.3M | - | 82.2% |
unireplknet-a-universal-perception-large | - | - | - | 87.9% |
efficientnetv2-smaller-models-and-faster | - | - | - | 85.7% |
visformer-the-vision-friendly-transformer | - | 40.2M | - | 82.2% |
boosting-discriminative-visual-representation | - | 21.8M | - | 76.35% |
online-training-through-time-for-spiking | - | - | - | 65.15% |
pattern-attention-transformer-with-doughnut | - | - | - | 83.1% |
unconstrained-open-vocabulary-image | - | - | - | 83.46% |
improved-multiscale-vision-transformers-for | - | 667M | - | 88.8% |
spatial-channel-token-distillation-for-vision | - | 22.2M | - | 75.7% |
transboost-improving-the-best-imagenet | - | 21.8M | - | 76.70% |
tresnet-high-performance-gpu-dedicated | - | 77M | - | 84.3% |
metaformer-baselines-for-vision | - | 99M | - | 86.4% |
vision-gnn-an-image-is-worth-graph-of-nodes | - | 51.7M | - | 83.1% |
maxvit-multi-axis-vision-transformer | - | - | - | 86.34% |
mobilevit-light-weight-general-purpose-and | - | 2.3M | - | 74.8% |
mixpro-data-augmentation-with-maskmix-and | - | - | - | 84.1% |
visual-attention-network | - | 26.6M | - | 82.8% |
visual-attention-network | - | 90M | - | 86.3% |
unireplknet-a-universal-perception-large | - | - | - | 83.9% |
levit-a-vision-transformer-in-convnet-s | - | 4.7M | - | 75.7% |
lambdanetworks-modeling-long-range-1 | - | 35M | - | 84.0% |
rckd-response-based-cross-task-knowledge | - | 3M | - | 78.6 |
incorporating-convolution-designs-into-visual | - | - | - | 78.8% |
greedynas-towards-fast-one-shot-nas-with | - | 4.7M | - | 76.2% |
deeper-vs-wider-a-revisit-of-transformer | - | - | - | 84.2 |
2103-15358 | - | 6.7M | - | 76.7% |
fbnetv5-neural-architecture-search-for | - | - | - | 81.8% |
x-volution-on-the-unification-of-convolution | - | - | - | 75% |
metaformer-baselines-for-vision | - | 56M | - | 86.6% |
parametric-contrastive-learning | - | - | - | 81.3% |
re-labeling-imagenet-from-single-to-multi | - | 4.8M | - | 78.4% |
self-training-with-noisy-student-improves | - | 43M | - | 86.4% |
transboost-improving-the-best-imagenet | - | 44.55M | - | 79.86% |
swin-transformer-v2-scaling-up-capacity-and | - | 3000M | - | 90.17% |
deit-iii-revenge-of-the-vit | - | 87M | - | 85.0% |
mlp-mixer-an-all-mlp-architecture-for-vision | - | - | - | 85.3% |
container-context-aggregation-network | - | 22.1M | - | 82.7% |
graph-convolutions-enrich-the-self-attention | - | - | - | 81.1% |
contextual-convolutional-neural-networks | - | 60M | - | 79.03% |
tokens-to-token-vit-training-vision | - | - | - | 82.3% |
conformer-local-features-coupling-global | - | 83.3M | - | 84.1% |
differentiable-spike-rethinking-gradient | - | - | - | 71.24 |
which-transformer-to-favor-a-comparative | - | - | - | 78.42% |
sequencer-deep-lstm-for-image-classification | - | 54M | - | 83.4% |
mobilenetv4-universal-models-for-the-mobile | - | - | - | 80.7% |
which-transformer-to-favor-a-comparative | - | - | - | 84.91% |
collaboration-of-experts-achieving-80-top-1 | - | - | - | 80% |
fixing-the-train-test-resolution-discrepancy-2 | - | 19M | - | 84.0% |
co-training-2-l-submodels-for-visual | - | - | - | 85.8% |
efficientvit-enhanced-linear-attention-for | - | 49M | - | 84.2% |
designing-network-design-spaces | - | 6.1M | - | 75.5% |
beit-bert-pre-training-of-image-transformers | - | 331M | - | 88.60% |
lets-keep-it-simple-using-simple | - | 9.5M | - | 81.24 |
deeper-vs-wider-a-revisit-of-transformer | - | - | - | 87.1 |
fastervit-fast-vision-transformers-with | - | 424.6M | - | 85.4% |
tiny-models-are-the-computational-saver-for | - | - | - | 86.24 |
cas-vit-convolutional-additive-self-attention | - | 12.42M | - | 83.0% |
compounding-the-performance-improvements-of | - | - | - | 84.2% |
swin-transformer-hierarchical-vision | - | 197M | - | 87.3% |
towards-all-in-one-pre-training-via | - | - | - | 89.6% |
colornet-investigating-the-importance-of | - | - | - | 84.32% |
nasvit-neural-architecture-search-for | - | - | - | 81.8% |
boosting-discriminative-visual-representation | - | 25.6M | - | 79.41% |
muxconv-information-multiplexing-in | - | 3.4M | - | 75.3% |
sp-vit-learning-2d-spatial-priors-for-vision | - | - | - | 86.3% |
localvit-bringing-locality-to-vision | - | 13.5M | - | 78.2% |
efficientnetv2-smaller-models-and-faster | - | - | - | 85.1% |
rethinking-spatial-dimensions-of-vision | - | 23.5M | - | 81.9% |
eca-net-efficient-channel-attention-for-deep | - | 24.37M | - | 77.48% |
progressive-neural-architecture-search | - | 86.1M | 2.5G | 82.9% |
deit-iii-revenge-of-the-vit | - | 22M | - | 83.4% |
levit-a-vision-transformer-in-convnet-s | - | 8.8M | - | 79.6% |
transformer-in-transformer | - | 65.6M | - | 83.9% |
when-vision-transformers-outperform-resnets | - | 64M | - | 79% |
resnest-split-attention-networks | - | 111M | - | 84.5% |
deit-iii-revenge-of-the-vit | - | - | - | 85.2% |
bottleneck-transformers-for-visual | - | 28.02M | - | 79.4% |
fbnet-hardware-aware-efficient-convnet-design | - | 5.5M | - | 74.9% |
dilated-neighborhood-attention-transformer | - | 197M | - | 87.4% |
maxvit-multi-axis-vision-transformer | - | - | - | 89.41% |
convmlp-hierarchical-convolutional-mlps-for | - | 17.4M | - | 79% |
asymmetric-masked-distillation-for-pre | - | 22M | - | 82.1% |
masked-autoencoders-are-scalable-vision | - | 656M | - | 87.8% |
mobilenetv4-universal-models-for-the-mobile | - | - | - | 73.8% |
circumventing-outliers-of-autoaugment-with | - | 66M | - | 85.5% |
asymmnet-towards-ultralight-convolution | - | 5.99M | - | 75.4% |
hvt-a-comprehensive-vision-framework-for | - | - | - | 80.1% |
190409460 | - | - | - | 77.8% |
mobilevitv3-mobile-friendly-vision | - | 1.4M | - | 72.33% |
shufflenet-an-extremely-efficient | - | - | - | 70.9% |
autoformer-searching-transformers-for-visual | - | 22.9M | - | 81.7% |
going-deeper-with-image-transformers | - | 270.9M | - | 86.1% |
transboost-improving-the-best-imagenet | - | 5.29M | - | 78.60% |
learning-visual-representations-for-transfer-1 | - | - | - | 76.71% |
which-transformer-to-favor-a-comparative | - | - | - | 78.34% |
meta-pseudo-labels | - | 390M | - | 90% |
unireplknet-a-universal-perception-large | - | - | - | 81.6% |
vitae-vision-transformer-advanced-by | - | 4.8M | - | 76.8% |
pvtv2-improved-baselines-with-pyramid-vision | - | 13.1M | - | 78.7% |
asymmnet-towards-ultralight-convolution | - | 2.8M | - | 69.2% |
scaling-vision-with-sparse-mixture-of-experts | - | 656M | - | 88.08% |
efficient-multi-order-gated-aggregation | - | 5.2M | - | 80% |
tinyvit-fast-pretraining-distillation-for | - | 5.4M | - | 80.7% |
aggregating-nested-transformers | - | 68M | - | 83.8% |
exploring-the-limits-of-weakly-supervised | - | 88M | - | 82.2% |
co-training-2-l-submodels-for-visual | - | - | - | 83.1% |
cvt-introducing-convolutions-to-vision | - | 18M | - | 82.2% |
uniformer-unifying-convolution-and-self | - | 100M | - | 85.6% |
swin-transformer-hierarchical-vision | - | 29M | - | 81.3% |
resmlp-feedforward-networks-for-image | - | 15.4M | - | 77.8% |
metaformer-baselines-for-vision | - | 40M | - | 85.4% |
a-simple-episodic-linear-probe-improves | - | - | - | 76.13 |
vicinity-vision-transformer | - | 61.8M | - | 84.1% |
patches-are-all-you-need-1 | - | 51.6M | - | 82.20 |
dicenet-dimension-wise-convolutions-for | - | - | - | 75.1% |
perceiver-general-perception-with-iterative | - | - | - | 76.4% |
davit-dual-attention-vision-transformers | - | 196.8M | - | 87.5% |
omnivore-a-single-model-for-many-visual | - | - | - | 85.3% |
autodropout-learning-dropout-patterns-to | - | - | - | 78.7% |
a-convnet-for-the-2020s | - | 198M | - | 85.5% |
efficientvit-enhanced-linear-attention-for | - | - | - | 83.5% |
resnet-strikes-back-an-improved-training | - | 22M | - | 80.4% |
eva-exploring-the-limits-of-masked-visual | - | 1000M | - | 89.7% |
multiscale-deep-equilibrium-models | - | 81M | - | 79.2% |
when-shift-operation-meets-vision-transformer | - | 50M | - | 82.8% |
exploring-randomly-wired-neural-networks-for | - | 5.6M | - | 74.7% |
splitnet-divide-and-co-training | - | 98M | - | 83.6% |
three-things-everyone-should-know-about | - | - | - | 82.6% |
alphanet-improved-training-of-supernet-with | - | - | - | 79.1% |
dynamic-convolution-attention-over | - | 7M | - | 72.8% |
x-volution-on-the-unification-of-convolution | - | - | - | 76.6% |
global-context-vision-transformers | - | 12M | - | 79.8% |
uninet-unified-architecture-search-with-1 | - | 117M | - | 87.4% |
shufflenet-v2-practical-guidelines-for | - | - | - | 75.4% |
dropblock-a-regularization-method-for | - | - | - | 78.35% |
greedynas-towards-fast-one-shot-nas-with | - | 5.2M | - | 76.8% |
alphanet-improved-training-of-supernet-with | - | - | - | 79.4% |
differentially-private-image-classification | - | - | - | 88.9% |
escaping-the-big-data-paradigm-with-compact | - | 22.36M | - | - |
visual-attention-network | - | 13.9M | - | 81.1% |
fbnetv5-neural-architecture-search-for | - | - | - | 81.7% |
multigrain-a-unified-image-embedding-for | - | - | - | 78.2% |
a-fast-knowledge-distillation-framework-for | - | - | - | 81.9% |
fast-vision-transformers-with-hilo-attention | - | 49M | - | 83.3% |
resnet-strikes-back-an-improved-training | - | 60.2M | - | 82.4% |
scarletnas-bridging-the-gap-between | - | 6.7M | - | 76.9% |
fastvit-a-fast-hybrid-vision-transformer | - | - | - | 79.8% |
maxvit-multi-axis-vision-transformer | - | 31M | - | 83.62% |
an-improved-one-millisecond-mobile-backbone | - | 10.1M | - | 78.1% |
2103-14899 | - | 28.2M | - | 82.3% |
randaugment-practical-data-augmentation-with | - | 66M | - | 85% |
mixnet-mixed-depthwise-convolutional-kernels | - | 7.3M | - | 78.9% |
sliced-recursive-transformer-1 | - | 4.8M | - | 77.6% |
muxconv-information-multiplexing-in | - | 1.8M | - | 66.7% |
clcnet-rethinking-of-ensemble-modeling-with | - | - | - | 83.88% |
high-performance-large-scale-image | - | 132.6M | - | 84.7% |
model-rubik-s-cube-twisting-resolution-depth | - | 11.9M | - | 79.4% |
tokens-to-token-vit-training-vision | - | - | - | 81.9% |
dynamic-convolution-attention-over | - | 4.8M | - | 69.7% |
moat-alternating-mobile-convolution-and | - | 190M | - | 86.7% |
go-wider-instead-of-deeper | - | 40M | - | 79.49% |
repvgg-making-vgg-style-convnets-great-again | - | 55.77M | - | 78.5% |
metaformer-baselines-for-vision | - | 39M | - | 86.9% |
co-training-2-l-submodels-for-visual | - | - | - | 84.2% |
metaformer-baselines-for-vision | - | 26M | - | 84.1% |
davit-dual-attention-vision-transformers | - | 362M | - | 90.2% |
transboost-improving-the-best-imagenet | - | 5.48M | - | 76.81% |
glit-neural-architecture-search-for-global | - | 96.1M | - | 82.3% |
pyramidal-convolution-rethinking | - | 42.3M | - | 81.49% |
spatial-channel-token-distillation-for-vision | - | 30.1M | - | 82.1% |
high-performance-large-scale-image | - | 316.1M | - | 85.9% |
multimodal-autoregressive-pre-training-of | - | 2700M | - | - |
vision-transformer-with-deformable-attention | - | 29M | - | 82.0% |
eca-net-efficient-channel-attention-for-deep | - | 3.34M | - | 72.56% |
maxvit-multi-axis-vision-transformer | - | - | - | 89.12% |
three-things-everyone-should-know-about | - | - | - | 84.3% |
twins-revisiting-spatial-attention-design-in | - | 99.2M | - | 83.7% |
visual-attention-network | - | - | - | 85.7% |
autoformer-searching-transformers-for-visual | - | 5.7M | - | 74.7% |
fixing-the-train-test-resolution-discrepancy | - | - | - | 79.8% |
moat-alternating-mobile-convolution-and | - | 27.8M | - | 83.3% |
revbifpn-the-fully-reversible-bidirectional | - | 10.6M | - | 79% |
designing-network-design-spaces | - | 11.2M | - | 78% |
tiny-models-are-the-computational-saver-for | - | - | - | 83.52 |
efficientvit-enhanced-linear-attention-for | - | 53M | - | 84.5% |
rethinking-spatial-dimensions-of-vision | - | 73.8M | - | 84% |
visual-attention-network | - | 200M | - | 87.8% |
ghostnetv3-exploring-the-training-strategies | - | - | - | 69.4% |
edgenext-efficiently-amalgamated-cnn | - | 5.6M | - | 79.4% |
edgeformer-improving-light-weight-convnets-by | - | 5M | - | 78.63% |
billion-scale-semi-supervised-learning-for | - | 88M | - | 84.3% |
vision-gnn-an-image-is-worth-graph-of-nodes | - | 92.6M | - | 83.7% |
global-context-vision-transformers | - | 51M | - | 84.0% |
cvt-introducing-convolutions-to-vision | - | - | - | 87.7% |
fixing-the-train-test-resolution-discrepancy-2 | - | 66M | - | 87.1% |
autodropout-learning-dropout-patterns-to | - | - | - | 77.5% |
maxvit-multi-axis-vision-transformer | - | - | - | 85.72% |
designing-network-design-spaces | - | 39.2M | - | 79.9% |
ghostnet-more-features-from-cheap-operations | - | 6.5M | - | 74.1% |
involution-inverting-the-inherence-of | - | 12.4M | - | 77.6% |
rexnet-diminishing-representational | - | 9.7M | - | 80.3% |
adversarial-examples-improve-image | - | 88M | - | 85.5% |
gtp-vit-efficient-vision-transformers-via | - | - | - | 81.5% |
Model 788 | - | - | - | 81.16% |
resmlp-feedforward-networks-for-image | - | 45M | - | 79.7% |
unireplknet-a-universal-perception-large | - | - | - | 80.2% |
biformer-vision-transformer-with-bi-level | - | - | - | 85.4% |
tokenlearner-what-can-8-learned-tokens-do-for | - | 460M | - | 88.87% |
lets-keep-it-simple-using-simple | - | 5.7M | - | 71.94 |
rest-an-efficient-transformer-for-visual | - | 51.63M | - | 83.6% |
ghostnet-more-features-from-cheap-operations | - | 7.3M | - | 75.7% |
co-training-2-l-submodels-for-visual | - | - | - | 85.0% |
inception-v4-inception-resnet-and-the-impact | - | 55.8M | - | 80.1% |
deep-residual-learning-for-image-recognition | - | - | - | 78.57% |
distilled-gradual-pruning-with-pruned-fine | - | 1.15M | - | 65.22 |
sliced-recursive-transformer-1 | - | 4M | - | 74.0% |
beit-bert-pre-training-of-image-transformers | - | 86M | - | 86.3% |
metaformer-baselines-for-vision | - | 56M | - | 87.5% |
designing-network-design-spaces | - | 4.3M | - | 74.1% |
Model 804 | - | - | - | 78.62% |
localvit-bringing-locality-to-vision | - | 6.3M | - | 75.9% |
metaformer-baselines-for-vision | - | 56M | - | 86.2% |
coatnet-marrying-convolution-and-attention | - | - | - | 88.52% |
tinyvit-fast-pretraining-distillation-for | - | 21M | - | 83.1% |
debiformer-vision-transformer-with-deformable | - | - | - | 81.9% |
metaformer-baselines-for-vision | - | 100M | - | 87.0% |
going-deeper-with-image-transformers | - | 38.6M | - | 84.8% |
drawing-multiple-augmentation-samples-per | - | 377.2M | - | 86.78% |
multimodal-autoregressive-pre-training-of | - | - | - | 89.5% |
scaling-vision-transformers-to-22-billion | - | 307M | - | 89.6% |
metaformer-is-actually-what-you-need-for | - | 73M | - | 82.5% |
visual-attention-network | - | 4.1M | - | 75.4% |
gtp-vit-efficient-vision-transformers-via | - | - | - | 79.5% |
sliced-recursive-transformer-1 | - | 21M | - | 83.8% |
ghostnet-more-features-from-cheap-operations | - | 5.2M | - | 73.9% |
lip-local-importance-based-pooling | - | 25.8M | - | 78.15% |
elsa-enhanced-local-self-attention-for-vision | - | 27M | - | 84.7% |
mnasnet-platform-aware-neural-architecture | - | 4.8M | - | 75.6% |
glit-neural-architecture-search-for-global | - | 7.2M | - | 76.3% |
autoformer-searching-transformers-for-visual | - | 54M | - | 82.4% |
which-transformer-to-favor-a-comparative | - | - | - | 71.53% |
fixing-the-train-test-resolution-discrepancy-2 | - | 7.8M | - | 82.6% |
automix-unveiling-the-power-of-mixup | - | 11.7M | - | 72.05% |
deepvit-towards-deeper-vision-transformer | - | - | - | 83.1% |
unireplknet-a-universal-perception-large | - | - | - | 77% |
knowledge-distillation-a-good-teacher-is | - | - | - | 82.8% |
cas-vit-convolutional-additive-self-attention | - | 5.76M | - | 81.1% |
mobilevitv3-mobile-friendly-vision | - | 3M | - | 76.55% |
internimage-exploring-large-scale-vision | - | 30M | - | 83.5% |
meta-pseudo-labels | - | - | - | 83.2% |
training-data-efficient-image-transformers | - | 87M | - | 85.2% |
mobilevitv3-mobile-friendly-vision | - | 2.5M | - | 76.7% |
vicinity-vision-transformer | - | 61.8M | - | 84.7% |
a-dot-product-attention-free-transformer | - | 23M | - | 80.1% |
regularized-evolution-for-image-classifier | - | 469M | - | 83.9% |
an-image-is-worth-16x16-words-transformers-1 | - | - | - | 88.55% |
coatnet-marrying-convolution-and-attention | - | 168M | - | 84.5% |
evo-vit-slow-fast-token-evolution-for-dynamic | - | 39.6M | - | 82.2% |
swin-transformer-v2-scaling-up-capacity-and | - | 88M | - | 87.1% |
sequencer-deep-lstm-for-image-classification | - | 38M | - | 82.8% |
maxvit-multi-axis-vision-transformer | - | - | - | 88.51% |
cspnet-a-new-backbone-that-can-enhance | - | 20.5M | - | 79.8% |
clcnet-rethinking-of-ensemble-modeling-with | - | - | - | 86.46% |
spatial-group-wise-enhance-improving-semantic | - | 44.55M | - | 78.798% |
2103-15358 | - | 79M | - | 81.9% |
mobilenetv4-universal-models-for-the-mobile | - | - | - | 83.4% |
efficient-multi-order-gated-aggregation | - | 44M | - | 84.3% |
revbifpn-the-fully-reversible-bidirectional | - | 82M | - | 83.7% |
a-fast-knowledge-distillation-framework-for | - | - | - | 80.1% |
uninet-unified-architecture-search-with | - | 22.5M | - | 82.7% |
collaboration-of-experts-achieving-80-top-1 | - | 95.3M | - | 80.7% |
self-training-with-noisy-student-improves | 51800G | 480M | - | 88.4% |
gpipe-efficient-training-of-giant-neural | - | - | - | 84.4% |
proxylessnas-direct-neural-architecture | - | 4.0M | - | 74.6% |
densenets-reloaded-paradigm-shift-beyond | - | 24M | - | 82.8% |
efficient-multi-order-gated-aggregation | - | 181M | - | 87.8% |
go-wider-instead-of-deeper | - | 63M | - | 80.09% |
a-convnet-for-the-2020s | - | 29M | - | 82.1% |
rethinking-the-design-principles-of-robust | - | 91.8M | - | 82.7% |
densenets-reloaded-paradigm-shift-beyond | - | 186M | - | 84.8% |
deit-iii-revenge-of-the-vit | - | - | - | 86.7% |
metaformer-baselines-for-vision | - | 100M | - | 84.8% |
training-data-efficient-image-transformers | - | 5M | - | 76.6% |
vitae-vision-transformer-advanced-by | - | 6.5M | - | 77.9% |
harmonic-convolutional-networks-based-on | - | 88.2M | - | 82.85% |
bottleneck-transformers-for-visual | - | 75.1M | - | 84.7% |
bottleneck-transformers-for-visual | - | - | - | 84.2% |
the-effectiveness-of-mae-pre-pretraining-for | - | - | - | 88.8% |
resmlp-feedforward-networks-for-image | - | 30M | - | 80.8% |
hyenapixel-global-image-context-with | - | - | - | 83.6% |
mambavision-a-hybrid-mamba-transformer-vision | - | 50.1M | - | 83.3% |
resnet-strikes-back-an-improved-training | - | 60.2M | - | 81.8% |
multiscale-vision-transformers | - | 37M | - | 83.0% |
multimodal-autoregressive-pre-training-of | - | 300M | - | 86.6% |
sharpness-aware-minimization-for-efficiently-1 | - | 480M | - | 88.61% |
bottleneck-transformers-for-visual | - | 54.7M | - | 82.8% |
internimage-exploring-large-scale-vision | - | 223M | - | 87.7% |
nasvit-neural-architecture-search-for | - | - | - | 82.9% |
ghostnetv3-exploring-the-training-strategies | - | - | - | 77.1% |
fbnetv5-neural-architecture-search-for | - | - | - | 82.6% |
lets-keep-it-simple-using-simple | - | 5.7M | - | 79.12 |
levit-a-vision-transformer-in-convnet-s | - | 10.4M | - | 80% |
transnext-robust-foveal-visual-perception-for | - | 49.7M | - | 84.7% |
the-effectiveness-of-mae-pre-pretraining-for | - | 2000M | - | 89.8% |
metaformer-baselines-for-vision | - | 57M | - | 86.9% |
hyenapixel-global-image-context-with | - | - | - | 83.2% |
meta-pseudo-labels | 95040G | 480M | - | 90.2% |
improved-multiscale-vision-transformers-for | - | 24M | - | 82.3% |
coatnet-marrying-convolution-and-attention | - | 25M | - | 81.6% |
data2vec-a-general-framework-for-self-1 | - | 656M | - | 86.6% |
internimage-exploring-large-scale-vision | - | 1080M | - | 89.6% |
colormae-exploring-data-independent-masking | - | - | - | 83.8% |
maxvit-multi-axis-vision-transformer | - | - | - | 86.19% |
kolmogorov-arnold-transformer | - | 86.6M | - | 79.1 |
cvt-introducing-convolutions-to-vision | - | - | - | 83.3% |
vision-transformer-with-deformable-attention | - | 88M | - | 84.8% |
dilated-neighborhood-attention-transformer | - | 28M | - | 82.7% |
which-transformer-to-favor-a-comparative | - | - | - | 80.66% |
hiera-a-hierarchical-vision-transformer | - | - | - | 86.9% |
an-improved-one-millisecond-mobile-backbone | - | 7.8M | - | 79.1% |
metaformer-baselines-for-vision | - | 27M | - | 84.4% |
multigrain-a-unified-image-embedding-for | - | - | - | 81.3% |
xcit-cross-covariance-image-transformers | - | 26M | - | 85.1% |
residual-attention-network-for-image | - | - | - | 80.5% |
fairnas-rethinking-evaluation-fairness-of | - | 4.5M | - | 75.10% |
beyond-self-attention-external-attention | - | - | - | 81.7% |
minivit-compressing-vision-transformers-with | - | 47M | - | 85.5% |
learned-queries-for-efficient-local-attention | - | 16M | - | 81.7% |
token-labeling-training-a-85-5-top-1-accuracy | - | 151M | - | 86.4% |
meal-v2-boosting-vanilla-resnet-50-to-80-top | - | 25.6M | - | 81.72% |
how-to-use-dropout-correctly-on-residual | - | - | - | 79.152% |
densenets-reloaded-paradigm-shift-beyond | - | 186M | - | 85.8% |
cvt-introducing-convolutions-to-vision | - | - | - | 82.5% |
activemlp-an-mlp-like-architecture-with | - | 76.4M | - | 84.8% |
moga-searching-beyond-mobilenetv3 | - | 5.1M | 0.0304G | 75.9% |
mlp-mixer-an-all-mlp-architecture-for-vision | - | - | - | 87.94% |
fixing-the-train-test-resolution-discrepancy-2 | - | 12M | - | 85% |
mixnet-mixed-depthwise-convolutional-kernels | - | 5.0M | - | 77% |
learned-queries-for-efficient-local-attention | - | 56M | - | 83.7% |
ghostnetv3-exploring-the-training-strategies | - | - | - | 80.4% |
ghostnetv3-exploring-the-training-strategies | - | - | - | 79.1% |
190409460 | - | - | - | 79.03% |
searching-for-mobilenetv3 | - | 5.4M | - | 75.2% |
clcnet-rethinking-of-ensemble-modeling-with | - | - | - | 85.28% |
convit-improving-vision-transformers-with | - | 6M | - | 73.1% |
eca-net-efficient-channel-attention-for-deep | - | 42.49M | - | 78.65% |
metaformer-baselines-for-vision | - | 99M | - | 85.5% |
fastervit-fast-vision-transformers-with | - | 957.5M | - | 85.6% |
transboost-improving-the-best-imagenet | - | 22.05M | - | 83.67% |
fastervit-fast-vision-transformers-with | - | 31.4M | - | 82.1% |
hyenapixel-global-image-context-with | - | - | - | 85.2% |
which-transformer-to-favor-a-comparative | - | - | - | 82.22% |
resnet-strikes-back-an-improved-training | - | 25M | - | 78.1% |
balanced-binary-neural-networks-with-gated | - | - | - | 62.6% |
from-xception-to-nexception-new-design | - | - | - | 81.5% |
gswin-gated-mlp-vision-model-with | - | 21.8M | - | 81.71% |
semantic-aware-local-global-vision | - | 6.5M | - | 75.9% |
mobilenetv2-inverted-residuals-and-linear | - | 3.4M | - | 72% |
incepformer-efficient-inception-transformer | - | 39.3M | - | 83.6% |
volo-vision-outlooker-for-visual-recognition | - | 193M | - | 86.8% |
tiny-models-are-the-computational-saver-for | - | - | - | 85.74% |
which-transformer-to-favor-a-comparative | - | - | - | 83.09% |
metaformer-baselines-for-vision | - | 99M | - | 88.1% |
multimodal-autoregressive-pre-training-of | - | 1200M | - | 88.1% |
revisiting-unreasonable-effectiveness-of-data | - | - | - | 79.2% |
going-deeper-with-image-transformers | - | 12M | - | 80.9% |
pvtv2-improved-baselines-with-pyramid-vision | - | 82M | - | 83.8% |
hvt-a-comprehensive-vision-framework-for | - | - | - | 85% |
metaformer-baselines-for-vision | - | 99M | - | 87.4% |
averaging-weights-leads-to-wider-optima-and | - | - | - | 78.94% |
differentiable-top-k-classification-learning-1 | - | - | - | 88.37% |
fixing-the-train-test-resolution-discrepancy-2 | - | 43M | - | 86.7% |
fastvit-a-fast-hybrid-vision-transformer | - | - | - | 84.5% |
coatnet-marrying-convolution-and-attention | - | - | - | 87.1% |
sp-vit-learning-2d-spatial-priors-for-vision | - | - | - | 86% |
mobilenetv2-inverted-residuals-and-linear | - | 6.9M | - | 74.7% |
mobilenetv4-universal-models-for-the-mobile | - | - | - | 82.9% |
masked-autoencoders-are-scalable-vision | - | - | - | 83.6% |
vitae-vision-transformer-advanced-by | - | 19.2M | - | 82.2% |
convit-improving-vision-transformers-with | - | 27M | - | 81.3% |
fastervit-fast-vision-transformers-with | - | 1360M | - | 85.8% |
resnet-strikes-back-an-improved-training | - | 25M | - | 80.4% |
training-data-efficient-image-transformers | - | 22M | - | 82.6% |
incepformer-efficient-inception-transformer | - | 24.3M | - | 82.9% |
soft-conditional-computation | - | - | - | 78.3% |
transnext-robust-foveal-visual-perception-for | - | 28.2M | - | 84.0% |
go-wider-instead-of-deeper | - | 29M | - | 77.54% |
multigrain-a-unified-image-embedding-for | - | - | - | 82.7% |
an-improved-one-millisecond-mobile-backbone | - | 4.8M | - | 77.4% |
large-scale-learning-of-general-visual | - | - | - | 87.54% |
augmenting-convolutional-networks-with | - | 25.2M | - | 82.1% |
dynamic-convolution-attention-over | - | 42.7M | - | 72.7% |
efficientnetv2-smaller-models-and-faster | - | - | - | 83.9% |
fastvit-a-fast-hybrid-vision-transformer | - | - | - | 79.1% |
revit-enhancing-vision-transformers-with | - | - | - | 82.4% |
bottleneck-transformers-for-visual | - | 33.5M | - | 81.7% |
self-knowledge-distillation-a-simple-way-for | - | - | - | 79.24% |
resmlp-feedforward-networks-for-image | - | 116M | - | 83.6% |
tinyvit-fast-pretraining-distillation-for | - | 5.4M | - | 79.1% |
190409460 | - | - | - | 79.38% |
densely-connected-convolutional-networks | - | - | - | 77.85% |
augmenting-convolutional-networks-with | - | 47.7M | - | 83.2% |
fast-vision-transformers-with-hilo-attention | - | - | - | 83.6% |
glit-neural-architecture-search-for-global | - | 24.6M | - | 80.5% |
fixing-the-train-test-resolution-discrepancy-2 | - | 5.3M | - | 80.2% |
sharpness-aware-minimization-for-efficiently-1 | - | - | - | 81.6% |
high-performance-large-scale-image | - | 377.2M | - | 86.3% |
tokens-to-token-vit-training-vision | - | 39.2M | - | 82.2% |
densely-connected-search-space-for-more | - | - | - | 75.9% |
the-effectiveness-of-mae-pre-pretraining-for | - | - | - | 86.8% |
davit-dual-attention-vision-transformers | - | 1437M | - | 90.4% |
tinyvit-fast-pretraining-distillation-for | - | 21M | - | 86.5% |
not-all-images-are-worth-16x16-words-dynamic | - | - | - | 79.74% |
graph-convolutions-enrich-the-self-attention | - | - | - | 81.5% |
efficientnetv2-smaller-models-and-faster | - | 208M | - | 87.3% |
res2net-a-new-multi-scale-backbone | - | - | - | 81.23% |
internimage-exploring-large-scale-vision | - | 335M | - | 88% |
fastvit-a-fast-hybrid-vision-transformer | - | - | - | 75.6% |
torchdistill-a-modular-configuration-driven | - | - | - | 71.37% |
adversarial-autoaugment-1 | - | - | - | 81.32% |
bottleneck-transformers-for-visual | - | 25.5M | - | 78.8% |
efficientnetv2-smaller-models-and-faster | - | 120M | - | 86.8% |
rexnet-diminishing-representational | - | 7.6M | - | 79.5% |
designing-network-design-spaces | - | 20.6M | - | 79.4% |
maxvit-multi-axis-vision-transformer | - | - | - | 88.82% |
revisiting-weakly-supervised-pre-training-of | - | 633.5M | - | 88.6% |
clcnet-rethinking-of-ensemble-modeling-with | - | - | - | 86.61% |
efficientnet-rethinking-model-scaling-for | - | 30M | - | 83.3% |
unireplknet-a-universal-perception-large | - | - | - | 78.6% |
masked-autoencoders-are-scalable-vision | - | - | - | 86.9% |
lambdanetworks-modeling-long-range-1 | - | 42M | - | 84.3% |
three-things-everyone-should-know-about | - | - | - | 82.3% |
repmlpnet-hierarchical-vision-mlp-with-re | - | - | - | 81.8% |
densenets-reloaded-paradigm-shift-beyond | - | 87M | - | 84.4% |
rethinking-and-improving-relative-position | - | 87M | - | 82.4% |
metaformer-baselines-for-vision | - | 57M | - | 85.6% |
metaformer-baselines-for-vision | - | 39M | - | 85.7% |
maxvit-multi-axis-vision-transformer | - | - | - | 88.38% |
meta-knowledge-distillation | - | - | - | 77.1% |
sequencer-deep-lstm-for-image-classification | - | 54M | - | 84.6% |
Model 1025 | - | - | - | 78.8% |
fixing-the-train-test-resolution-discrepancy-2 | - | 30M | - | 86.4% |
meal-v2-boosting-vanilla-resnet-50-to-80-top | - | - | - | 73.19% |
meta-knowledge-distillation | - | - | - | 86.5% |
transboost-improving-the-best-imagenet | - | 28.59M | - | 82.46% |
improved-multiscale-vision-transformers-for | - | 218M | - | 88.4% |
aggregated-residual-transformations-for-deep | - | 83.6M | - | 80.9% |
deeper-vs-wider-a-revisit-of-transformer | - | - | - | 86.3% |
an-improved-one-millisecond-mobile-backbone | - | 7.8M | - | 77.4% |
deep-polynomial-neural-networks | - | - | - | 77.17% |
improving-vision-transformers-by-revisiting | - | 295.5M | - | 87.3% |
scalable-visual-transformers-with | - | 21.74M | - | 78.00% |
revbifpn-the-fully-reversible-bidirectional | - | 19.6M | - | 81.1% |
localvit-bringing-locality-to-vision | - | 5.9M | - | 74.8% |
contextual-transformer-networks-for-visual | - | 23.1M | - | 81.6% |
maxvit-multi-axis-vision-transformer | - | - | - | 89.36% |
mlp-mixer-an-all-mlp-architecture-for-vision | - | 46M | - | 76.44% |
rexnet-diminishing-representational | - | 4.1M | - | 77.2% |
an-intriguing-failing-of-convolutional-neural | - | - | - | 75.74% |
efficientvit-enhanced-linear-attention-for | - | 24M | - | 82.7% |
scaling-vision-with-sparse-mixture-of-experts | - | 3400M | - | 87.41% |
lip-local-importance-based-pooling | - | 8.7M | - | 76.64% |
vitae-vision-transformer-advanced-by | - | 48.5M | - | 83.6% |
the-effectiveness-of-mae-pre-pretraining-for | - | 650M | - | 89.5% |
multimodal-autoregressive-pre-training-of | - | 600M | - | 87.5% |
going-deeper-with-image-transformers | - | 26.6M | - | 84.1% |
visual-parser-representing-part-whole | - | - | - | 84.2% |
augmenting-convolutional-networks-with | - | 334.3M | - | 87.1% |
learnable-polynomial-trigonometric-and | - | 28M | - | 82.34% |
sliced-recursive-transformer-1 | - | 71.2M | - | 84.8% |
contextual-transformer-networks-for-visual | - | 40.9M | - | 83.2% |
model-rubik-s-cube-twisting-resolution-depth | - | 5.1M | - | 77.7% |
augmenting-convolutional-networks-with | - | 25.2M | - | 85.4% |
meta-knowledge-distillation | - | - | - | 85.1% |