HyperAI

Image Classification on ImageNet

Metrics

Hardware Burden
Number of params
Operations per network pass
Top 1 Accuracy
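
Each entry in the comparison table below reports these metrics for one model, with empty cells shown as "-"; parameter counts carry an "M" (millions) suffix and compute figures a "G" (billions) suffix. A minimal sketch of how one entry might be represented in code, using hypothetical field and helper names:

```python
# Minimal sketch of a single leaderboard entry (hypothetical names).
# "M" suffixes denote millions and "G" suffixes billions; "-" marks an
# empty cell in the comparison table.
from dataclasses import dataclass
from typing import Optional

_SCALE = {"M": 1e6, "G": 1e9}

def parse_magnitude(text: str) -> Optional[float]:
    """Convert a suffixed value such as '22.855952M' or '0.838G' to a number."""
    if text in ("", "-"):
        return None
    if text[-1] in _SCALE:
        return float(text[:-1]) * _SCALE[text[-1]]
    return float(text)

@dataclass
class BenchmarkEntry:
    model: str
    hardware_burden: Optional[float]   # total compute used, if reported
    num_params: Optional[float]        # trainable parameters
    ops_per_pass: Optional[float]      # operations per network pass
    top1_accuracy: Optional[float]     # ImageNet Top 1 Accuracy, in percent

# Example built from one row of the table below:
xception = BenchmarkEntry(
    model="xception-deep-learning-with-depthwise",
    hardware_burden=parse_magnitude("87G"),
    num_params=parse_magnitude("22.855952M"),
    ops_per_pass=parse_magnitude("0.838G"),
    top1_accuracy=79.0,
)
```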

Results

Performance results of various models on this benchmark

Comparison Table
Model Name | Hardware Burden | Number of params | Operations per network pass | Top 1 Accuracy
xception-deep-learning-with-depthwise-87G-22.855952M-0.838G-79%
deep-residual-learning-for-image-recognition-40M-78.25%
cvt-introducing-convolutions-to-vision-20M-83%
densely-connected-convolutional-networks---77.42%
convit-improving-vision-transformers-with-10M-76.7%
when-vision-transformers-outperform-resnets-87M-79.9%
pvtv2-improved-baselines-with-pyramid-vision-25.4M-82%
metaformer-baselines-for-vision-40M-85.4%
mambavision-a-hybrid-mamba-transformer-vision-241.5M-85.3%
fbnetv5-neural-architecture-search-for---78.4%
gated-convolutional-networks-with-hybrid-12.9M-78.5%
convit-improving-vision-transformers-with-48M-82.2%
spatial-channel-token-distillation-for-vision-122.6M-82.4%
rethinking-and-improving-relative-position-22M-80.9%
bottleneck-transformers-for-visual-53.9M-84%
augmenting-sub-model-to-improve-main-model-86.6M-84.2%
maxvit-multi-axis-vision-transformer-120M-84.94%
biformer-vision-transformer-with-bi-level---84.3%
incorporating-convolution-designs-into-visual-6.4M-76.4%
automix-unveiling-the-power-of-mixup-44.6M-80.98%
resnest-split-attention-networks-27.5M-81.13%
dilated-neighborhood-attention-transformer-90M-84.4%
torchdistill-a-modular-configuration-driven---71.08%
next-vit-next-generation-vision-transformer-44.8M-83.2%
improved-multiscale-vision-transformers-for-218M-86.3%
cyclemlp-a-mlp-like-architecture-for-dense-76M-83.2%
efficientnetv2-smaller-models-and-faster-54M-86.2%
sp-vit-learning-2d-spatial-priors-for-vision---83.9%
an-improved-one-millisecond-mobile-backbone-2.1M-72.5%
high-performance-large-scale-image-193.8M-85.1%
mixpro-data-augmentation-with-maskmix-and---76.7%
co-training-2-l-submodels-for-visual---87.5%
hvt-a-comprehensive-vision-framework-for---87.4%
basisnet-two-stage-model-synthesis-for-1---80%
vitae-vision-transformer-advanced-by---75.3%
semi-supervised-recognition-under-a-noisy-and-25.58M-83.0%
Model 37---80.91%
alphanet-improved-training-of-supernet-with---80.0%
which-transformer-to-favor-a-comparative---82.29%
localvit-bringing-locality-to-vision-22.4M-80.8%
bottleneck-transformers-for-visual---83.5%
a-convnet-for-the-2020s-350M-87.8%
an-improved-one-millisecond-mobile-backbone-4.8M-75.9%
an-image-is-worth-16x16-words-transformers-1---87.76%
190409925---79.1%
shape-texture-debiased-neural-network-1-81.2%
gswin-gated-mlp-vision-model-with-39.8M-83.01%
revbifpn-the-fully-reversible-bidirectional-48.7M-83%
rethinking-spatial-dimensions-of-vision-4.9M-74.6%
dilated-neighborhood-attention-transformer---86.5%
an-evolutionary-approach-to-dynamic---86.74%
2103-14899-27.4M-81.5%
exploring-target-representations-for-masked---85.7%
on-the-adequacy-of-untuned-warmup-for---72.1%
alphanet-improved-training-of-supernet-with---77.8%
multigrain-a-unified-image-embedding-for---83.2%
gswin-gated-mlp-vision-model-with-15.5M-80.32%
Model 58---80.25%
wavemix-lite-a-resource-efficient-neural-1-32.4M-67.7%
randaugment-practical-data-augmentation-with---85.4%
maxvit-multi-axis-vision-transformer---88.32%
gtp-vit-efficient-vision-transformers-via---85.4%
an-improved-one-millisecond-mobile-backbone-14.8M-79.4%
cvt-introducing-convolutions-to-vision-32M-84.9%
splitnet-divide-and-co-training-88.6M-82.13%
tinyvit-fast-pretraining-distillation-for-21M-84.8%
performance-of-gaussian-mixture-model---84.1%
resnest-split-attention-networks-70M-83.9%
what-do-deep-networks-like-to-see---77.12%
filter-response-normalization-layer---78.95%
do-you-even-need-attention-a-stack-of-feed-74.9%
repmlp-re-parameterizing-convolutions-into-52.77M-78.60%
scaling-vision-with-sparse-mixture-of-experts-7200M-88.36%
drop-an-octave-reducing-spatial-redundancy-in-20771G-66.8M-2.22G-82.9%
self-training-with-noisy-student-improves-9.2M-82.4%
visformer-the-vision-friendly-transformer-10.3M-78.6%
mixnet-mixed-depthwise-convolutional-kernels-4.1M-75.8%
three-things-everyone-should-know-about---83.4%
multigrain-a-unified-image-embedding-for---83.6%
dynamic-convolution-attention-over-18.6M-67.7%
token-labeling-training-a-85-5-top-1-accuracy-26M-83.3%
metaformer-baselines-for-vision-56M-85.2%
rethinking-local-perception-in-lightweight-4.2M-77%
lets-keep-it-simple-using-simple-3M-75.66%
elsa-enhanced-local-self-attention-for-vision-298M-87.2%
an-improved-one-millisecond-mobile-backbone-2.1M-71.4%
bottleneck-transformers-for-visual-66.6M-82.2%
dilated-neighborhood-attention-transformer-20M-81.8%
revbifpn-the-fully-reversible-bidirectional-142.3M-84.2%
exploring-target-representations-for-masked---88.2%
fixing-the-train-test-resolution-discrepancy-2-9.2M-83.6%
kolmogorov-arnold-transformer-86.6M-81.8%
Model 93---70.54%
debiformer-vision-transformer-with-deformable---83.9%
mega-moving-average-equipped-gated-attention-90M-82.4%
espnetv2-a-light-weight-power-efficient-and-5.9M-74.9%
multigrain-a-unified-image-embedding-for---83.0%
semi-supervised-recognition-under-a-noisy-and-25.58M-84.0%
when-shift-operation-meets-vision-transformer-88M-83.3%
polynomial-networks-in-deep-classifiers-11.51M-71.6%
peco-perceptual-codebook-for-bert-pre---87.5%
puzzle-mix-exploiting-saliency-and-local-1-78.76%
token-labeling-training-a-85-5-top-1-accuracy-56M-84.1%
bottleneck-transformers-for-visual---83.8%
exploring-the-limits-of-weakly-supervised-829M-85.4%
fastvit-a-fast-hybrid-vision-transformer---82.6%
training-data-efficient-image-transformers-86M-84.2%
an-image-is-worth-16x16-words-transformers-1---24%
coca-contrastive-captioners-are-image-text-2100M-91.0%
circumventing-outliers-of-autoaugment-with-88M-85.8%
contrastive-learning-rivals-masked-image-307M-89.0%
transnext-robust-foveal-visual-perception-for-12.8M-82.5%
hyenapixel-global-image-context-with---84.9%
2103-15358-39.7M-83.3%
a-dot-product-attention-free-transformer-22.6M-79.8%
tinyvit-fast-pretraining-distillation-for-11M-83.2%
torchdistill-a-modular-configuration-driven---70.09%
florence-a-new-foundation-model-for-computer-893M-90.05%
coatnet-marrying-convolution-and-attention-42M-83.3%
high-performance-large-scale-image-377.2M-86.0%
transboost-improving-the-best-imagenet-25.56M-81.15%
maxvit-multi-axis-vision-transformer-69M-84.45%
sp-vit-learning-2d-spatial-priors-for-vision---85.5%
transboost-improving-the-best-imagenet-60.19M-80.64%
efficient-self-supervised-learning-with---87.4%
revbifpn-the-fully-reversible-bidirectional-5.11M-75.9%
scarletnas-bridging-the-gap-between-6.5M-76.3%
splitnet-divide-and-co-training-98M-83.34%
fastervit-fast-vision-transformers-with-53.4M-83.2%
fairnas-rethinking-evaluation-fairness-of-4.6M-75.34%
dynamic-convolution-attention-over-4M-69.4%
resmlp-feedforward-networks-for-image-17.7M-78.6%
generalized-parametric-contrastive-learning---86.01%
co-training-2-l-submodels-for-visual---85.8%
mixpro-data-augmentation-with-maskmix-and---73.8%
spinenet-learning-scale-permuted-backbone-for-60.5M-79%
unicom-universal-and-compact-representation---88.3%
billion-scale-semi-supervised-learning-for-193M-84.8%
efficient-multi-order-gated-aggregation-3M-77.2%
pattern-attention-transformer-with-doughnut---83.6%
fastervit-fast-vision-transformers-with-159.5M-84.9%
reproducible-scaling-laws-for-contrastive---88.5%
which-transformer-to-favor-a-comparative---82.11%
190411491-14.7M-75.7%
graph-rise-graph-regularized-image-semantic---68.29%
muxconv-information-multiplexing-in-2.4M-71.6%
mobilenets-efficient-convolutional-neural---70.6%
fractalnet-ultra-deep-neural-networks-without---75.88%
xcit-cross-covariance-image-transformers-189M-86%
understanding-gaussian-attention-bias-of---81.484%
fixing-the-train-test-resolution-discrepancy---79.1%
scaling-vision-transformers-to-22-billion-86M-88.6%
ghostnet-more-features-from-cheap-operations-13M-75%
volo-vision-outlooker-for-visual-recognition-296M-87.1%
dat-spatially-dynamic-vision-transformer-with-53M-84.6%
convit-improving-vision-transformers-with-86M-82.4%
efficientvit-enhanced-linear-attention-for-64M-85.6%
mambavision-a-hybrid-mamba-transformer-vision-227.9M-85%
greedynas-towards-fast-one-shot-nas-with-6.5M-77.1%
visual-attention-network-200M-86.9%
a-dot-product-attention-free-transformer-23M-80.8%
scalable-visual-transformers-with-5.74M-69.64%
on-the-performance-analysis-of-momentum---76.91%
levit-a-vision-transformer-in-convnet-s-39.4M-82.5%
fixing-the-train-test-resolution-discrepancy-25.6M-82.5%
scaling-up-your-kernels-to-31x31-revisiting-335M-87.8%
contextual-transformer-networks-for-visual-55.8M-84.6%
meta-knowledge-distillation---83.1%
involution-inverting-the-inherence-of-34M-79.3%
tokens-to-token-vit-training-vision---83.3%
rethinking-the-design-principles-of-robust-23.3M-81.9%
which-transformer-to-favor-a-comparative---83.61%
a-convnet-for-the-2020s-1827M-88.36%
clcnet-rethinking-of-ensemble-modeling-with---86.42%
going-deeper-with-image-transformers-438M-86.5%
three-things-everyone-should-know-about---85.5%
fbnetv5-neural-architecture-search-for---84.1%
sp-vit-learning-2d-spatial-priors-for-vision---85.1%
semi-supervised-recognition-under-a-noisy-and-5.47M-79.0%
augmenting-convolutional-networks-with-99.4M-83.5%
from-xception-to-nexception-new-design---82%
automix-unveiling-the-power-of-mixup-21.8M-76.1%
balanced-binary-neural-networks-with-gated---59.4%
co-training-2-l-submodels-for-visual---88.0%
transnext-robust-foveal-visual-perception-for-49.7M-86.0%
refiner-refining-self-attention-for-vision-81M-86.03%
co-training-2-l-submodels-for-visual---86.3%
repvgg-making-vgg-style-convnets-great-again-80.31M-78.78%
firecaffe-near-linear-acceleration-of-deep---58.9%
convmlp-hierarchical-convolutional-mlps-for-42.7M-80.2%
Model 191---78.15%
internimage-exploring-large-scale-vision-50M-84.2%
2103-15358-24.6M-82%
scaling-up-visual-and-vision-language-480M-88.64%
from-xception-to-nexception-new-design---81.8%
gtp-vit-efficient-vision-transformers-via---83.7%
multigrain-a-unified-image-embedding-for---79.4%
zen-nas-a-zero-shot-nas-for-high-performance-183M-83.0%
kolmogorov-arnold-transformer-86.6M-82.8%
a-fast-knowledge-distillation-framework-for-5M-78.7%
hyenapixel-global-image-context-with---83.5%
gtp-vit-efficient-vision-transformers-via---81.9%
rethinking-and-improving-relative-position---81.1%
which-transformer-to-favor-a-comparative---81.33%
one-peace-exploring-one-general-1520M--
moat-alternating-mobile-convolution-and-483.2M-89.1%
sequencer-deep-lstm-for-image-classification-28M-82.3%
dilated-neighborhood-attention-transformer-51M-83.8%
tinyvit-fast-pretraining-distillation-for-11M-81.5%
Model 210---78.75%
incorporating-convolution-designs-into-visual---82%
high-performance-large-scale-image-527M-89.2%
self-training-with-noisy-student-improves-66M-86.9%
cswin-transformer-a-general-vision-173M-87.5%
debiformer-vision-transformer-with-deformable---84.4%
cutmix-regularization-strategy-to-train---78.4%
mambavision-a-hybrid-mamba-transformer-vision-97.7M-84.2%
expeditious-saliency-guided-mix-up-through---77.39%
nasvit-neural-architecture-search-for---78.2%
volo-vision-outlooker-for-visual-recognition-59M-86%
wave-vit-unifying-wavelet-and-transformers-33.5M-84.8%
alphanet-improved-training-of-supernet-with---78.9%
nasvit-neural-architecture-search-for---81.0%
mixpro-data-augmentation-with-maskmix-and---82.9%
omnivec2-a-novel-transformer-based-network---89.3%
transboost-improving-the-best-imagenet-71.71M-82.16%
rexnet-diminishing-representational-2.7M-74.6%
next-vit-next-generation-vision-transformer-57.8M-84.7%
vision-models-are-more-robust-and-fair-when-10000M-85.8%
nasvit-neural-architecture-search-for---81.4%
masked-autoencoders-are-scalable-vision---85.9%
vitaev2-vision-transformer-advanced-by-644M-88.5%
neighborhood-attention-transformer-20M-81.8%
an-image-is-worth-16x16-words-transformers-1----
revisiting-resnets-improved-training-and-192M-84.4%
efficientnet-rethinking-model-scaling-for-66M-84.4%
2103-15358-55.7M-83.2%
lets-keep-it-simple-using-simple-1.5M-61.52%
vision-gnn-an-image-is-worth-graph-of-nodes-10.7M-78.2%
adaptive-split-fusion-transformer-56.7M-83.9%
gtp-vit-efficient-vision-transformers-via---85.8%
uninet-unified-architecture-search-with-1-11.5M-80.8%
augmenting-convolutional-networks-with-99.4M-86.5%
mixpro-data-augmentation-with-maskmix-and---80.6%
going-deeper-with-image-transformers-68.2M-85.4%
which-transformer-to-favor-a-comparative---82.54%
Model 247---76.3%
fastvit-a-fast-hybrid-vision-transformer---84.9%
davit-dual-attention-vision-transformers-87.9M-86.9%
co-training-2-l-submodels-for-visual---87.1%
Model 251---81.92%
firecaffe-near-linear-acceleration-of-deep---68.3%
sp-vit-learning-2d-spatial-priors-for-vision---84.9%
efficientnet-rethinking-model-scaling-for-9.2M-79.8%
efficientnet-rethinking-model-scaling-for-12M-81.1%
exploring-the-limits-of-weakly-supervised-466M-85.1%
correlated-input-dependent-label-noise-in-68.6%
not-all-images-are-worth-16x16-words-dynamic---78.48%
rethinking-and-improving-relative-position---81.4%
container-context-aggregation-network-20M-82%
self-training-with-noisy-student-improves-30M-86.1%
srm-a-style-based-recalibration-module-for---78.47%
densely-connected-convolutional-networks---74.98%
self-training-with-noisy-student-improves-19M-85.3%
Model 265---78.36%
exploring-randomly-wired-neural-networks-for-61.5M-80.1%
mambavision-a-hybrid-mamba-transformer-vision-31.8M-82.3%
fast-vision-transformers-with-hilo-attention-28M-82%
your-diffusion-model-is-secretly-a-zero-shot---79.1%
xcit-cross-covariance-image-transformers-84M-85.8%
semi-supervised-learning-of-visual-features---75.5%
cutmix-regularization-strategy-to-train---80.53%
convit-improving-vision-transformers-with-152M-82.5%
res2net-a-new-multi-scale-backbone---78.59%
exploring-target-representations-for-masked---87.8%
going-deeper-with-image-transformers-271M-86.3%
rest-an-efficient-transformer-for-visual-13.66M-79.6%
resnest-split-attention-networks-27.5M-80.64%
mobilevitv3-mobile-friendly-vision-5.8M-79.3%
high-performance-large-scale-image-254.9M-85.7%
visual-attention-network-60M-86.6%
tokenmixup-efficient-attention-guided-token---82.37%
wide-residual-networks---78.1%
unsupervised-data-augmentation-1---79.04%
mobilenetv4-universal-models-for-the-mobile---79.9%
parametric-contrastive-learning---80.9%
designing-network-design-spaces-6.3M-76.3%
adversarial-autoaugment-1---79.4%
mixmim-mixed-and-masked-image-modeling-for-88M-85.1%
neighborhood-attention-transformer-28M-83.2%
mixpro-data-augmentation-with-maskmix-and---82.7%
hornet-efficient-high-order-spatial---87.7%
vitae-vision-transformer-advanced-by-13.2M-81%
going-deeper-with-image-transformers-46.9M-85.1%
unireplknet-a-universal-perception-large---87.4%
designing-bert-for-convolutional-networks-198M-86.0%
swin-transformer-hierarchical-vision-88M-86.4%
Model 298---77.71%
quantnet-learning-to-quantize-by-learning---71.97%
adversarial-examples-improve-image-66M-85.2%
whats-hidden-in-a-randomly-weighted-neural-20.6M-73.3%
involution-inverting-the-inherence-of-15.5M-78.4%
elsa-enhanced-local-self-attention-for-vision-28M-82.7%
next-vit-next-generation-vision-transformer-31.7M-82.5%
muxconv-information-multiplexing-in-4.0M-76.6%
alphanet-improved-training-of-supernet-with---80.3%
2103-14899-44.3M-82.8%
sliced-recursive-transformer-1-21.3M-84.3%
resmlp-feedforward-networks-for-image---79.4%
mobilevit-light-weight-general-purpose-and-5.6M-78.4%
masked-image-residual-learning-for-scaling-1-96M-84.8%
neighborhood-attention-transformer-90M-84.3%
bottleneck-transformers-for-visual-49.2M-81.4%
pvtv2-improved-baselines-with-pyramid-vision-45.2M-83.2%
scalable-pre-training-of-large-autoregressive---84.0%
distilled-gradual-pruning-with-pruned-fine-2.56M-73.66%
uninet-unified-architecture-search-with-1-72.9M-87%
fastvit-a-fast-hybrid-vision-transformer---80.6%
gated-convolutional-networks-with-hybrid-42.2M-80.5%
augmenting-sub-model-to-improve-main-model-632M-85.7%
fbnetv5-neural-architecture-search-for---77.2%
tiny-models-are-the-computational-saver-for---85.75%
mish-a-self-regularized-non-monotonic-neural---79.8%
differentiable-model-compression-via-pseudo-82.0%
graph-convolutions-enrich-the-self-attention---82.8%
model-soups-averaging-weights-of-multiple-1843M-90.94%
torchdistill-a-modular-configuration-driven---70.52%
filter-response-normalization-layer---77.21%
scarletnas-bridging-the-gap-between-6M-75.6%
cas-vit-convolutional-additive-self-attention-21.76M-84.1%
lets-keep-it-simple-using-simple-9.5M-74.17%
fast-autoaugment---80.6%
self-training-with-noisy-student-improves-12M-84.1%
uninet-unified-architecture-search-with-11.9M-79.1%
lets-keep-it-simple-using-simple-1.5M-69.11%
involution-inverting-the-inherence-of-9.2M-75.9%
not-all-images-are-worth-16x16-words-dynamic-80.43%
uninet-unified-architecture-search-with-73.5M-85.2%
omnivore-a-single-model-for-many-visual---86.0%
densely-connected-convolutional-networks---76.2%
efficientnet-rethinking-model-scaling-for-5.3M-76.3%
rethinking-local-perception-in-lightweight-12.3M-81.6%
compact-global-descriptor-for-neural-networks-4.26M-72.56%
improved-multiscale-vision-transformers-for-667M-88%
revisiting-a-knn-based-image-classification---79.8%
visual-representation-learning-from-unlabeled---85%
efficientvit-enhanced-linear-attention-for-64M-86%
billion-scale-semi-supervised-learning-for-42M-83.4%
incepformer-efficient-inception-transformer-14.0M-80.5%
fast-vision-transformers-with-hilo-attention-87M-84.7%
rexnet-diminishing-representational-34.8M-84.5%
identity-mappings-in-deep-residual-networks---79.9%
mixpro-data-augmentation-with-maskmix-and---81.2%
maxvit-multi-axis-vision-transformer---86.7%
asymmnet-towards-ultralight-convolution-3.1M-68.4%
Model 356-62M-63.3%
three-things-everyone-should-know-about---84.1%
mobilevitv3-mobile-friendly-vision-1.2M-70.98%
edgenext-efficiently-amalgamated-cnn-1.3M-71.2%
distilled-gradual-pruning-with-pruned-fine-1.03M-65.59%
sparse-mlp-for-image-recognition-is-self-65.9M-83.4%
tokens-to-token-vit-training-vision-64.4M-82.6%
bias-loss-for-mobile-neural-networks-5.5M-76.2%
metaformer-baselines-for-vision-27M-85.0%
transboost-improving-the-best-imagenet---79.03%
graph-convolutions-enrich-the-self-attention---83%
an-improved-one-millisecond-mobile-backbone-10.1M-80.0%
torchdistill-a-modular-configuration-driven---71.56%
on-the-performance-analysis-of-momentum---67.74%
hrformer-high-resolution-transformer-for-50.3M-82.8%
collaboration-of-experts-achieving-80-top-1---81.5%
metaformer-baselines-for-vision-40M-86.4%
incorporating-convolution-designs-into-visual-24.2M-83.3%
bossnas-exploring-hybrid-cnn-transformers---82.2%
efficientnetv2-smaller-models-and-faster-22M-84.9%
unireplknet-a-universal-perception-large---86.4%
torchdistill-a-modular-configuration-driven---70.93%
densenets-reloaded-paradigm-shift-beyond-50M-83.7%
deepmad-mathematical-architecture-design-for-89M-84%
involution-inverting-the-inherence-of-25.6M-79.1%
uninet-unified-architecture-search-with-73.5M-84.2%
automix-unveiling-the-power-of-mixup-25.6M-79.25%
fixing-the-train-test-resolution-discrepancy-62G-829M-86.4%
wavemix-lite-a-resource-efficient-neural---74.93%
fast-autoaugment---77.6%
activemlp-an-mlp-like-architecture-with-27.2M-82%
a-dot-product-attention-free-transformer-20.3M-80.2%
uninet-unified-architecture-search-with-14M-80.4%
going-deeper-with-image-transformers-185.9M-85.8%
maxvit-multi-axis-vision-transformer---89.53%
maxup-a-simple-way-to-improve-generalization-87.42M-85.8%
scaling-local-self-attention-for-parameter-87M-85.5%
global-context-vision-transformers-20M-82.0%
meal-v2-boosting-vanilla-resnet-50-to-80-top---80.67%
ghostnet-more-features-from-cheap-operations-2.6M-66.2%
self-training-with-noisy-student-improves-5.3M-78.8%
convmlp-hierarchical-convolutional-mlps-for-9M-76.8%
Model 398---66.04%
efficientnet-rethinking-model-scaling-for-43M-84%
semi-supervised-learning-of-visual-features---66.5%
multigrain-a-unified-image-embedding-for---83.1%
metaformer-baselines-for-vision-39M-84.5%
rethinking-and-improving-relative-position-6M-73.7%
perceiver-general-perception-with-iterative-44.9M-78%
rexnet-diminishing-representational-34.7M-82.8%
maxvit-multi-axis-vision-transformer-212M-85.17%
eca-net-efficient-channel-attention-for-deep-57.40M-78.92%
vision-transformer-with-deformable-attention-50M-83.7%
neighborhood-attention-transformer-51M-83.7%
unconstrained-open-vocabulary-image---88.21%
bottleneck-transformers-for-visual-44.4M-80%
parametric-contrastive-learning---81.8%
metaformer-baselines-for-vision-39M-85.8%
which-transformer-to-favor-a-comparative---81.96%
an-algorithm-for-routing-vectors-in-sequences-312.8M-86.7%
global-context-vision-transformers-28M-83.4%
torchdistill-a-modular-configuration-driven-71.71%
maxvit-multi-axis-vision-transformer---86.4%
deformable-kernels-adapting-effective---78.5%
internimage-exploring-large-scale-vision-97M-84.9%
resnest-split-attention-networks-48M-83.0%
sparse-mlp-for-image-recognition-is-self-24.1M-81.9%
autodropout-learning-dropout-patterns-to-80.3%
coatnet-marrying-convolution-and-attention-75M-84.1%
adaptively-connected-neural-networks-29.38M-77.5%
mixpro-data-augmentation-with-maskmix-and---83.7%
rexnet-diminishing-representational-4.8M-77.9%
boosting-discriminative-visual-representation-11.7M-72.33%
cas-vit-convolutional-additive-self-attention-3.2M-78.7%
an-improved-one-millisecond-mobile-backbone-14.8M-81.4%
the-information-pathways-hypothesis---81.89%
enhance-the-visual-representation-via---87.02%
polyloss-a-polynomial-expansion-perspective-1---87.2%
wave-vit-unifying-wavelet-and-transformers-22.7M-83.9%
global-context-vision-transformers-90M-84.5%
mixpro-data-augmentation-with-maskmix-and---82.8%
efficient-multi-order-gated-aggregation-83M-84.7%
multimodal-autoregressive-pre-training-of---88.5%
aggregating-nested-transformers-17M-81.5%
fairnas-rethinking-evaluation-fairness-of-4.4M-74.69%
scarletnas-bridging-the-gap-between-12G-27.8M-0.42G-82.3%
coatnet-marrying-convolution-and-attention---87.6%
self-training-with-noisy-student-improves-7.8M-81.5%
gtp-vit-efficient-vision-transformers-via---82.8%
Model 445---81.12%
rethinking-local-perception-in-lightweight-7.2M-79.8%
metaformer-baselines-for-vision-26M-83.6%
maxvit-multi-axis-vision-transformer---88.69%
metaformer-baselines-for-vision-100M-85.7%
bias-loss-for-mobile-neural-networks-7.1M-77.1%
high-performance-large-scale-image-71.5M-83.6%
metaformer-baselines-for-vision-57M-84.5%
augmenting-convolutional-networks-with-188.6M-84.1%
revbifpn-the-fully-reversible-bidirectional-3.42M-72.8%
metaformer-baselines-for-vision-40M-84.1%
cvt-introducing-convolutions-to-vision---81.6%
reversible-column-networks-2158M-90.0%
wave-vit-unifying-wavelet-and-transformers-57.5M-85.5%
tokens-to-token-vit-training-vision-21.5M-81.5%
metaformer-baselines-for-vision-27M-83.0%
attentional-feature-fusion-34.7M-80.22%
three-things-everyone-should-know-about---84.1%
dynamic-convolution-attention-over-11.1M-74.4%
pvtv2-improved-baselines-with-pyramid-vision-3.4M-70.5%
2103-15358-39.8M-82.9%
adaptive-split-fusion-transformer-19.3M-82.7%
multiscale-vision-transformers-72.9M-84.8%
boosting-discriminative-visual-representation-44.6M-81.08%
Model 469---78.15%
generalized-parametric-contrastive-learning---84.0%
davit-dual-attention-vision-transformers-28.3M-82.8%
contextual-classification-using-self-77.0%
aggregating-nested-transformers-38M-83.3%
learned-queries-for-efficient-local-attention-25M-83.2%
generalized-parametric-contrastive-learning---79.7%
gated-attention-coding-for-training-high---70.42%
maxvit-multi-axis-vision-transformer---88.46%
dilated-neighborhood-attention-transformer---87.4%
model-soups-averaging-weights-of-multiple-2440M-90.98%
uniformer-unifying-convolution-and-self-100M-86.3%
2103-14899-43.3M-82.5%
dat-spatially-dynamic-vision-transformer-with-24M-83.9%
global-filter-networks-for-image-54M-82.9%
nasvit-neural-architecture-search-for---79.7%
unireplknet-a-universal-perception-large---83.2%
sparse-mlp-for-image-recognition-is-self-48.6M-83.1%
colornet-investigating-the-importance-of---82.35%
volo-vision-outlooker-for-visual-recognition-27M-85.2%
fastervit-fast-vision-transformers-with-75.9M-84.2%
exploring-the-limits-of-weakly-supervised-194M-84.2%
nasvit-neural-architecture-search-for---80.5%
deit-iii-revenge-of-the-vit---85.7%
compress-image-to-patches-for-vision---77%
discrete-representations-strengthen-vision-1---85.07%
which-transformer-to-favor-a-comparative---83.65%
mamba2d-a-natively-multi-dimensional-state---82.4%
lip-local-importance-based-pooling-42.9M-79.33%
coca-contrastive-captioners-are-image-text-2100M-91.0%
masked-image-residual-learning-for-scaling-1-341M-86.2%
rethinking-the-design-principles-of-robust-10.9M-79.2%
large-scale-learning-of-general-visual-928M-85.39%
co-training-2-l-submodels-for-visual---86.2%
efficient-multi-order-gated-aggregation-25M-83.4%
when-vision-transformers-outperform-resnets-236M-81.1%
unireplknet-a-universal-perception-large---88%
rexnet-diminishing-representational-16.5M-83.2%
dat-spatially-dynamic-vision-transformer-with-93M-84.9%
spatial-group-wise-enhance-improving-semantic-25.56M-77.584%
uniformer-unifying-convolution-and-self-22M-83.4%
augmenting-sub-model-to-improve-main-model-304M-85.3%
averaging-weights-leads-to-wider-optima-and---78.44%
metaformer-baselines-for-vision-100M-87.6%
deit-iii-revenge-of-the-vit---81.4%
pay-attention-to-mlps-73M-81.6%
the-effectiveness-of-mae-pre-pretraining-for-6500M-90.1%
maxvit-multi-axis-vision-transformer---88.7%
xcit-cross-covariance-image-transformers-48M-85.6%
visual-attention-network---87%
efficientnet-rethinking-model-scaling-for-19M-82.6%
revisiting-resnets-improved-training-and---83.8%
multigrain-a-unified-image-embedding-for---82.6%
understanding-the-robustness-in-vision-76.8M-87.1%
zen-nas-a-zero-shot-nas-for-high-performance-5.7M-78%
learning-transferable-architectures-for-1648G-88.9M-2.38G-82.7%
rexnet-diminishing-representational-19M-81.6%
vision-gnn-an-image-is-worth-graph-of-nodes-27.3M-82.1%
dat-spatially-dynamic-vision-transformer-with-94M-85.9%
deit-iii-revenge-of-the-vit---83.8%
Model 529---67.63%
bnn-bn-training-binary-neural-networks-68.0%
fixing-the-train-test-resolution-discrepancy-2---85.7%
mambavision-a-hybrid-mamba-transformer-vision-35.1M-82.7%
attentive-normalization---81.87%
alphanet-improved-training-of-supernet-with---80.8%
metaformer-baselines-for-vision-27M-83.7%
semi-supervised-recognition-under-a-noisy-and-76M-85.1%
hrformer-high-resolution-transformer-for-8.0M-78.5%
which-transformer-to-favor-a-comparative---81.09%
transboost-improving-the-best-imagenet-11.69M-73.36%
metaformer-baselines-for-vision-26M-85.0%
a-large-batch-optimizer-reality-check-75.92%
lets-keep-it-simple-using-simple-3M-68.15%
deep-residual-learning-for-image-recognition-25M-75.3%
centroid-transformers-learning-to-abstract-22.3M-80.9%
transnext-robust-foveal-visual-perception-for-89.7M-86.2%
going-deeper-with-image-transformers-89.5M-85.3%
dilated-neighborhood-attention-transformer-200M-87.5%
supervised-contrastive-learning-80.8%
metaformer-baselines-for-vision-57M-86.1%
asymmetric-masked-distillation-for-pre-87M-84.6%
davit-dual-attention-vision-transformers-87.9M-84.6%
tiny-models-are-the-computational-saver-for---85.24%
mobilevitv3-mobile-friendly-vision---78.64%
internimage-exploring-large-scale-vision-3000M-90.1%
deit-iii-revenge-of-the-vit---83.1%
when-shift-operation-meets-vision-transformer-28M-81.7%
grafit-learning-fine-grained-image-79.6%
mnasnet-platform-aware-neural-architecture-3.9M-75.2%
tokenlearner-what-can-8-learned-tokens-do-for---87.07%
neural-architecture-transfer-9.1M-80.5%
distilling-out-of-distribution-robustness-1---81.9%
dynamicvit-efficient-vision-transformers-with-57.1M-83.9%
rethinking-spatial-dimensions-of-vision-10.6M-79.1%
bag-of-tricks-for-image-classification-with-25M-77.16%
deit-iii-revenge-of-the-vit---84.9%
mnasnet-platform-aware-neural-architecture-5.2M-0.0403G-76.7%
fixing-the-train-test-resolution-discrepancy-2-19M-85.9%
efficientnet-rethinking-model-scaling-for-7.8M-78.8%
Model 569---81.97%
localvit-bringing-locality-to-vision-4.3M-72.5%
tinyvit-fast-pretraining-distillation-for-21M-86.2%
single-path-nas-designing-hardware-efficient---74.96%
volo-vision-outlooker-for-visual-recognition-86M-86.3%
levit-a-vision-transformer-in-convnet-s-17.8M-81.6%
multigrain-a-unified-image-embedding-for---75.1%
fixing-the-train-test-resolution-discrepancy-2-480M-88.5%
dynamic-convolution-attention-over-2.8M-64.9%
deit-iii-revenge-of-the-vit-304.8M-85.8%
high-performance-large-scale-image-438.4M-86.5%
deepvit-towards-deeper-vision-transformer-55M-82.2%
biformer-vision-transformer-with-bi-level---81.4%
metaformer-baselines-for-vision-26M-85.4%
selective-kernel-networks-48.9M-79.81%
going-deeper-with-image-transformers-17.3M-82.2%
unireplknet-a-universal-perception-large---87.9%
efficientnetv2-smaller-models-and-faster---85.7%
visformer-the-vision-friendly-transformer-40.2M-82.2%
boosting-discriminative-visual-representation-21.8M-76.35%
online-training-through-time-for-spiking---65.15%
pattern-attention-transformer-with-doughnut---83.1%
unconstrained-open-vocabulary-image---83.46%
improved-multiscale-vision-transformers-for-667M-88.8%
spatial-channel-token-distillation-for-vision-22.2M-75.7%
transboost-improving-the-best-imagenet-21.8M-76.70%
tresnet-high-performance-gpu-dedicated-77M-84.3%
metaformer-baselines-for-vision-99M-86.4%
vision-gnn-an-image-is-worth-graph-of-nodes-51.7M-83.1%
maxvit-multi-axis-vision-transformer---86.34%
mobilevit-light-weight-general-purpose-and-2.3M-74.8%
mixpro-data-augmentation-with-maskmix-and---84.1%
visual-attention-network-26.6M-82.8%
visual-attention-network-90M-86.3%
unireplknet-a-universal-perception-large---83.9%
levit-a-vision-transformer-in-convnet-s-4.7M-75.7%
lambdanetworks-modeling-long-range-1-35M-84.0%
rckd-response-based-cross-task-knowledge-3M-78.6%
incorporating-convolution-designs-into-visual---78.8%
greedynas-towards-fast-one-shot-nas-with-4.7M-76.2%
deeper-vs-wider-a-revisit-of-transformer---84.2%
2103-15358-6.7M-76.7%
fbnetv5-neural-architecture-search-for---81.8%
x-volution-on-the-unification-of-convolution---75%
metaformer-baselines-for-vision-56M-86.6%
parametric-contrastive-learning---81.3%
re-labeling-imagenet-from-single-to-multi-4.8M-78.4%
self-training-with-noisy-student-improves-43M-86.4%
transboost-improving-the-best-imagenet-44.55M-79.86%
swin-transformer-v2-scaling-up-capacity-and-3000M-90.17%
deit-iii-revenge-of-the-vit-87M-85.0%
mlp-mixer-an-all-mlp-architecture-for-vision---85.3%
container-context-aggregation-network-22.1M-82.7%
graph-convolutions-enrich-the-self-attention---81.1%
contextual-convolutional-neural-networks-60M-79.03%
tokens-to-token-vit-training-vision---82.3%
conformer-local-features-coupling-global-83.3M-84.1%
differentiable-spike-rethinking-gradient---71.24%
which-transformer-to-favor-a-comparative---78.42%
sequencer-deep-lstm-for-image-classification-54M-83.4%
mobilenetv4-universal-models-for-the-mobile---80.7%
which-transformer-to-favor-a-comparative---84.91%
collaboration-of-experts-achieving-80-top-1---80%
fixing-the-train-test-resolution-discrepancy-2-19M-84.0%
co-training-2-l-submodels-for-visual---85.8%
efficientvit-enhanced-linear-attention-for-49M-84.2%
designing-network-design-spaces-6.1M-75.5%
beit-bert-pre-training-of-image-transformers-331M-88.60%
lets-keep-it-simple-using-simple-9.5M-81.24%
deeper-vs-wider-a-revisit-of-transformer---87.1%
fastervit-fast-vision-transformers-with-424.6M-85.4%
tiny-models-are-the-computational-saver-for---86.24%
cas-vit-convolutional-additive-self-attention-12.42M-83.0%
compounding-the-performance-improvements-of---84.2%
swin-transformer-hierarchical-vision-197M-87.3%
towards-all-in-one-pre-training-via---89.6%
colornet-investigating-the-importance-of---84.32%
nasvit-neural-architecture-search-for---81.8%
boosting-discriminative-visual-representation-25.6M-79.41%
muxconv-information-multiplexing-in-3.4M-75.3%
sp-vit-learning-2d-spatial-priors-for-vision---86.3%
localvit-bringing-locality-to-vision-13.5M-78.2%
efficientnetv2-smaller-models-and-faster---85.1%
rethinking-spatial-dimensions-of-vision-23.5M-81.9%
eca-net-efficient-channel-attention-for-deep-24.37M-77.48%
progressive-neural-architecture-search-86.1M-2.5G-82.9%
deit-iii-revenge-of-the-vit-22M-83.4%
levit-a-vision-transformer-in-convnet-s-8.8M-79.6%
transformer-in-transformer-65.6M-83.9%
when-vision-transformers-outperform-resnets-64M-79%
resnest-split-attention-networks-111M-84.5%
deit-iii-revenge-of-the-vit---85.2%
bottleneck-transformers-for-visual-28.02M-79.4%
fbnet-hardware-aware-efficient-convnet-design-5.5M-74.9%
dilated-neighborhood-attention-transformer-197M-87.4%
maxvit-multi-axis-vision-transformer---89.41%
convmlp-hierarchical-convolutional-mlps-for-17.4M-79%
asymmetric-masked-distillation-for-pre-22M-82.1%
masked-autoencoders-are-scalable-vision-656M-87.8%
mobilenetv4-universal-models-for-the-mobile---73.8%
circumventing-outliers-of-autoaugment-with-66M-85.5%
asymmnet-towards-ultralight-convolution-5.99M-75.4%
hvt-a-comprehensive-vision-framework-for---80.1%
190409460---77.8%
mobilevitv3-mobile-friendly-vision-1.4M-72.33%
shufflenet-an-extremely-efficient---70.9%
autoformer-searching-transformers-for-visual-22.9M-81.7%
going-deeper-with-image-transformers-270.9M-86.1%
transboost-improving-the-best-imagenet-5.29M-78.60%
learning-visual-representations-for-transfer-1-76.71%
which-transformer-to-favor-a-comparative---78.34%
meta-pseudo-labels-390M-90%
unireplknet-a-universal-perception-large---81.6%
vitae-vision-transformer-advanced-by-4.8M-76.8%
pvtv2-improved-baselines-with-pyramid-vision-13.1M-78.7%
asymmnet-towards-ultralight-convolution-2.8M-69.2%
scaling-vision-with-sparse-mixture-of-experts-656M-88.08%
efficient-multi-order-gated-aggregation-5.2M-80%
tinyvit-fast-pretraining-distillation-for-5.4M-80.7%
aggregating-nested-transformers-68M-83.8%
exploring-the-limits-of-weakly-supervised-88M-82.2%
co-training-2-l-submodels-for-visual---83.1%
cvt-introducing-convolutions-to-vision-18M-82.2%
uniformer-unifying-convolution-and-self-100M-85.6%
swin-transformer-hierarchical-vision-29M-81.3%
resmlp-feedforward-networks-for-image-15.4M-77.8%
metaformer-baselines-for-vision-40M-85.4%
a-simple-episodic-linear-probe-improves---76.13%
vicinity-vision-transformer-61.8M-84.1%
patches-are-all-you-need-1-51.6M-82.20%
dicenet-dimension-wise-convolutions-for---75.1%
perceiver-general-perception-with-iterative---76.4%
davit-dual-attention-vision-transformers-196.8M-87.5%
omnivore-a-single-model-for-many-visual---85.3%
autodropout-learning-dropout-patterns-to---78.7%
a-convnet-for-the-2020s-198M-85.5%
efficientvit-enhanced-linear-attention-for---83.5%
resnet-strikes-back-an-improved-training-22M-80.4%
eva-exploring-the-limits-of-masked-visual-1000M-89.7%
multiscale-deep-equilibrium-models-81M-79.2%
when-shift-operation-meets-vision-transformer-50M-82.8%
exploring-randomly-wired-neural-networks-for-5.6M-74.7%
splitnet-divide-and-co-training-98M-83.6%
three-things-everyone-should-know-about---82.6%
alphanet-improved-training-of-supernet-with---79.1%
dynamic-convolution-attention-over-7M-72.8%
x-volution-on-the-unification-of-convolution-76.6%
global-context-vision-transformers-12M-79.8%
uninet-unified-architecture-search-with-1-117M-87.4%
shufflenet-v2-practical-guidelines-for---75.4%
dropblock-a-regularization-method-for---78.35%
greedynas-towards-fast-one-shot-nas-with-5.2M-76.8%
alphanet-improved-training-of-supernet-with---79.4%
differentially-private-image-classification---88.9%
escaping-the-big-data-paradigm-with-compact-22.36M--
visual-attention-network-13.9M-81.1%
fbnetv5-neural-architecture-search-for---81.7%
multigrain-a-unified-image-embedding-for---78.2%
a-fast-knowledge-distillation-framework-for---81.9%
fast-vision-transformers-with-hilo-attention-49M-83.3%
resnet-strikes-back-an-improved-training-60.2M-82.4%
scarletnas-bridging-the-gap-between-6.7M-76.9%
fastvit-a-fast-hybrid-vision-transformer---79.8%
maxvit-multi-axis-vision-transformer-31M-83.62%
an-improved-one-millisecond-mobile-backbone-10.1M-78.1%
2103-14899-28.2M-82.3%
randaugment-practical-data-augmentation-with-66M-85%
mixnet-mixed-depthwise-convolutional-kernels-7.3M-78.9%
sliced-recursive-transformer-1-4.8M-77.6%
muxconv-information-multiplexing-in-1.8M-66.7%
clcnet-rethinking-of-ensemble-modeling-with---83.88%
high-performance-large-scale-image-132.6M-84.7%
model-rubik-s-cube-twisting-resolution-depth-11.9M-79.4%
tokens-to-token-vit-training-vision---81.9%
dynamic-convolution-attention-over-4.8M-69.7%
moat-alternating-mobile-convolution-and-190M-86.7%
go-wider-instead-of-deeper-40M-79.49%
repvgg-making-vgg-style-convnets-great-again-55.77M-78.5%
metaformer-baselines-for-vision-39M-86.9%
co-training-2-l-submodels-for-visual---84.2%
metaformer-baselines-for-vision-26M-84.1%
davit-dual-attention-vision-transformers-362M-90.2%
transboost-improving-the-best-imagenet-5.48M-76.81%
glit-neural-architecture-search-for-global-96.1M-82.3%
pyramidal-convolution-rethinking-42.3M-81.49%
spatial-channel-token-distillation-for-vision-30.1M-82.1%
high-performance-large-scale-image-316.1M-85.9%
multimodal-autoregressive-pre-training-of-2700M--
vision-transformer-with-deformable-attention-29M-82.0%
eca-net-efficient-channel-attention-for-deep-3.34M-72.56%
maxvit-multi-axis-vision-transformer---89.12%
three-things-everyone-should-know-about---84.3%
twins-revisiting-spatial-attention-design-in-99.2M-83.7%
visual-attention-network---85.7%
autoformer-searching-transformers-for-visual-5.7M-74.7%
fixing-the-train-test-resolution-discrepancy---79.8%
moat-alternating-mobile-convolution-and-27.8M-83.3%
revbifpn-the-fully-reversible-bidirectional-10.6M-79%
designing-network-design-spaces-11.2M-78%
tiny-models-are-the-computational-saver-for---83.52%
efficientvit-enhanced-linear-attention-for-53M-84.5%
rethinking-spatial-dimensions-of-vision-73.8M-84%
visual-attention-network-200M-87.8%
ghostnetv3-exploring-the-training-strategies---69.4%
edgenext-efficiently-amalgamated-cnn-5.6M-79.4%
edgeformer-improving-light-weight-convnets-by-5M-78.63%
billion-scale-semi-supervised-learning-for-88M-84.3%
vision-gnn-an-image-is-worth-graph-of-nodes-92.6M-83.7%
global-context-vision-transformers-51M-84.0%
cvt-introducing-convolutions-to-vision---87.7%
fixing-the-train-test-resolution-discrepancy-2-66M-87.1%
autodropout-learning-dropout-patterns-to---77.5%
maxvit-multi-axis-vision-transformer---85.72%
designing-network-design-spaces-39.2M-79.9%
ghostnet-more-features-from-cheap-operations-6.5M-74.1%
involution-inverting-the-inherence-of-12.4M-77.6%
rexnet-diminishing-representational-9.7M-80.3%
adversarial-examples-improve-image-88M-85.5%
gtp-vit-efficient-vision-transformers-via---81.5%
Model 788---81.16%
resmlp-feedforward-networks-for-image-45M-79.7%
unireplknet-a-universal-perception-large---80.2%
biformer-vision-transformer-with-bi-level---85.4%
tokenlearner-what-can-8-learned-tokens-do-for-460M-88.87%
lets-keep-it-simple-using-simple-5.7M-71.94%
rest-an-efficient-transformer-for-visual-51.63M-83.6%
ghostnet-more-features-from-cheap-operations-7.3M-75.7%
co-training-2-l-submodels-for-visual---85.0%
inception-v4-inception-resnet-and-the-impact-55.8M-80.1%
deep-residual-learning-for-image-recognition---78.57%
distilled-gradual-pruning-with-pruned-fine-1.15M-65.22%
sliced-recursive-transformer-1-4M-74.0%
beit-bert-pre-training-of-image-transformers-86M-86.3%
metaformer-baselines-for-vision-56M-87.5%
designing-network-design-spaces-4.3M-74.1%
Model 804---78.62%
localvit-bringing-locality-to-vision-6.3M-75.9%
metaformer-baselines-for-vision-56M-86.2%
coatnet-marrying-convolution-and-attention---88.52%
tinyvit-fast-pretraining-distillation-for-21M-83.1%
debiformer-vision-transformer-with-deformable---81.9%
metaformer-baselines-for-vision-100M-87.0%
going-deeper-with-image-transformers-38.6M-84.8%
drawing-multiple-augmentation-samples-per-377.2M-86.78%
multimodal-autoregressive-pre-training-of---89.5%
scaling-vision-transformers-to-22-billion-307M-89.6%
metaformer-is-actually-what-you-need-for-73M-82.5%
visual-attention-network-4.1M-75.4%
gtp-vit-efficient-vision-transformers-via---79.5%
sliced-recursive-transformer-1-21M-83.8%
ghostnet-more-features-from-cheap-operations-5.2M-73.9%
lip-local-importance-based-pooling-25.8M-78.15%
elsa-enhanced-local-self-attention-for-vision-27M-84.7%
mnasnet-platform-aware-neural-architecture-4.8M-75.6%
glit-neural-architecture-search-for-global-7.2M-76.3%
autoformer-searching-transformers-for-visual-54M-82.4%
which-transformer-to-favor-a-comparative---71.53%
fixing-the-train-test-resolution-discrepancy-2-7.8M-82.6%
automix-unveiling-the-power-of-mixup-11.7M-72.05%
deepvit-towards-deeper-vision-transformer---83.1%
unireplknet-a-universal-perception-large---77%
knowledge-distillation-a-good-teacher-is-82.8%
cas-vit-convolutional-additive-self-attention-5.76M-81.1%
mobilevitv3-mobile-friendly-vision-3M-76.55%
internimage-exploring-large-scale-vision-30M-83.5%
meta-pseudo-labels---83.2%
training-data-efficient-image-transformers-87M-85.2%
mobilevitv3-mobile-friendly-vision-2.5M-76.7%
vicinity-vision-transformer-61.8M-84.7%
a-dot-product-attention-free-transformer-23M-80.1%
regularized-evolution-for-image-classifier-469M-83.9%
an-image-is-worth-16x16-words-transformers-1---88.55%
coatnet-marrying-convolution-and-attention-168M-84.5%
evo-vit-slow-fast-token-evolution-for-dynamic-39.6M-82.2%
swin-transformer-v2-scaling-up-capacity-and-88M-87.1%
sequencer-deep-lstm-for-image-classification-38M-82.8%
maxvit-multi-axis-vision-transformer---88.51%
cspnet-a-new-backbone-that-can-enhance-20.5M-79.8%
clcnet-rethinking-of-ensemble-modeling-with---86.46%
spatial-group-wise-enhance-improving-semantic-44.55M-78.798%
2103-15358-79M-81.9%
mobilenetv4-universal-models-for-the-mobile---83.4%
efficient-multi-order-gated-aggregation-44M-84.3%
revbifpn-the-fully-reversible-bidirectional-82M-83.7%
a-fast-knowledge-distillation-framework-for---80.1%
uninet-unified-architecture-search-with-22.5M-82.7%
collaboration-of-experts-achieving-80-top-1-95.3M-80.7%
self-training-with-noisy-student-improves-51800G-480M-88.4%
gpipe-efficient-training-of-giant-neural---84.4%
proxylessnas-direct-neural-architecture-4.0M-74.6%
densenets-reloaded-paradigm-shift-beyond-24M-82.8%
efficient-multi-order-gated-aggregation-181M-87.8%
go-wider-instead-of-deeper-63M-80.09%
a-convnet-for-the-2020s-29M-82.1%
rethinking-the-design-principles-of-robust-91.8M-82.7%
densenets-reloaded-paradigm-shift-beyond-186M-84.8%
deit-iii-revenge-of-the-vit---86.7%
metaformer-baselines-for-vision-100M-84.8%
training-data-efficient-image-transformers-5M-76.6%
vitae-vision-transformer-advanced-by-6.5M-77.9%
harmonic-convolutional-networks-based-on-88.2M-82.85%
bottleneck-transformers-for-visual-75.1M-84.7%
bottleneck-transformers-for-visual---84.2%
the-effectiveness-of-mae-pre-pretraining-for---88.8%
resmlp-feedforward-networks-for-image-30M-80.8%
hyenapixel-global-image-context-with---83.6%
mambavision-a-hybrid-mamba-transformer-vision-50.1M-83.3%
resnet-strikes-back-an-improved-training-60.2M-81.8%
multiscale-vision-transformers-37M-83.0%
multimodal-autoregressive-pre-training-of-300M-86.6%
sharpness-aware-minimization-for-efficiently-1-480M-88.61%
bottleneck-transformers-for-visual-54.7M-82.8%
internimage-exploring-large-scale-vision-223M-87.7%
nasvit-neural-architecture-search-for---82.9%
ghostnetv3-exploring-the-training-strategies---77.1%
fbnetv5-neural-architecture-search-for---82.6%
lets-keep-it-simple-using-simple-5.7M-79.12%
levit-a-vision-transformer-in-convnet-s-10.4M-80%
transnext-robust-foveal-visual-perception-for-49.7M-84.7%
the-effectiveness-of-mae-pre-pretraining-for-2000M-89.8%
metaformer-baselines-for-vision-57M-86.9%
hyenapixel-global-image-context-with---83.2%
meta-pseudo-labels-95040G-480M-90.2%
improved-multiscale-vision-transformers-for-24M-82.3%
coatnet-marrying-convolution-and-attention-25M-81.6%
data2vec-a-general-framework-for-self-1-656M-86.6%
internimage-exploring-large-scale-vision-1080M-89.6%
colormae-exploring-data-independent-masking---83.8%
maxvit-multi-axis-vision-transformer---86.19%
kolmogorov-arnold-transformer-86.6M-79.1%
cvt-introducing-convolutions-to-vision---83.3%
vision-transformer-with-deformable-attention-88M-84.8%
dilated-neighborhood-attention-transformer-28M-82.7%
which-transformer-to-favor-a-comparative---80.66%
hiera-a-hierarchical-vision-transformer---86.9%
an-improved-one-millisecond-mobile-backbone-7.8M-79.1%
metaformer-baselines-for-vision-27M-84.4%
multigrain-a-unified-image-embedding-for---81.3%
xcit-cross-covariance-image-transformers-26M-85.1%
residual-attention-network-for-image---80.5%
fairnas-rethinking-evaluation-fairness-of-4.5M-75.10%
beyond-self-attention-external-attention-81.7%
minivit-compressing-vision-transformers-with-47M-85.5%
learned-queries-for-efficient-local-attention-16M-81.7%
token-labeling-training-a-85-5-top-1-accuracy-151M-86.4%
meal-v2-boosting-vanilla-resnet-50-to-80-top-25.6M-81.72%
how-to-use-dropout-correctly-on-residual---79.152%
densenets-reloaded-paradigm-shift-beyond-186M-85.8%
cvt-introducing-convolutions-to-vision---82.5%
activemlp-an-mlp-like-architecture-with-76.4M-84.8%
moga-searching-beyond-mobilenetv3-5.1M-0.0304G-75.9%
mlp-mixer-an-all-mlp-architecture-for-vision-87.94%
fixing-the-train-test-resolution-discrepancy-2-12M-85%
mixnet-mixed-depthwise-convolutional-kernels-5.0M-77%
learned-queries-for-efficient-local-attention-56M-83.7%
ghostnetv3-exploring-the-training-strategies---80.4%
ghostnetv3-exploring-the-training-strategies---79.1%
190409460---79.03%
searching-for-mobilenetv3-5.4M-75.2%
clcnet-rethinking-of-ensemble-modeling-with---85.28%
convit-improving-vision-transformers-with-6M-73.1%
eca-net-efficient-channel-attention-for-deep-42.49M-78.65%
metaformer-baselines-for-vision-99M-85.5%
fastervit-fast-vision-transformers-with-957.5M-85.6%
transboost-improving-the-best-imagenet-22.05M-83.67%
fastervit-fast-vision-transformers-with-31.4M-82.1%
hyenapixel-global-image-context-with---85.2%
which-transformer-to-favor-a-comparative---82.22%
resnet-strikes-back-an-improved-training-25M-78.1%
balanced-binary-neural-networks-with-gated---62.6%
from-xception-to-nexception-new-design---81.5%
gswin-gated-mlp-vision-model-with-21.8M-81.71%
semantic-aware-local-global-vision-6.5M-75.9%
mobilenetv2-inverted-residuals-and-linear-3.4M-72%
incepformer-efficient-inception-transformer-39.3M-83.6%
volo-vision-outlooker-for-visual-recognition-193M-86.8%
tiny-models-are-the-computational-saver-for---85.74%
which-transformer-to-favor-a-comparative---83.09%
metaformer-baselines-for-vision-99M-88.1%
multimodal-autoregressive-pre-training-of-1200M-88.1%
revisiting-unreasonable-effectiveness-of-data---79.2%
going-deeper-with-image-transformers-12M-80.9%
pvtv2-improved-baselines-with-pyramid-vision-82M-83.8%
hvt-a-comprehensive-vision-framework-for---85%
metaformer-baselines-for-vision-99M-87.4%
averaging-weights-leads-to-wider-optima-and---78.94%
differentiable-top-k-classification-learning-1---88.37%
fixing-the-train-test-resolution-discrepancy-2-43M-86.7%
fastvit-a-fast-hybrid-vision-transformer---84.5%
coatnet-marrying-convolution-and-attention---87.1%
sp-vit-learning-2d-spatial-priors-for-vision---86%
mobilenetv2-inverted-residuals-and-linear-6.9M-74.7%
mobilenetv4-universal-models-for-the-mobile---82.9%
masked-autoencoders-are-scalable-vision---83.6%
vitae-vision-transformer-advanced-by-19.2M-82.2%
convit-improving-vision-transformers-with-27M-81.3%
fastervit-fast-vision-transformers-with-1360M-85.8%
resnet-strikes-back-an-improved-training-25M-80.4%
training-data-efficient-image-transformers-22M-82.6%
incepformer-efficient-inception-transformer-24.3M-82.9%
soft-conditional-computation---78.3%
transnext-robust-foveal-visual-perception-for-28.2M-84.0%
go-wider-instead-of-deeper-29M-77.54%
multigrain-a-unified-image-embedding-for---82.7%
an-improved-one-millisecond-mobile-backbone-4.8M-77.4%
large-scale-learning-of-general-visual---87.54%
augmenting-convolutional-networks-with-25.2M-82.1%
dynamic-convolution-attention-over-42.7M-72.7%
efficientnetv2-smaller-models-and-faster---83.9%
fastvit-a-fast-hybrid-vision-transformer---79.1%
revit-enhancing-vision-transformers-with---82.4%
bottleneck-transformers-for-visual-33.5M-81.7%
self-knowledge-distillation-a-simple-way-for---79.24%
resmlp-feedforward-networks-for-image-116M-83.6%
tinyvit-fast-pretraining-distillation-for-5.4M-79.1%
190409460---79.38%
densely-connected-convolutional-networks---77.85%
augmenting-convolutional-networks-with-47.7M-83.2%
fast-vision-transformers-with-hilo-attention---83.6%
glit-neural-architecture-search-for-global-24.6M-80.5%
fixing-the-train-test-resolution-discrepancy-2-5.3M-80.2%
sharpness-aware-minimization-for-efficiently-1---81.6%
high-performance-large-scale-image-377.2M-86.3%
tokens-to-token-vit-training-vision-39.2M-82.2%
densely-connected-search-space-for-more---75.9%
the-effectiveness-of-mae-pre-pretraining-for---86.8%
davit-dual-attention-vision-transformers-1437M-90.4%
tinyvit-fast-pretraining-distillation-for-21M-86.5%
not-all-images-are-worth-16x16-words-dynamic---79.74%
graph-convolutions-enrich-the-self-attention---81.5%
efficientnetv2-smaller-models-and-faster-208M-87.3%
res2net-a-new-multi-scale-backbone---81.23%
internimage-exploring-large-scale-vision-335M-88%
fastvit-a-fast-hybrid-vision-transformer---75.6%
torchdistill-a-modular-configuration-driven---71.37%
adversarial-autoaugment-1---81.32%
bottleneck-transformers-for-visual-25.5M-78.8%
efficientnetv2-smaller-models-and-faster-120M-86.8%
rexnet-diminishing-representational-7.6M-79.5%
designing-network-design-spaces-20.6M-79.4%
maxvit-multi-axis-vision-transformer---88.82%
revisiting-weakly-supervised-pre-training-of-633.5M-88.6%
clcnet-rethinking-of-ensemble-modeling-with---86.61%
efficientnet-rethinking-model-scaling-for-30M-83.3%
unireplknet-a-universal-perception-large---78.6%
masked-autoencoders-are-scalable-vision---86.9%
lambdanetworks-modeling-long-range-142M-84.3%
three-things-everyone-should-know-about---82.3%
repmlpnet-hierarchical-vision-mlp-with-re---81.8%
densenets-reloaded-paradigm-shift-beyond-87M-84.4%
rethinking-and-improving-relative-position-87M-82.4%
metaformer-baselines-for-vision-57M-85.6%
metaformer-baselines-for-vision-39M-85.7%
maxvit-multi-axis-vision-transformer---88.38%
meta-knowledge-distillation---77.1%
sequencer-deep-lstm-for-image-classification-54M-84.6%
Model 1025---78.8%
fixing-the-train-test-resolution-discrepancy-2-30M-86.4%
meal-v2-boosting-vanilla-resnet-50-to-80-top---73.19%
meta-knowledge-distillation---86.5%
transboost-improving-the-best-imagenet-28.59M-82.46%
improved-multiscale-vision-transformers-for-218M-88.4%
aggregated-residual-transformations-for-deep-83.6M-80.9%
deeper-vs-wider-a-revisit-of-transformer---86.3%
an-improved-one-millisecond-mobile-backbone-7.8M-77.4%
deep-polynomial-neural-networks---77.17%
improving-vision-transformers-by-revisiting-295.5M-87.3%
scalable-visual-transformers-with-21.74M-78.00%
revbifpn-the-fully-reversible-bidirectional-19.6M-81.1%
localvit-bringing-locality-to-vision-5.9M-74.8%
contextual-transformer-networks-for-visual-23.1M-81.6%
maxvit-multi-axis-vision-transformer---89.36%
mlp-mixer-an-all-mlp-architecture-for-vision-46M-76.44%
rexnet-diminishing-representational-4.1M-77.2%
an-intriguing-failing-of-convolutional-neural---75.74%
efficientvit-enhanced-linear-attention-for-24M-82.7%
scaling-vision-with-sparse-mixture-of-experts-3400M-87.41%
lip-local-importance-based-pooling-8.7M-76.64%
vitae-vision-transformer-advanced-by-48.5M-83.6%
the-effectiveness-of-mae-pre-pretraining-for-650M-89.5%
multimodal-autoregressive-pre-training-of-600M-87.5%
going-deeper-with-image-transformers-26.6M-84.1%
visual-parser-representing-part-whole---84.2%
augmenting-convolutional-networks-with-334.3M-87.1%
learnable-polynomial-trigonometric-and-28M-82.34%
sliced-recursive-transformer-1-71.2M-84.8%
contextual-transformer-networks-for-visual-40.9M-83.2%
model-rubik-s-cube-twisting-resolution-depth-5.1M-77.7%
augmenting-convolutional-networks-with-25.2M-85.4%
meta-knowledge-distillation---85.1%
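
To analyse these results programmatically, the rows above can be exported and ranked. A minimal sketch, assuming the five columns have been saved to a hypothetical CSV file named imagenet_results.csv with empty cells left as "-":

```python
# Minimal sketch: rank exported leaderboard rows by Top 1 Accuracy under a
# parameter budget. Assumes a hypothetical CSV export named
# "imagenet_results.csv" with the column names used above; "-" marks empty cells.
import csv

def load_rows(path: str):
    rows = []
    with open(path, newline="") as fh:
        for row in csv.DictReader(fh):
            acc = row.get("Top 1 Accuracy", "-").rstrip("%")
            params = row.get("Number of params", "-").rstrip("M")
            if acc in ("", "-"):
                continue  # skip entries without a reported accuracy
            rows.append({
                "model": row["Model Name"],
                "params_m": float(params) if params not in ("", "-") else None,
                "top1": float(acc),
            })
    return rows

def best_under_budget(rows, max_params_m: float, k: int = 5):
    """Top-k models by Top 1 Accuracy among those within the parameter budget."""
    eligible = [r for r in rows if r["params_m"] is not None and r["params_m"] <= max_params_m]
    return sorted(eligible, key=lambda r: r["top1"], reverse=True)[:k]

if __name__ == "__main__":
    rows = load_rows("imagenet_results.csv")
    for r in best_under_budget(rows, max_params_m=30.0):
        print(f'{r["model"]}: {r["top1"]:.2f}% top-1 with {r["params_m"]}M params')
```

Ranking within a parameter budget is one simple way to trade Top 1 Accuracy against model size using only the columns reported in this table.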