Efficient ViTs | SOTA | HyperAI

Efficient ViTs aim to reduce the computational cost of Vision Transformers (ViTs) without altering the Transformer architecture. The main techniques are key and query sparsification, token pruning, and token merging. These techniques can substantially reduce computation and memory consumption while largely maintaining model accuracy, which speeds up training and inference on large-scale datasets. They are suited to real-time image processing and computer vision tasks in resource-constrained environments.
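As a rough illustration of token pruning, the sketch below keeps the class token plus the patch tokens with the highest importance scores and drops the rest. It is a minimal, library-free example, not the method of any particular paper: the function name `prune_tokens`, the `keep_ratio` parameter, and the use of class-token attention as the importance score are assumptions made for illustration.

```python
import math
from typing import List, Sequence


def prune_tokens(
    tokens: Sequence[Sequence[float]],
    cls_attention: Sequence[float],
    keep_ratio: float,
) -> List[Sequence[float]]:
    """Keep the class token and the top-scoring patch tokens.

    tokens        : token embeddings; index 0 is assumed to be the class token.
    cls_attention : one importance score per patch token (hypothetical choice:
                    attention from the class token to each patch).
    keep_ratio    : fraction of patch tokens to keep, in (0, 1].
    """
    num_patches = len(tokens) - 1
    if len(cls_attention) != num_patches:
        raise ValueError("expected one score per patch token")
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")

    num_keep = max(1, math.ceil(keep_ratio * num_patches))

    # Rank patch indices by score, highest first.
    ranked = sorted(range(num_patches), key=lambda i: cls_attention[i], reverse=True)

    # Restore the original spatial order of the surviving patches.
    kept = sorted(ranked[:num_keep])

    # Patch i lives at tokens[i + 1] because tokens[0] is the class token.
    return [tokens[0]] + [tokens[i + 1] for i in kept]


def attention_cost_ratio(tokens_before: int, tokens_after: int) -> float:
    """Relative cost of the attention matrix, which grows with the square of the token count."""
    return (tokens_after / tokens_before) ** 2
```

Because the attention matrix has one entry per pair of tokens, halving the token count cuts that part of the computation to about a quarter, which is why dropping or merging tokens pays off without any change to the Transformer blocks themselves.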

ImageNet-1K (with DeiT-S)

ImageNet-1K (with DeiT-T)

ImageNet-1K (with LV-ViT-S)