DINOv2 distilled (ViT-S/14) | 21M | 81.1% | DINOv2: Learning Robust Visual Features without Supervision | |
iGPT-XL (64x64, 3072 features) | 6800M | 68.7% | Generative Pretraining from Pixels | |
iGPT-L (48x48) | 1400M | 65.2% | Generative Pretraining from Pixels | |
DINOv2 distilled (ViT-B/14) | 85M | 84.5% | DINOv2: Learning Robust Visual Features without Supervision | |