ONE-PEACE (ViT-G, w/o ranking) | 84.1 | 98.3 | 96.3 | ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities | |
SigLIP (ViT-L, zero-shot) | 70.6 | - | - | Sigmoid Loss for Language Image Pre-Training | |
FLAVA (ViT-B, zero-shot) | 42.74 | - | 76.76 | FLAVA: A Foundational Language And Vision Alignment Model | |