InternVL-G-FT (finetuned, w/o ranking) | 97.9 | 100 | 100 | InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks | - |
BLIP-2 ViT-L (zero-shot, 1K test set) | 96.9 | 100 | 100 | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | - |
InternVL-C-FT (finetuned, w/o ranking) | 97.2 | 100 | 100 | InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks | - |
BLIP-2 ViT-G (zero-shot, 1K test set) | 97.6 | 100 | 100 | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | - |
ONE-PEACE (finetuned, w/o ranking) | 97.6 | 100 | 100 | ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities | - |