Qwen2-VL-2B (finetuned on GAP-VQA train) | 52.43 | Gamified crowd-sourcing of high-quality data for visual fine-tuning | |
LLaVA-Plus-7B (All Tools) | 27.5±0.3 | LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents | |
InternLM2+ViT (QMoSLoRA) | 35.2 | Mixture-of-Subspaces in Low-Rank Adaptation | |