LLaVA-NeXT-Inst-IT-Qwen2-7B (Visual Prompt) | 50.5 | 49.0 | Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning | - |
GPT-4V-turbo-detail:high (Visual Prompt) | 60.7 | 59.9 | GPT-4 Technical Report | |
LLaVA-1.5-13B (Visual Prompt) | 41.8 | 42.9 | Improved Baselines with Visual Instruction Tuning | |
LLaVA-1.5-13B (Coordinates) | 47.1 | - | Improved Baselines with Visual Instruction Tuning | |
GPT-4V-turbo-detail:low (Visual Prompt) | 52.8 | 51.4 | GPT-4 Technical Report | |
LLaVA-NeXT-Inst-IT-Vicuna-7B (Visual Prompt) | 45.1 | 48.2 | Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning | - |
ViP-LLaVA-13B (Visual Prompt) | 48.3 | 48.2 | Making Large Multimodal Models Understand Arbitrary Visual Prompts | |