HyperAI超神経

Visual Question Answering on MM-Vet

Evaluation Metric

GPT-4 score
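
As a minimal sketch of how this metric is aggregated: MM-Vet's published protocol has GPT-4 grade each model answer with a score in [0, 1], and the leaderboard reports the mean grade scaled to 0-100. The function name and sample grades below are illustrative, not taken from the official evaluation code.

```python
def mm_vet_score(per_question_grades):
    """Aggregate GPT-4-assigned per-question grades (each in [0, 1])
    into the leaderboard metric: mean grade scaled to 0-100."""
    if not per_question_grades:
        raise ValueError("no grades given")
    return 100.0 * sum(per_question_grades) / len(per_question_grades)

# Hypothetical grades for four questions:
print(round(mm_vet_score([1.0, 0.5, 0.0, 0.9]), 2))  # → 60.0
```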

Evaluation Results

Performance of each model on this benchmark

| Model Name | GPT-4 score | Paper Title | Repository |
| --- | --- | --- | --- |
| SoM-LLaVA-1.5-T | 37.2 | List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs | |
| LLaVA-1.5-7B (VG-S) | 40.4 | ProVision: Programmatically Scaling Vision-centric Instruction Data for Multimodal Language Models | - |
| MMCTAgent (GPT-4 + GPT-4V) | 74.24 | MMCTAgent: Multi-modal Critical Thinking Agent Framework for Complex Visual Reasoning | - |
| DeepSeek-VL | 41.5 | DeepSeek-VL: Towards Real-World Vision-Language Understanding | |
| VOLCANO 13B | 38.0 | Volcano: Mitigating Multimodal Hallucination through Self-Feedback Guided Revision | |
| LOVA³ | 35.2 | LOVA3: Learning to Visual Question Answering, Asking and Assessment | |
| Qwen2-VL-2B (finetuned on GAP-VQA train) | 52.43 | Gamified crowd-sourcing of high-quality data for visual fine-tuning | - |
| mPLUG-Owl2 | 36.3±0.1 | mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration | |
| LLaVA-Plus-7B (All Tools) | 27.5±0.3 | LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents | |
| InternLM2+ViT (QMoSLoRA) | 35.2 | Mixture-of-Subspaces in Low-Rank Adaptation | |
| JanusFlow | 30.9 | JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation | |
| ConvLLaVA | 45.9 | ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models | |
| Mini-Gemini | 53.0 | Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models | |
| InfiMM-HD | 38.9 | InfiMM-HD: A Leap Forward in High-Resolution Multimodal Understanding | - |
| InternVL2-26B (SGP, token ratio 9%) | 52.10 | A Stitch in Time Saves Nine: Small VLM is a Precise Guidance for Accelerating Large VLMs | - |
| LLaVA-1.5-LLaMA3-8B | 37.8 | What If We Recaption Billions of Web Images with LLaMA-3? | - |
| InternVL2-Llama3-76B | 64.4 | - | - |
| LLaVA-HR-X | 35.5 | Feast Your Eyes: Mixture-of-Resolution Adaptation for Multimodal Large Language Models | |
| CoLLaVO | 40.3 | CoLLaVO: Crayon Large Language and Vision mOdel | - |
| Silkie | 49.9 | Silkie: Preference Distillation for Large Visual Language Models | - |