HyperAI
Visual Question Answering on MM-Vet

Metric: GPT-4 score

Results: performance of various models on this benchmark.
| Model name | GPT-4 score | Paper Title | Repository |
|---|---|---|---|
| SoM-LLaVA-1.5-T | 37.2 | List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs | |
| LLaVA-1.5-7B (VG-S) | 40.4 | ProVision: Programmatically Scaling Vision-centric Instruction Data for Multimodal Language Models | - |
| MMCTAgent (GPT-4 + GPT-4V) | 74.24 | MMCTAgent: Multi-modal Critical Thinking Agent Framework for Complex Visual Reasoning | - |
| DeepSeek-VL | 41.5 | DeepSeek-VL: Towards Real-World Vision-Language Understanding | |
| VOLCANO 13B | 38.0 | Volcano: Mitigating Multimodal Hallucination through Self-Feedback Guided Revision | |
| LOVA$^3$ | 35.2 | LOVA3: Learning to Visual Question Answering, Asking and Assessment | |
| Qwen2-VL-2B (finetuned on GAP-VQA train) | 52.43 | Gamified crowd-sourcing of high-quality data for visual fine-tuning | - |
| mPLUG-Owl2 | 36.3±0.1 | mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration | |
| LLaVA-Plus-7B (All Tools) | 27.5±0.3 | LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents | |
| InternLM2+ViT (QMoSLoRA) | 35.2 | Mixture-of-Subspaces in Low-Rank Adaptation | |
| JanusFlow | 30.9 | JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation | |
| ConvLLaVA | 45.9 | ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models | |
| Mini-Gemini | 53.0 | Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models | |
| InfiMM-HD | 38.9 | InfiMM-HD: A Leap Forward in High-Resolution Multimodal Understanding | - |
| InternVL2-26B (SGP, token ratio 9%) | 52.10 | A Stitch in Time Saves Nine: Small VLM is a Precise Guidance for Accelerating Large VLMs | - |
| LLaVA-1.5-LLaMA3-8B | 37.8 | What If We Recaption Billions of Web Images with LLaMA-3? | - |
| InternVL2-Llama3-76B | 64.4 | - | - |
| LLaVA-HR-X | 35.5 | Feast Your Eyes: Mixture-of-Resolution Adaptation for Multimodal Large Language Models | |
| CoLLaVO | 40.3 | CoLLaVO: Crayon Large Language and Vision mOdel | - |
| Silkie | 49.9 | Silkie: Preference Distillation for Large Visual Language Models | - |
The full leaderboard contains 229 entries; the table above shows a subset.
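As an illustration only, the sketch below loads a handful of the rows above into a small Python data structure and ranks them by GPT-4 score. The `Entry` dataclass and the hand-copied subset of rows are assumptions made for this example; this is not an official HyperAI API or export format.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    model: str
    gpt4_score: float  # MM-Vet GPT-4 score (higher is better)


# A hand-copied subset of the rows above; values come straight from the table.
rows = [
    Entry("MMCTAgent (GPT-4 + GPT-4V)", 74.24),
    Entry("InternVL2-Llama3-76B", 64.4),
    Entry("Mini-Gemini", 53.0),
    Entry("Qwen2-VL-2B (finetuned on GAP-VQA train)", 52.43),
    Entry("DeepSeek-VL", 41.5),
    Entry("JanusFlow", 30.9),
]

# Rank the entries from highest to lowest GPT-4 score and print them.
for rank, entry in enumerate(
    sorted(rows, key=lambda e: e.gpt4_score, reverse=True), start=1
):
    print(f"{rank}. {entry.model}: {entry.gpt4_score}")
```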