HyperAI
Visual Question Answering on MM-Vet v2
Metric: GPT-4 score

Performance results of various models on this benchmark:
| Model Name | GPT-4 score | Paper Title | Repository |
| --- | --- | --- | --- |
| InternVL-Chat-V1-2 | 45.5±0.1 | - | - |
| InternVL2-Llama3-76B | 68.4±0.3 | - | - |
| Claude 3.5 Sonnet (claude-3-5-sonnet-20240620) | 71.8±0.2 | Claude 3.5 Sonnet Model Card Addendum | - |
| CogVLM-Chat | 45.1±0.2 | CogVLM: Visual Expert for Pretrained Language Models | - |
| LLaVA-NeXT-34B | 50.9±0.1 | - | - |
| Qwen-VL-Max | 55.8±0.2 | Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond | - |
| Emu2-Chat | 38.0±0.1 | Generative Multimodal Models are In-Context Learners | - |
| Otter-9B | 23.2±0.1 | MIMIC-IT: Multi-Modal In-Context Instruction Tuning | - |
| gemini-2.0-flash-exp | 77.1±0.1 | - | - |
| InternVL2-40B | 63.8±0.2 | - | - |
| Gemini Pro Vision | 57.2±0.2 | Gemini: A Family of Highly Capable Multimodal Models | - |
| OpenFlamingo-9B | 17.6±0.2 | OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models | - |
| GPT-4o (gpt-4o-2024-11-20) | 72.1±0.2 | GPT-4 Technical Report | - |
| LLaVA-v1.5-13B | 33.2±0.1 | Improved Baselines with Visual Instruction Tuning | - |
| Claude 3 Opus (claude-3-opus-20240229) | 55.8±0.2 | - | - |
| InternVL-Chat-V1-5 | 51.5±0.2 | How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites | - |
| IXC2-VL-7B | 42.5±0.3 | InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model | - |
| GPT-4o (gpt-4o-2024-05-13) | 71.0±0.2 | GPT-4 Technical Report | - |
| LLaVA-v1.5-7B | 28.3±0.2 | Improved Baselines with Visual Instruction Tuning | - |
| Qwen2-VL-72B (qwen-vl-max-0809) | 66.9±0.3 | Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution | - |
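The scores above are reported as mean±standard deviation of the GPT-4 score over repeated evaluations. A minimal sketch of how such entries might be parsed and ranked, using a small subset of the models from the table (the `parse_score` helper and dictionary layout are illustrative assumptions, not part of the benchmark's tooling):

```python
# Illustrative subset of the MM-Vet v2 leaderboard, taken from the table above.
# Scores are "mean±std" strings as displayed on the page.
leaderboard = {
    "gemini-2.0-flash-exp": "77.1±0.1",
    "GPT-4o (gpt-4o-2024-11-20)": "72.1±0.2",
    "Claude 3.5 Sonnet (claude-3-5-sonnet-20240620)": "71.8±0.2",
    "GPT-4o (gpt-4o-2024-05-13)": "71.0±0.2",
    "Qwen2-VL-72B (qwen-vl-max-0809)": "66.9±0.3",
    "Otter-9B": "23.2±0.1",
}

def parse_score(s: str) -> tuple[float, float]:
    """Split a 'mean±std' string into (mean, std) floats."""
    mean, std = s.split("±")
    return float(mean), float(std)

# Rank models by mean GPT-4 score, highest first.
ranked = sorted(leaderboard.items(),
                key=lambda kv: parse_score(kv[1])[0],
                reverse=True)

for rank, (model, score) in enumerate(ranked, start=1):
    print(f"{rank:2d}. {model}: {score}")
```

The ± interval here reflects run-to-run variance of the GPT-4 grader, so rankings between models whose intervals overlap should be read with caution.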